Removing XML subelement tags with Python using elementTree and .remove()
S

4

7

I need help adjusting my XML file with Python and the elementTree library.

For some background, I am not a student and work in industry. I hope to save myself a great deal of manual effort by making these changes automated and typically I would have just done this in a language such as C++ that I am more familiar with. However, there is a push to use Python in my group so I am using this as both a functional and learning exercise.

Could you please correct my use of terms and my understanding? I don't just want the code to work; I want to know that my understanding of how it works is correct.

The Problem itself:

Goal: remove the sub-element "weight" from the XML file.

Using the xml code (let's just say it is called "example.xml"):

<XML_level_1 created="2014-08-19 16:55:02" userID="User@company">
<XML_level_2 manufacturer="company" number="store-25235">
  <padUnits value="mm" />
  <partDescription value="Part description explained here" />
  <weight value="5.2" />
</XML_level_2>
</XML_level_1>

Thus far, I have the following code:

from xml.etree import ElementTree

current_xml_tree = ElementTree.parse(filename_path) # Path to example.xml

current_xml_root = current_xml_tree.getroot()
current_xml_level_2_node = current_xml_root.findall('XML_level_2')

# Extract "weight" value for later use
for weight_value_elem in current_xml_root.iter('weight'):
    weight_value = weight_value_elem.get('value')

# Remove weight sub-element from XML
# -------------------------------------

# Get all nodes entitled 'weight' from element
weight_nodes = current_xml_root.findall('weight')
print weight_nodes     # result is an empty list

print weight_value_elem    # Location of element 'weight' is listed

for weight_node_loc in current_xml_tree.iter('weight'):
    print "for-loop check : loop has been entered"

    current_xml_tree.getroot().remove(weight_value_elem)
    print "for-loop has been processed"

print "Weight line removed from ", filename_path

# Write changes to XML File:
current_xml_tree.write(filename_path)

I have read this helpful resource, but have reached a point where I am stuck.

Second question: What is the relation of nodes and elements in this context?

I come from a finite element background, where nodes are understood as part of an element, defining portions / corner boundaries of what creates an element. However, am I wrong in thinking the terminology is used differently here so that nodes are not a subset of elements? Are the two terms still related in a similar way?

Stableboy answered 20/5, 2016 at 0:1 Comment(5)
FYI: en.wikipedia.org/wiki/XML#Key_terminologyBorkowski
Thank you for the link, @Robᵩ ! :) I was just looking into this more, since the terminology differences are nearly opposite. Seeing another link on Stackoverflow, I found this post where a quoted post refers to this as: "The same as between fruit and apple. Every XmlElement is XmlNode, but not every XmlNode is XmlElement. XmlElement is just one kind of XmlNode. Others are XmlAttribute, XmlText etc." So basically, elements in XML are always a subset of nodes?Stableboy
That post explains what "Node" means in the DOM. You aren't using DOM, so its explanation doesn't apply. To the best of my knowledge, "node" doesn't have a technical meaning in either XML or the xml.etree.ElementTree API. The ElementTree API docs do use the word, but only in the graph-theory sense: a tree consists of a hierarchy of nodes connected by parent-child relationships. In the ElementTree API, the tree represents the structure of the XML doc, and each node represents an XML element.Borkowski
Also, since you are using Python 2.7, you might find the official 2.7 ElementTree documentation more useful than the unofficial 3.4 version you linked to. Official 2.7: docs.python.org/2/library/xml.etree.elementtree.html Official 3.5: docs.python.org/3.5/library/xml.etree.elementtree.htmlBorkowski
Ah, okay! Using it that way in a graph-theory setting makes much more sense. I'll definitely tread carefully where terminology overlaps in the future, just to be safe. Thank you for the explanation and links!Stableboy
M
13

Removing an element from a tree, regardless of its location in the tree, is needlessly complicated by the ElementTree API. Specifically, no element knows its own parent, so we have to discover that relationship "by hand."

from xml.etree import ElementTree
XML = '''
    <XML_level_1 created="2014-08-19 16:55:02" userID="User@company">
    <XML_level_2 manufacturer="company" number="store-25235">
      <padUnits value="mm" />
      <partDescription value="Part description explained here" />
      <weight value="5.2" />
    </XML_level_2>
    </XML_level_1>
'''

# parse the XML into a tree
root = ElementTree.XML(XML)

# Alternatively, parse the XML that lives in 'filename_path'
# tree = ElementTree.parse(filename_path)
# root = tree.getroot()

# Find the parent element of each "weight" element, using XPATH
for parent in root.findall('.//weight/..'):
    # Find each weight element
    for element in parent.findall('weight'):
        # Remove the weight element from its parent element
        parent.remove(element)

# tostring() returns a byte string; decode it so print() works on Python 2 and 3
print(ElementTree.tostring(root).decode())
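
Another way to discover the parent "by hand" is to build a child-to-parent lookup once and reuse it. The following is a minimal sketch of that idea, not part of the original answer; the name parent_of is only illustrative.

```python
from xml.etree import ElementTree

XML = '''
    <XML_level_1 created="2014-08-19 16:55:02" userID="User@company">
    <XML_level_2 manufacturer="company" number="store-25235">
      <padUnits value="mm" />
      <partDescription value="Part description explained here" />
      <weight value="5.2" />
    </XML_level_2>
    </XML_level_1>
'''
root = ElementTree.fromstring(XML)

# Map every element to its parent (parent_of is an illustrative name).
# Build the map before changing the tree, since it is a snapshot.
parent_of = {child: parent for parent in root.iter() for child in parent}

# findall() returns a list, so removing elements inside the loop is safe
for weight in root.findall('.//weight'):
    parent_of[weight].remove(weight)

# Tags left in the tree, in document order
remaining = [elem.tag for elem in root.iter()]
print(remaining)
```

The map costs one full pass over the tree, so it pays off mainly when several different kinds of elements have to be removed or moved.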

If you can switch to lxml, the loop is slightly less cumbersome:

# "tree" is the parsed document; lxml elements know their parent
for weight in tree.findall(".//weight"):
    weight.getparent().remove(weight)

As to your second question, the ElementTree documentation uses "node" more or less interchangeably with "element." More specifically, it appears to use the word "node" to refer either to a Python object of type "Element" or to the XML element that such an object represents.
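
To make the terminology difference concrete, here is a small sketch (my own illustration, using only the standard library) that parses the same snippet with ElementTree and with the DOM implementation xml.dom.minidom. In ElementTree, text and attributes are plain Python values stored on an Element; in the DOM, they are nodes of their own kind.

```python
from xml.dom import Node, minidom
from xml.etree import ElementTree

XML = '<a><b value="1">text</b></a>'

# ElementTree: only elements appear as objects in the tree
et_root = ElementTree.fromstring(XML)
et_b = et_root.find('b')
et_text = et_b.text          # a plain string
et_attrs = dict(et_b.attrib) # a plain dictionary
et_children = list(et_b)     # no child elements, so this is empty

# DOM: the text and the attribute are nodes in their own right
dom_b = minidom.parseString(XML).documentElement.firstChild
dom_text_node = dom_b.firstChild
dom_attr_node = dom_b.getAttributeNode('value')

print(et_text, et_attrs, et_children)
print(dom_text_node.nodeType == Node.TEXT_NODE)
print(dom_attr_node.nodeType == Node.ATTRIBUTE_NODE)
```

So the "every element is a node, but not every node is an element" description quoted in the comments applies to the DOM half of this sketch, not to the ElementTree half.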

Middlesworth answered 20/5, 2016 at 0:25 Comment(4)
Hi @Middlesworth First off, thank you for taking the time to reply to this level of depth and explanation so quickly. It is much appreciated and definitely helps me to learn rather than just patch the problem and move on. Just to make sure I'm understanding correctly of the method you showed using XPATH: for parent in root.findall('.//weight/..'): * This code line takes './/weight/..' with the understanding of taking the current location, selecting the parent element, finds and selects 'weight', and then weight's node.Stableboy
(continued) Having selected the parent element, since Python will not inherently know this, we then have to select the actual element for 'weight' before it can be removed. I can definitely see how using the lxml library would be much less complicated than this way. It seems more intuitive for a language like Python, whereas if I wanted true speed I would use C or even Fortran.Stableboy
Also, thank you again for explaining more about 'nodes' and 'elements' in this context! That is the kind of information that is rather hard to gain when you're teaching yourself.Stableboy
lxml is definitely a good solution. I wasn't aware of its existence, so thank you for helping me discover it.Thinking
N
5

Your problem is that node.remove() only removes direct subelements of node. In the XML file you posted, the weight element is not a direct subelement of XML_level_1 but of XML_level_2, so calling remove() on the root cannot find it. In addition, the way ElementTree is implemented, there seems to be no link from a child back to its parent, so the parent has to be located separately.
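
The following short sketch (my own illustration of the point above, reusing the XML from the question) shows the behaviour: findall('weight') on the root only looks at direct children, and remove() on the wrong parent raises ValueError.

```python
from xml.etree import ElementTree

XML = '''
    <XML_level_1 created="2014-08-19 16:55:02" userID="User@company">
        <XML_level_2 manufacturer="company" number="store-25235">
            <padUnits value="mm" />
            <partDescription value="Part description explained here" />
            <weight value="5.2" />
        </XML_level_2>
    </XML_level_1>
'''
root = ElementTree.fromstring(XML)

direct_matches = root.findall('weight')    # direct children of the root only
deep_matches = root.findall('.//weight')   # descendants at any depth
weight = deep_matches[0]

# The root is not the parent of weight, so remove() raises ValueError
try:
    root.remove(weight)
    removed_from_root = True
except ValueError:
    removed_from_root = False

# Removing through the actual parent works
level_2 = root.find('XML_level_2')
level_2.remove(weight)

print(direct_matches, len(deep_matches), removed_from_root)
```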

You could change your code as follows:

from xml.etree import ElementTree

xml_str = '''
    <XML_level_1 created="2014-08-19 16:55:02" userID="User@company">
        <XML_level_2 manufacturer="company" number="store-25235">
            <padUnits value="mm" />
            <partDescription value="Part description explained here" />
            <weight value="5.2" />
        </XML_level_2>
    </XML_level_1>
'''    

root = ElementTree.fromstring(xml_str)

for elem in root.iter():
    for child in list(elem):
        if child.tag == 'weight':
            elem.remove(child)

Explanation: root.iter() iterates over the entire tree in depth-first order, and list(elem) makes a copy of the list of direct children of a particular element, so removing a child does not disturb the loop. Filtering on the name (tag) weight then gives you references to both the parent and the child, which is exactly what remove() needs.
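
To illustrate why the list(elem) copy matters, here is a sketch that wraps the loop above in a helper (remove_all is a name I made up for this example) and compares it with a loop that removes children while iterating over the element directly. The test XML with several adjacent weight elements is also an assumption for demonstration purposes.

```python
from xml.etree import ElementTree

def remove_all(root, tag):
    """Remove every descendant of root named tag; return how many were removed."""
    removed = 0
    for elem in root.iter():
        for child in list(elem):  # iterate over a copy of the children
            if child.tag == tag:
                elem.remove(child)
                removed += 1
    return removed

XML = ('<p><weight value="1"/><weight value="2"/><weight value="3"/>'
       '<q><weight value="4"/></q></p>')

# Unsafe: removing from the element being iterated can skip siblings,
# just like removing items from a list while looping over it
unsafe_root = ElementTree.fromstring(XML)
for child in unsafe_root:
    if child.tag == 'weight':
        unsafe_root.remove(child)
unsafe_left = len(unsafe_root.findall('weight'))

# Safe: the helper iterates over a copy at every level
safe_root = ElementTree.fromstring(XML)
safe_removed = remove_all(safe_root, 'weight')
safe_left = len(safe_root.findall('.//weight'))

print(unsafe_left, safe_removed, safe_left)
```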

The Library seems to make no particular distinction between node and element although you would only find the term element in an XML context.

Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup. The logical and physical structures must nest properly, as described in 4.3.2 Well-Formed Parsed Entities.

Nicolella answered 20/5, 2016 at 0:42 Comment(2)
Hi @siwica, Thanks for your answer as well! It definitely helps to learn that "node.remove()" can only work with one level of depth at a time. In that way it is much like a bulldozer, rather than a backhoe with data. Your method is also the easiest to implement quickly in order to make the code work and readable on-the-go. :) However, I really appreciate seeing both yours and Rob's methods so I can store them for future use mentally.Stableboy
That is useful insight with the structural terminology of XML. It seemed intuitive at first compared to my initial feelings with encountering most languages, but I believe I'll set aside some time this weekend to read through "4.3.2 Well-Formed Parsed Entities" :)Stableboy
M
2

If you know that you only have one instance of the weight tag, you can avoid the pain of looping and just find the parent and child elements, then remove the child, e.g.:

from xml.etree import ElementTree

xml_root = ElementTree.parse(filename_path).getroot() # Path to example.xml
parent_element = xml_root.find('./XML_level_2')
weight_element = xml_root.find('./XML_level_2/weight')
parent_element.remove(weight_element)
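
Note that find() returns None when nothing matches, and passing None to remove() raises an error. A slightly more defensive sketch follows; pop_child is a hypothetical helper name, not an ElementTree function. It also returns the removed element so that its value attribute stays available for later use, as the question requires.

```python
from xml.etree import ElementTree

def pop_child(root, parent_path, tag):
    """Remove the first child named tag under parent_path and return it, or None."""
    parent = root.find(parent_path)
    if parent is None:
        return None
    child = parent.find(tag)
    if child is None:
        return None
    parent.remove(child)
    return child

XML = '''
    <XML_level_1 created="2014-08-19 16:55:02" userID="User@company">
        <XML_level_2 manufacturer="company" number="store-25235">
            <padUnits value="mm" />
            <partDescription value="Part description explained here" />
            <weight value="5.2" />
        </XML_level_2>
    </XML_level_1>
'''
root = ElementTree.fromstring(XML)

weight = pop_child(root, './XML_level_2', 'weight')
weight_value = weight.get('value') if weight is not None else None

# A second call finds nothing and returns None instead of raising
second_try = pop_child(root, './XML_level_2', 'weight')

print(weight_value, second_try)
```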

Mogul answered 7/2, 2019 at 11:40 Comment(0)
B
1

To add one more term to your growing vocabulary, consider XSLT, the special-purpose declarative language designed to transform XML documents for various end-use needs. In fact, an XSLT stylesheet is itself a well-formed XML file carrying scripting instructions! While Python's built-in xml.etree does not have an XSLT processor, the external lxml module (based on libxslt) maintains an XSLT 1.0 processor. Even more, XSLT is portable and can be used by other languages (Java, PHP, Perl, VB, even C++) or even dedicated executables (Saxon, Xalan) and command line interpreters (Bash, PowerShell).

You will notice that not one loop is used below. In the XSLT script, the Identity Transform copies the entire document as is, and the empty template matching weight (wherever it is located) removes it.

import lxml.etree as ET

xml_str = '''
    <XML_level_1 created="2014-08-19 16:55:02" userID="User@company">
        <XML_level_2 manufacturer="company" number="store-25235">
            <padUnits value="mm" />
            <partDescription value="Part description explained here" />
            <weight value="5.2" />
        </XML_level_2>
    </XML_level_1>
'''
dom = ET.fromstring(xml_str)

xslt_str = '''
    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
    <xsl:strip-space elements="*"/> 

      <!-- Identity Transform -->
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>    

      <!-- Empty Template -->
      <xsl:template match="weight"/>    
    </xsl:transform>
'''
xslt = ET.fromstring(xslt_str)

transform = ET.XSLT(xslt)                          # INITIALIZES TRANSFORMER
newdom = transform(dom)                            # RUNS TRANSFORMATION ON SOURCE XML
tree_out = ET.tostring(newdom, pretty_print=True)  # CONVERTS TREE OBJECT TO STRING
print(tree_out.decode("utf-8"))
Bloodhound answered 20/5, 2016 at 2:35 Comment(0)
