I have been converting some of my original xml.etree.ElementTree (ET) code to lxml.etree (lxmlET). Luckily there are a lot of similarities between the two. However, I did stumble upon some strange behaviour that I cannot find written down in any documentation. It concerns the internal representation of descendant nodes.
In ET, iter() is used to iterate over all descendants of an Element, optionally filtered by tag name. Because I could not find anything in the documentation saying otherwise, I expected the same behaviour from lxmlET. From testing, however, I conclude that lxmlET uses a different internal representation of a tree.
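For reference, this is the iter() behaviour I am relying on, shown with the standard library only. The small XML snippet and the variable names are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document, only here to illustrate iter().
root = ET.fromstring(
    '<root><node id="1"><node id="2"/><other/><node id="3"/></node></root>'
)

# Without an argument, iter() yields the element itself and all of its
# descendants in document order.
all_tags = [el.tag for el in root.iter()]

# With a tag name, only matching elements are yielded.
node_ids = [el.get('id') for el in root.iter('node')]

print(all_tags)  # ['root', 'node', 'node', 'other', 'node']
print(node_ids)  # ['1', '2', '3']
```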
In the example below, I iterate over the nodes in a tree and print each node's children. In addition, I create every proper, non-empty combination of those children and print those too. This means that if an element has children ('A', 'B', 'C'), I create variations of it, namely trees with the children [('A'), ('A', 'B'), ('A', 'C'), ('B'), ('B', 'C'), ('C')].
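To make the combination step concrete, here is a minimal sketch on plain strings instead of elements. It uses the same range(1, len(children)) bound as the code below, so the full set of children is never produced as a combination:

```python
from itertools import combinations

children = ('A', 'B', 'C')

# Sizes 1 .. len(children) - 1, so the complete tuple ('A', 'B', 'C') is skipped.
subsets = [
    combo
    for size in range(1, len(children))
    for combo in combinations(children, size)
]

print(subsets)
# [('A',), ('B',), ('C',), ('A', 'B'), ('A', 'C'), ('B', 'C')]
```

The order differs from the list in the text (grouped by size rather than by first child), but the set of combinations is the same.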
# import lxml.etree as ET
import xml.etree.ElementTree as ET
from itertools import combinations
from copy import deepcopy


def get_combination_trees(tree):
    children = list(tree)
    for i in range(1, len(children)):
        for combination in combinations(children, i):
            new_combo_tree = ET.Element(tree.tag, tree.attrib)
            for recombined_child in combination:
                new_combo_tree.append(recombined_child)
                # when using lxml a deepcopy is required to make this work (or make change in parse_xml)
                # new_combo_tree.append(deepcopy(recombined_child))
            yield new_combo_tree
    return None


def parse_xml(tree_p):
    for node in ET.fromstring(tree_p):
        if node.tag != 'node_main':
            continue
        # replace by node.xpath('.//node') for lxml (or use deepcopy in get_combination_trees)
        for subnode in node.iter('node'):
            children = list(subnode)
            if children:
                print('-'.join([child.attrib['id'] for child in children]))
            else:
                print(f'node {subnode.attrib["id"]} has no children')

            for combo_tree in get_combination_trees(subnode):
                combo_children = list(combo_tree)
                if combo_children:
                    print('-'.join([child.attrib['id'] for child in combo_children]))
    return None


s = '''<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
'''

parse_xml(s)
The expected output here is the ids of the children of each node joined together with a hyphen, followed by all possible combinations of those children (cf. supra), with the nodes visited top-down in document (depth-first) order.
2-3
2
3
node 2 has no children
4-6
4
6
5
node 5 has no children
node 6 has no children
However, when you use the lxml module instead of xml (uncomment the import for lxmlET and comment out the import for ET) and run the code, you'll see that the output is
2-3
2
3
node 2 has no children
So the deeper descendant nodes are never visited. This can be circumvented by either:
- using deepcopy (comment/uncomment the relevant part in get_combination_trees()), or
- using for subnode in node.xpath('.//node') in parse_xml() instead of iter().
So I know that there is a way around this, but I am mainly wondering what is happening?! It took me ages to debug this, and I can't find any documentation on it. What is going on, what is the actual underlying difference here between the two modules? And what is the most efficient work-around when working with very large trees?
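A reduced test case that seems to isolate the difference: in lxml an element has exactly one parent, so append() moves it out of its old parent, whereas ET simply adds another reference and leaves the old parent's child list alone. The helper name children_after_append is made up for this sketch, and the lxml half only runs if lxml is installed. Whether this alone explains why iter() stops early is my assumption, not something I have confirmed from the lxml internals.

```python
import xml.etree.ElementTree as ET

try:
    import lxml.etree as lxmlET
except ImportError:  # lxml is optional for this sketch
    lxmlET = None


def children_after_append(etree_module):
    """Append the first child of one element to another element and
    report the child ids of both afterwards."""
    parent = etree_module.fromstring(
        '<node id="1"><node id="2"/><node id="3"/></node>'
    )
    other = etree_module.Element('node')
    other.append(parent[0])
    return [c.get('id') for c in parent], [c.get('id') for c in other]


# ET: the child is now referenced by both elements.
et_result = children_after_append(ET)
print(et_result)  # (['2', '3'], ['2'])

# lxml: the child has been moved, so the original parent lost it.
lxml_result = children_after_append(lxmlET) if lxmlET is not None else None
print(lxml_result)  # (['3'], ['2']) when lxml is available
```

If that is indeed the cause, it would also fit both workarounds: deepcopy appends a copy instead of moving the original, and xpath() returns a plain list that was built before any node was moved.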
new_combo_tree element? Note that the behavioral difference doesn't necessarily imply a difference in internal representation. – Peltry
deepcopy to create node copies (compare towards the end of "Elements are Lists" in the docs). If you just need combinations and want optimal performance, work with lists of node references. – Zymogen
lxml vs ElementTree): https://mcmap.net/q/299618/-what-are-the-differences-between-lxml-and-elementtree/1959808 – Aleedis
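Following the suggestion about lists of node references, here is a minimal sketch that never appends anything and therefore never touches the tree. The generator name iter_child_combinations is hypothetical. It is shown with the standard library only; I would expect it to behave the same with lxml.etree because no element changes parent, but that part is an assumption.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def iter_child_combinations(element):
    """Yield tuples of child references (sizes 1 .. n-1) without
    creating new elements or modifying the tree."""
    children = list(element)
    for size in range(1, len(children)):
        yield from combinations(children, size)


xml_text = '''<root><node_main>
<node id="1"><node id="2"/><node id="3">
<node id="4"><node id="5"/></node><node id="6"/>
</node></node>
</node_main></root>'''

root = ET.fromstring(xml_text)

combo_ids = []
for subnode in root.iter('node'):
    for combo in iter_child_combinations(subnode):
        combo_ids.append('-'.join(child.get('id') for child in combo))

print(combo_ids)  # ['2', '3', '4', '6']

# The tree is untouched: all six nodes are still reachable.
remaining = len(list(root.iter('node')))
print(remaining)  # 6
```

This avoids both the deepcopy cost and the construction of throwaway elements, which may matter for very large trees. If real Element objects are needed for each combination, copies are presumably still required under lxml.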