Testing Equivalence of xml.etree.ElementTree
A

6

36

I'm interested in testing the equivalence of two XML elements. I've found that comparing the tostring output of the elements works, but that seems hacky.

Is there a better way to test equivalence of two etree Elements?

Comparing Elements directly:

import xml.etree.ElementTree as etree
h1 = etree.Element('hat',{'color':'red'})
h2 = etree.Element('hat',{'color':'red'})

h1 == h2  # False

Comparing Elements as strings:

etree.tostring(h1) == etree.tostring(h2)  # True
Apiarist answered 26/10, 2011 at 15:57 Comment(1)
A function to compare two Elements can be found in Itamar's answer below. - Freddafreddi
R
42

This compare function works for me:

def elements_equal(e1, e2):
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))
Recountal answered 22/6, 2014 at 9:29 Comment(5)
This is a solution. Make sure that whitespace does not interfere, e.g. by parsing with lxml's etree.XMLParser(remove_blank_text=True). It can be improved by not building a list inside all(); a generator expression is enough. Note that zip() is safe here because len() was tested before. - Freddafreddi
Neat! This seems to work regardless of element order, even for elements with the same tag names. - Christchurch
This does not work regardless of element order. For the same element with sub-elements in a different order, zip will pair up potentially differing elements, resulting in a False comparison. - Eustashe
@Eustashe If the element order differs, you'd want the comparison to return False, wouldn't you? Attribute order is a different story. - Overprize
Agreed - the comment I was replying to implied that it would. Some applications wouldn't care about order, however. - Eustashe
O
10

Comparing strings doesn't always work. The order of the attributes should not matter for considering two nodes equivalent. However, if you do string comparison, the order obviously matters.

I'm not sure if it is a problem or a feature, but my version of lxml.etree preserves the order of the attributes if they are parsed from a file or a string:

>>> from lxml import etree
>>> h1 = etree.XML('<hat color="blue" price="39.90"/>')
>>> h2 = etree.XML('<hat price="39.90" color="blue"/>')
>>> etree.tostring(h1) == etree.tostring(h2)
False

This might be version-dependent (I use Python 2.7.3 with lxml.etree 2.3.2 on Ubuntu); I remember that I couldn't find a way of controlling the order of the attributes a year ago or so, when I wanted to (for readability reasons).

As I need to compare XML files that were produced by different serializers, I see no other way than recursively comparing tag, text, attributes, and children of every node. And of course tail, if there's anything interesting there.
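To illustrate why comparing attributes directly sidesteps the ordering issue, here is a small sketch using the standard library parser (an assumption; the session above uses lxml). The attrib mapping of an ElementTree element is a plain dict, and dict equality does not depend on insertion order:

```python
import xml.etree.ElementTree as ET

h1 = ET.fromstring('<hat color="blue" price="39.90"/>')
h2 = ET.fromstring('<hat price="39.90" color="blue"/>')

# attrib is a dict, so equality ignores the order in which
# the attributes appeared in the source text.
same_tag = h1.tag == h2.tag
same_attributes = h1.attrib == h2.attrib
print(same_tag and same_attributes)
```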

Comparison of lxml and xml.etree.ElementTree

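The truth is that it may be implementation dependent. Apparently lxml uses an ordered dict or something similar, whereas the standard xml.etree.ElementTree in the Python 2.7 session below does not preserve the order of attributes (since Python 3.8, ET.tostring() does preserve the order given by the user):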

Python 2.7.1 (r271:86832, Nov 27 2010, 17:19:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> h1 = etree.XML('<hat color="blue" price="39.90"/>')
>>> h2 = etree.XML('<hat price="39.90" color="blue"/>')
>>> etree.tostring(h1) == etree.tostring(h2)
False
>>> etree.tostring(h1)
'<hat color="blue" price="39.90"/>'
>>> etree.tostring(h2)
'<hat price="39.90" color="blue"/>'
>>> etree.dump(h1)
<hat color="blue" price="39.90"/>>>> etree.dump(h2)
<hat price="39.90" color="blue"/>>>>

(Yes, the newlines are missing. But it is a minor problem.)

>>> import xml.etree.ElementTree as ET
>>> h1 = ET.XML('<hat color="blue" price="39.90"/>')
>>> h1
<Element 'hat' at 0x2858978>
>>> h2 = ET.XML('<hat price="39.90" color="blue"/>')
>>> ET.dump(h1)
<hat color="blue" price="39.90" />
>>> ET.dump(h2)
<hat color="blue" price="39.90" />
>>> ET.tostring(h1) == ET.tostring(h2)
True
>>> ET.dump(h1) == ET.dump(h2)
<hat color="blue" price="39.90" />
<hat color="blue" price="39.90" />
True

Another question may be what is considered unimportant when comparing. For example, some fragments may contain extra spaces that we do not want to care about. For that reason, it is always better to write a serializing function that works exactly as we need.
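One possible shape for such a serializing function, offered as a sketch rather than a definitive answer: on Python 3.8 or later, the standard library provides ET.canonicalize(), which writes attributes in a fixed order and can strip surrounding whitespace via strip_text=True. The helper name canonical_string is made up for this example, and whether stripping whitespace is acceptable depends on your data:

```python
import xml.etree.ElementTree as ET

def canonical_string(element):
    """Serialize an Element to C14N 2.0 text, ignoring surrounding whitespace."""
    xml_text = ET.tostring(element, encoding='unicode')
    return ET.canonicalize(xml_text, strip_text=True)

h1 = ET.fromstring('<hat color="blue" price="39.90"><band>silk</band></hat>')
h2 = ET.fromstring('<hat price="39.90" color="blue">\n  <band>silk</band>\n</hat>')
h3 = ET.fromstring('<hat color="blue" price="45.00"><band>silk</band></hat>')

print(canonical_string(h1) == canonical_string(h2))
print(canonical_string(h1) == canonical_string(h3))
```

Child element order is still significant in the canonical form; only attribute order and, with strip_text, surrounding whitespace are normalized.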

Overprize answered 25/9, 2012 at 21:3 Comment(2)
.dump(...) returns None, so ET.dump(h1) == ET.dump(h2) is actually comparing None to None. - Quilmes
About the attribute order: it is a feature. See the lxml FAQ entry "How can I sort the attributes?": lxml.de/FAQ.html#how-can-i-sort-the-attributes - Chilopod
S
4

Serializing and deserializing won't work for XML because attributes are not order dependent (among other reasons). For example, these two elements are logically the same but are different strings:

<THING a="foo" b="bar"></THING>
<THING b="bar" a="foo"  />

Exactly how to do an element comparison is tricky. As far as I can tell, there is nothing built into ElementTree to do this for you. I needed to do this myself and used the code below. It works for my needs, but it's not suitable for large XML structures and is not fast or efficient! This is an ordering function rather than an equality function, so a result of 0 means equal and anything else means not equal. Wrapping it in a function that returns True or False is left as an exercise for the reader!

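from functools import cmp_to_key

def cmp_el(a, b):
    # Python 3 version: a missing tail (None) is treated as an empty string,
    # because None cannot be ordered against str.
    atail, btail = a.tail or '', b.tail or ''
    if a.tag < b.tag:
        return -1
    elif a.tag > b.tag:
        return 1
    elif atail < btail:
        return -1
    elif atail > btail:
        return 1

    #compare attributes
    aitems = sorted(a.attrib.items())
    bitems = sorted(b.attrib.items())
    if aitems < bitems:
        return -1
    elif aitems > bitems:
        return 1

    #compare child nodes
    achildren = sorted(a, key=cmp_to_key(cmp_el))
    bchildren = sorted(b, key=cmp_to_key(cmp_el))

    for achild, bchild in zip(achildren, bchildren):
        cmpval = cmp_el(achild, bchild)
        if cmpval < 0:
            return -1
        elif cmpval > 0:
            return 1

    # zip() stops at the shorter list, so compare the child counts as well
    if len(achildren) != len(bchildren):
        return -1 if len(achildren) < len(bchildren) else 1

    #must be equal
    return 0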
Stallfeed answered 28/8, 2013 at 12:55 Comment(1)
The main cause of problems when comparing two XML files is different formatting, as described above. Most of the time the problem lies in spaces or newlines in the tail section. I had two logically identical XML files for testing and the code did not recognize them as the same. But when I removed the .tail comparison from the code, it worked like a charm! - Unknit
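For readers who want a True/False answer that ignores both child order and formatting whitespace, here is a different technique from the ordering function above: build a sortable key for each element and compare the keys. This is only a sketch under stated assumptions. The names canonical_key and equal_unordered are made up, whitespace around text is assumed to be insignificant, and tails are ignored by default as the comment above suggests:

```python
import xml.etree.ElementTree as ET

def canonical_key(el, ignore_tail=True):
    """Return a nested tuple that is identical for equivalent elements."""
    text = (el.text or '').strip()
    tail = '' if ignore_tail else (el.tail or '').strip()
    attrs = tuple(sorted(el.attrib.items()))
    # Sorting the child keys makes the result independent of child order.
    children = tuple(sorted(canonical_key(c, ignore_tail) for c in el))
    return (el.tag, text, tail, attrs, children)

def equal_unordered(a, b, ignore_tail=True):
    return canonical_key(a, ignore_tail) == canonical_key(b, ignore_tail)

a = ET.fromstring('<THING a="foo" b="bar"><x n="1"/><y>t</y></THING>')
b = ET.fromstring('<THING b="bar" a="foo">\n  <y>t</y>\n  <x n="1"/>\n</THING>')
c = ET.fromstring('<THING a="foo" b="bar"><x n="2"/><y>t</y></THING>')

print(equal_unordered(a, b))
print(equal_unordered(a, c))
```

Like the original function, this does extra work on every level of the tree, so it is not intended for very large documents.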
A
3

Believe it or not that is actually the best way to handle comparing two nodes if you don't know how many children each may have and you want to include all children in the search.

Of course, if you simply have a childless node like the one you are demonstrating, you can simply compare the tag, attrib, and tail properties:

if h1.tag == h2.tag and h1.attrib == h2.attrib and h1.tail == h2.tail:
    print("h1 and h2 are the same")
else:
    print("h1 and h2 are different")

I don't see any major benefit of this over using tostring, however.

Antihistamine answered 26/10, 2011 at 16:59 Comment(1)
You can also throw in text according to your needs: h1.text == h2.text - Dyspeptic
D
2

A common way to compare complex structures is to dump them into a common, unique textual representation and compare the resulting strings for equality.

To compare two received JSON strings, you would convert them to JSON objects, then convert them back to strings (with the same converter) and compare. I did this to check JSON feeds, and it works well.
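The JSON round trip described above can be sketched as follows. The helper name normalize_json and the sample feeds are invented for illustration; sort_keys=True is what makes the output independent of key order:

```python
import json

def normalize_json(text):
    """Parse, then re-serialize with sorted keys and fixed separators."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(',', ':'))

feed_a = '{"name": "hat", "tags": ["red", "wool"], "price": 39.9}'
feed_b = '{ "price": 39.9,\n  "tags": ["red", "wool"],\n  "name": "hat" }'
feed_c = '{"name": "hat", "tags": ["wool", "red"], "price": 39.9}'

print(normalize_json(feed_a) == normalize_json(feed_b))  # key order and spacing differ
print(normalize_json(feed_a) == normalize_json(feed_c))  # list order differs
```

List order is preserved by the round trip, so feeds that differ only in array order still compare as different.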

For XML, it is almost the same, but you may have to handle (strip? remove?) the ".text" and ".tail" parts (the text, blank or not, that may be found between tags).

So in short, your solution is not a hack, as long as you make sure two equivalent XMLs (according to your context) will have the same string representation.

Decathlon answered 26/10, 2011 at 16:21 Comment(0)
N
-1

Do not gold plate. The comparison you have is a good one. In the end, XML is text.

Nawab answered 26/10, 2011 at 16:0 Comment(1)
Yes, and if you are concerned about formatting, convert to ET, then dump to string and compare. - Rozanneroze
