Can ElementTree be told to preserve the order of attributes?
U

12

34

I've written a fairly simple filter in Python using ElementTree to munge the contents of some XML files. And it works, more or less.

But it reorders the attributes of various tags, and I'd like it to not do that.

Does anyone know a switch I can throw to make it keep them in specified order?

Context for this

I'm working with and on a particle physics tool that has a complex, but oddly limited configuration system based on xml files. Among the many things set up that way are the paths to various static data files. These paths are hardcoded into the existing xml and there are no facilities for setting or varying them based on environment variables, and in our local installation they are necessarily in a different place.

This isn't a disaster because the combined source- and build-control tool we're using allows us to shadow certain files with local copies. But even though the data fields are static, the xml isn't, so I've written a script for fixing the paths. With the attribute rearrangement, though, diffs between the local and master versions are harder to read than necessary.


This is my first time taking ElementTree for a spin (and only my fifth or sixth python project) so maybe I'm just doing it wrong.

Abstracted for simplicity the code looks like this:

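import xml.etree.ElementTree as ET

tree = ET.parse(inputfile)
for e in tree.iter():  # iter() replaces getiterator(), which was removed in Python 3.9
    e.text = filter(e.text)  # filter() is my own path-fixing function, not the builtin
tree.write(outputfile)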

Reasonable or dumb?



Undistinguished answered 29/4, 2010 at 23:48 Comment(5)
is there no real solution to this? etree in python 3.4 does not preserve attributes? or does it with some settings?? Thanks for the help!Openair
@Openair Look at the accepted answer...Undistinguished
I was hoping for a non-monkey-patch solution =) Sadly, it looks like there is nothing better for now. This question is especially relevant if the XML should stay hand-editable and user-friendly to read. I almost think I am going for regex substitutions to modify the xml. It sucks, but the layout is then preserved (including formatting like indentation and line breaks).Openair
If your goal is a reasonable diff, consider keeping the canonical copy of your file in c14n format. That way you can re-canonicalize any modified version and get a diff that only includes semantically-relevant changes.Xiphoid
It isn't documented anywhere, but apparently python 3.8 fixes this.Extant
P
25

With help from @bobince's answer and these two (setting attribute order, overriding module methods)

I managed to get this monkey-patched. It's dirty, and I'd suggest using another module that handles this scenario better, but when that isn't a possibility:

# =======================================================================
# Monkey patch ElementTree
import xml.etree.ElementTree as ET

def _serialize_xml(write, elem, encoding, qnames, namespaces):
    tag = elem.tag
    text = elem.text
    if tag is ET.Comment:
        write("<!--%s-->" % ET._encode(text, encoding))
    elif tag is ET.ProcessingInstruction:
        write("<?%s?>" % ET._encode(text, encoding))
    else:
        tag = qnames[tag]
        if tag is None:
            if text:
                write(ET._escape_cdata(text, encoding))
            for e in elem:
                _serialize_xml(write, e, encoding, qnames, None)
        else:
            write("<" + tag)
            items = elem.items()
            if items or namespaces:
                if namespaces:
                    for v, k in sorted(namespaces.items(),
                                       key=lambda x: x[1]):  # sort on prefix
                        if k:
                            k = ":" + k
                        write(" xmlns%s=\"%s\"" % (
                            k.encode(encoding),
                            ET._escape_attrib(v, encoding)
                            ))
                #for k, v in sorted(items):  # lexical order
                for k, v in items: # Monkey patch
                    if isinstance(k, ET.QName):
                        k = k.text
                    if isinstance(v, ET.QName):
                        v = qnames[v.text]
                    else:
                        v = ET._escape_attrib(v, encoding)
                    write(" %s=\"%s\"" % (qnames[k], v))
            if text or len(elem):
                write(">")
                if text:
                    write(ET._escape_cdata(text, encoding))
                for e in elem:
                    _serialize_xml(write, e, encoding, qnames, None)
                write("</" + tag + ">")
            else:
                write(" />")
    if elem.tail:
        write(ET._escape_cdata(elem.tail, encoding))

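ET._serialize_xml = _serialize_xml
# write() looks the serializer up in this dict, so patch it too;
# otherwise the root element's attributes are still sorted
ET._serialize['xml'] = _serialize_xml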

from collections import OrderedDict

class OrderedXMLTreeBuilder(ET.XMLTreeBuilder):
    def _start_list(self, tag, attrib_in):
        fixname = self._fixname
        tag = fixname(tag)
        attrib = OrderedDict()
        if attrib_in:
            for i in range(0, len(attrib_in), 2):
                attrib[fixname(attrib_in[i])] = self._fixtext(attrib_in[i+1])
        return self._target.start(tag, attrib)

# =======================================================================

Then in your code:

tree = ET.parse(pathToFile, OrderedXMLTreeBuilder())
Procyon answered 17/6, 2015 at 21:19 Comment(6)
Wow. In the years since I asked this question the offending tool has been re-structured to allow persistent local overrides so that my original need has disappeared and I've moved on to different, if not greener, pastures and don't even use the fixed version any more. None the less, I am sure that someone still has this need.Undistinguished
@dmckee : you are totally right. This question is still relevant and the patch can't be the correct way to solve this .Boater
is there a solution now for python 3.4? Did the etree implementation change to allow this?Openair
"Another module that better handles this scenario" Do you have any specific ones in mind?Standin
Note: patching ET._serialize_xml is NOT enough if you want root node attributes to preserve the order as well! Also put the patched _serialize_xml into ET._serialize['xml'] and voilà, you get that too! :]Manicotti
The answer below https://mcmap.net/q/426148/-can-elementtree-be-told-to-preserve-the-order-of-attributes is a much simpler "monkey-patch" to preserve the output order. I point out that it does not fix round-trip issues (parsing into an element tree, then outputting), but I think neither does this answer.Siccative
R
19

Nope. ElementTree uses a dictionary to store attribute values, so it's inherently unordered.

Even DOM doesn't guarantee you attribute ordering, and DOM exposes a lot more detail of the XML infoset than ElementTree does. (There are some DOMs that do offer it as a feature, but it's not standard.)

Can it be fixed? Maybe. Here's a stab at it that replaces the dictionary when parsing with an ordered one (collections.OrderedDict()).

from xml.etree import ElementTree
from collections import OrderedDict
import StringIO

class OrderedXMLTreeBuilder(ElementTree.XMLTreeBuilder):
    def _start_list(self, tag, attrib_in):
        fixname = self._fixname
        tag = fixname(tag)
        attrib = OrderedDict()
        if attrib_in:
            for i in range(0, len(attrib_in), 2):
                attrib[fixname(attrib_in[i])] = self._fixtext(attrib_in[i+1])
        return self._target.start(tag, attrib)

>>> xmlf = StringIO.StringIO('<a b="c" d="e" f="g" j="k" h="i"/>')

>>> tree = ElementTree.ElementTree()
>>> root = tree.parse(xmlf, OrderedXMLTreeBuilder())
>>> root.attrib
OrderedDict([('b', 'c'), ('d', 'e'), ('f', 'g'), ('j', 'k'), ('h', 'i')])

Looks potentially promising.

>>> s = StringIO.StringIO()
>>> tree.write(s)
>>> s.getvalue()
'<a b="c" d="e" f="g" h="i" j="k" />'

Bah, the serialiser outputs them in canonical order.

This looks like the line to blame, in ElementTree._write:

            items.sort() # lexical order

Subclassing or monkey-patching that is going to be annoying as it's right in the middle of a big method.

Unless you did something nasty like subclass OrderedDict and hack items to return a special subclass of list that ignores calls to sort(). Nah, probably that's even worse and I should go to bed before I come up with anything more horrible than that.

Ryle answered 30/4, 2010 at 1:16 Comment(1)
Very nice OrderedXmlTreeBuilder in code above! It can be used with ltree and serialization will be fixed too. Thank you very much for this.Gram
N
13

The best option is to use the lxml library (http://lxml.de/). Installing lxml and simply switching the import did the trick for me.

#import xml.etree.ElementTree as ET
from lxml import etree as ET
Nagpur answered 23/1, 2018 at 17:32 Comment(3)
thdox already posted that suggestion.Undistinguished
@dmckee : you are right. I totally missed that answer.Nagpur
It did work for me as well, Thanks a lot for the answer.Lolitaloll
D
9

Yes, with lxml

>>> from lxml import etree
>>> root = etree.Element("root", interesting="totally")
>>> etree.tostring(root)
b'<root interesting="totally"/>'
>>> print(root.get("hello"))
None
>>> root.set("hello", "Huhu")
>>> print(root.get("hello"))
Huhu
>>> etree.tostring(root)
b'<root interesting="totally" hello="Huhu"/>'

Here is a direct link to the documentation, from which the above example is slightly adapted.

Also note that lxml has, by design, good API compatibility with the standard xml.etree.ElementTree.

Deflower answered 1/1, 2016 at 21:50 Comment(6)
Are you sure that lxml preserves the attribute order? The documentation seems to say the opposite.Armored
I simplified the example from the documentation and tried it with my Python 3.4; the example provided here is pasted from my terminal. At least it worked for me. Also the documentation, at least at the URL I provided, clearly states that it preserves order: not the lexical order, but the order asked about in this Stack Overflow question.Deflower
No offence, but the question is about preserving the order of the attributes of an element. The documentation of lxml (on your link) says: "Attributes are just unordered name-value pairs...". I did not find anything about preserving the order of element attributes from the XML source. The tricky part of the question is that the author has more strict needs than those that are guaranteed by XML format -- which is understandable, but probably not implemented by lxml.Armored
My understanding of "Attributes are just unordered name-value pairs..." is that, contrary to xml.etree.ElementTree, which orders attributes lexically, lxml is able to keep the non-lexical order, that is, something like FIFO order here. When you say "I did not find anything about preserving the order of element attributes from the XML source.", I would read the xml file with lxml (note the 'l'), and when writing, I would explicitly choose the order I want, using the above example.Deflower
Is the preserving the order of element attributes documented for lxml? I did not find it, and I cannot rely on any guess based on any observation.Armored
This seems to work in my experience. I've just been writing a script to alter the AndroidManifest.xml file in .apk files and lxml.etree preserves the attribute order while xml.etree.ElementTree doesn't. As an added bonus it also preserves namespace alias names (which xml.etree.ElementTree fails to do)! Gets top marks from me.....Mcginnis
E
6

This has been "fixed" in python 3.8. I can't find any notes about it anywhere, but it works now.

D:\tmp\etree_order>type etree_order.py
import xml.etree.ElementTree as ET

a = ET.Element('a', {"aaa": "1", "ccc": "3", "bbb": "2"})

print(ET.tostring(a))
D:\tmp\etree_order>C:\Python37-64\python.exe etree_order.py
b'<a aaa="1" bbb="2" ccc="3" />'

D:\tmp\etree_order>c:\Python38-64\python.exe etree_order.py
b'<a aaa="1" ccc="3" bbb="2" />'
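A round-trip sketch of the same behaviour, assuming Python 3.8 or later: parsing a document and serializing it again should keep the attributes in source order, and an attribute added with set() is appended at the end. The sample document and the attribute name `z` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: attributes deliberately not in alphabetical order.
source = '<a b="c" d="e" f="g" j="k" h="i" />'

root = ET.fromstring(source)
root.set("z", "new")  # newly set attributes go to the end

# encoding="unicode" returns a str instead of bytes
result = ET.tostring(root, encoding="unicode")
print(result)
```

On older interpreters the serializer sorts the attributes, so the same script prints them in alphabetical order instead.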
Extant answered 11/2, 2020 at 20:25 Comment(2)
This is not mentioned in What’s New In Python 3.8, but it is mentioned in the documentation for the tostring(), tostringlist() and dump() functions and the write() method.Hamamelidaceous
The documentation for ElementTree.write method states: "Changed in version 3.8: The write() method now preserves the attribute order specified by the user."Mho
G
5

Wrong question. Should be: "Where do I find a diff gadget that works sensibly with XML files?"

Answer: Google is your friend. First result for search on "xml diff" => this. There are a few more possibles.

Glaucoma answered 30/4, 2010 at 2:1 Comment(4)
Always happy to see an alternate solution. Thanks.Undistinguished
In a perfect world, yes. However, sometimes we don't get to choose all the components of our toolset--for example, if your version control system can't be taught to diff XML files semantically, and you can't change to a different one.Teak
How do I integrate the tool with Github, Stash or any other web interface to a version control system?Caracaraballo
In many cases xml files are just obscure artifacts in a Git repository. It is then more sensible, imo, to minimize the default diff than to require the entire work group to install a tool to handle a dying file format. My responsibility in a team is not to mess up all the other members' diffs. That is not done by requiring them to install a special tool. So I disagree regarding the original question's usefulness.Meza
V
3

From section 3.1 of the XML recommendation:

Note that the order of attribute specifications in a start-tag or empty-element tag is not significant.

Any system that relies on the order of attributes in an XML element is going to break.
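As a rough illustration of what "not significant" means in practice, here is a minimal sketch of an order-insensitive comparison using the standard library. The helper name `elements_equal` and the sample documents are assumptions made for this example, not part of any library.

```python
import xml.etree.ElementTree as ET

def elements_equal(a, b):
    """Compare two elements, ignoring attribute order and surrounding whitespace."""
    # attrib is a dict, and dict equality does not depend on key order
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    if (a.tail or "").strip() != (b.tail or "").strip():
        return False
    if len(a) != len(b):
        return False
    return all(elements_equal(x, y) for x, y in zip(a, b))

# Hypothetical documents: same content, attributes in a different order.
first = ET.fromstring('<file path="/data/a.dat" kind="static"><note>x</note></file>')
second = ET.fromstring('<file kind="static" path="/data/a.dat"><note>x</note></file>')
changed = ET.fromstring('<file path="/other" kind="static"><note>x</note></file>')

same = elements_equal(first, second)
different = elements_equal(first, changed)
print(same, different)
```

A comparison like this treats the two reordered documents as identical, which is the property the recommendation describes; it says nothing about how readable a textual diff will be.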

Village answered 1/5, 2010 at 8:9 Comment(1)
This is not necessarily about correctness, but about maintaining minimal diff.Caracaraballo
S
3

This is a partial solution, for the case where xml is being emitted and a predictable order is desired. It does not solve round-trip parsing and writing. Both 2.7 and 3.x (before 3.8) use sorted() to force an attribute ordering. So this code, in conjunction with an OrderedDict to hold the attributes, will make the order of the xml output match the order used to create the Elements.

from collections import OrderedDict
from xml.etree import ElementTree as ET

# Make sorted() a no-op for the ElementTree module
ET.sorted = lambda x: x

try:
    # Python 3 uses the C implementation by default; prevent that
    ET.Element = ET._Element_Py
    # similarly, override SubElement method if desired
    def SubElement(parent, tag, attrib=OrderedDict(), **extra):
        attrib = attrib.copy()
        attrib.update(extra)
        element = parent.makeelement(tag, attrib)
        parent.append(element)
        return element
    ET.SubElement = SubElement
except AttributeError:
    pass  # nothing else for python2, ElementTree is pure python

# Make an element with a particular "meaningful" ordering
t = ET.ElementTree(ET.Element('component',
                       OrderedDict([('grp','foo'),('name','bar'),
                                    ('class','exec'),('arch','x86')])))
# Add a child element
ET.SubElement(t.getroot(),'depend',
              OrderedDict([('grp','foo'),('name','util1'),('class','lib')]))
x = ET.tostring(t.getroot())
print(x)
# Order maintained...
# <component grp="foo" name="bar" class="exec" arch="x86"><depend grp="foo" name="util1" class="lib" /></component>

# Parse again, won't be ordered because Elements are created
#   without ordered dict
print(ET.tostring(ET.fromstring(x)))
# <component arch="x86" name="bar" grp="foo" class="exec"><depend name="util1" grp="foo" class="lib" /></component>

The problem with parsing XML into an element tree is that the code internally creates plain dicts which are passed in to Element(), at which point the order is lost. No equivalent simple patch is possible.

Siccative answered 21/11, 2017 at 21:21 Comment(1)
It works for me, and it's simple enough!Madelene
C
2

I have had this problem. I first looked for a Python script to canonicalize the XML but didn't find one, then started thinking about writing one. In the end, xmllint solved it.

Castellanos answered 18/6, 2013 at 9:3 Comment(1)
Since then I have had a similar problem with RDF (an xml subset), which I solved with inner views and sorting those views alphabetically.Castellanos
G
0

I used the accepted answer above, with both statements:

ET._serialize_xml = _serialize_xml
ET._serialize['xml'] = _serialize_xml

While this fixed the ordering in every node, attribute ordering on new nodes inserted from copies of existing nodes failed to preserve without a deepcopy. Watch out for reusing nodes to create others... In my case I had an element with several attributes, so I wanted to reuse them:

to_add = ET.fromstring(ET.tostring(contract))
to_add.attrib['symbol'] = add
to_add.attrib['uniqueId'] = add
contracts.insert(j + 1, to_add)

The fromstring(tostring) will reorder the attributes in memory. It may not result in the alpha sorted dict of attributes, but it also may not have the expected ordering.

import copy

to_add = copy.deepcopy(contract)
to_add.attrib['symbol'] = add
to_add.attrib['uniqueId'] = add
contracts.insert(j + 1, to_add)

Now the ordering persists.
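For comparison, a self-contained sketch of the copy-and-modify pattern on a current Python (3.8 or later assumed, where no patch is needed). The sample document and attribute values are made up; only the names `contract`, `contracts` and `to_add` come from the snippet above.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical data standing in for the real contracts file.
contracts = ET.fromstring(
    '<contracts><contract symbol="AAA" uniqueId="AAA" exchange="X" /></contracts>'
)
contract = contracts[0]

to_add = copy.deepcopy(contract)   # independent copy, attributes included
to_add.attrib["symbol"] = "BBB"    # updating an existing key keeps its position
to_add.attrib["uniqueId"] = "BBB"
contracts.insert(1, to_add)

attribute_order = list(to_add.attrib)
original_symbol = contract.get("symbol")
print(attribute_order, original_symbol)
```

The deep copy also guarantees that editing `to_add` leaves the original `contract` element untouched.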

Genesis answered 30/7, 2018 at 1:23 Comment(1)
reusing a node? I wasn't able to comment so I added it as a complement to the accepted answer. It is to caution anyone that also wants to copy an existing and insert it with some values changed back into the tree. If one wants to do this, the accepted answer fails without a deepcopy.Genesis
H
0

I would recommend using lxml (as others have as well). If you need the attributes in the order required by the c14n v1 or v2 standards (https://www.w3.org/TR/xml-c14n2/), i.e. increasing lexicographic order, lxml supports this nicely through an output method (see the C14N heading at https://lxml.de/api.html).

For example:

from lxml import etree as ET
element = ET.Element('Test', B='beta', Z='omega', A='alpha')
val = ET.tostring(element, method="c14n")
print(val)
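If lxml is not available, a similar result can probably be had from the standard library, which has offered a C14N 2.0 serializer, xml.etree.ElementTree.canonicalize(), since Python 3.8. The sketch below is an assumption-laden illustration using the same made-up element as above, written out as two differently ordered strings.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same element with different attribute order.
one = ET.canonicalize('<Test B="beta" Z="omega" A="alpha"/>')
two = ET.canonicalize('<Test A="alpha" Z="omega" B="beta"/>')

# Canonical form sorts the attributes, so both strings come out identical.
print(one)
print(one == two)
```

Keeping files in canonical form trades the original attribute order for a stable one, which is usually enough to keep diffs small.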
Henchman answered 4/5, 2021 at 16:24 Comment(0)
T
-2

Running the script under Python 3.8 or later preserves the order of the attributes in the xml files.

Tintinnabulation answered 18/6, 2020 at 11:25 Comment(1)
There is no new information here. See https://mcmap.net/q/426148/-can-elementtree-be-told-to-preserve-the-order-of-attributes.Hamamelidaceous
