Write .xml in Python with pretty print and encoding declaration

Asked 27/11, 2017 at 20:13 Answered 9/9, 2021 at 11:28

Solved xml python-2.7 utf-8 pretty-print

I have to create an .xml file that has pretty print and also the encoding declaration. It should look like this:

<?xml version='1.0' encoding='utf-8'?>
<main>
    <sub>
        <name>Ana</name>
        <detail />
        <type>smart</type>
    </sub>
</main>

I know how to get the pretty print and the declaration, but not at the same time. To obtain the UTF-8 declaration, but no pretty print, I use the code below:

f = open(xmlPath, "w")
et.write(f, encoding='utf-8', xml_declaration=True) 
f.close()

But if I want to get the pretty print, I have to convert the XML tree into a string, and then I lose the declaration. I use this code:

from xml.dom import minidom
xmlstr = minidom.parseString(ET.tostring(root)).toprettyxml(indent="   ")
with open(xmlPath, "w") as f:
    f.write(xmlstr.encode('utf-8'))
    f.close()

With this last code, I get the pretty print, except that the first line is:

<?xml version="1.0" ?>

I might just as well replace this with

<?xml version='1.0' encoding='utf-8'?>

but I don't find this to be the most pythonesque method.

I use the xml module and I prefer not to install extra modules because the script has to be run from various computers with standard Python. But if it's not possible, I will install other modules.

Later Edit:

In the end, with Lenz's help, I use this:

#ET=lxml.etree
xmlPath=os.path.join(output_folderXML ,"test.xml")
xmlstr= ET.tostring(root, encoding='UTF-8', xml_declaration=True, pretty_print=True)
with open(xmlPath, "w") as f:
    f.write(xmlstr)
    f.close()

I need to know if it is safe to write the result of the "tostring" method to the .xml file in "w" mode, not "wb". As I said in one of the comments below, with "wb" I don't get the pretty print when I open the xml file in Notepad, but with "w", I do. Also, I have checked the xml file written in "w" mode and the special characters like "ü" are there. I only need a competent opinion that what I do is technically OK.

Revisionist answered 27/11, 2017 at 20:13 Comment(0)

The most elegant solution is certainly using the third-party library lxml, which is being used a lot – for good reasons. It offers both a pretty_print and an xml_declaration parameter in the tostring() method, so you get both. And the API is quite close to that of the std-lib ElementTree, which you seem to be using now. Here's an example:

>>> from lxml import etree
>>> doc = etree.parse(xmlPath)
>>> print etree.tostring(doc, encoding='UTF-8', xml_declaration=True,
                         pretty_print=True)
<?xml version='1.0' encoding='UTF-8'?>
<main>
  <sub>
    <name>Ana</name>
    <detail/>
    <type>smart</type>
  </sub>
</main>

However, I understand your desire to use the "included batteries" only. As far as I can see, xml.etree.ElementTree has no means of changing the indentation automatically. But the minidom work-around has a solution to getting both pretty-printing and a full declaration: use the encoding parameter of the toprettyxml() method!

>>> doc = minidom.parseString(ET.tostring(root))
>>> print doc.toprettyxml(encoding='utf8')
<?xml version="1.0" encoding="utf8"?>
<main>
    <sub>
        <name>Ana</name>
        <detail/>
        <type>smart</type>
    </sub>
</main>

(Be aware that the returned string is already encoded and that you should write it to a file opened in binary mode ("wb") and without further encoding.)
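For readers on Python 3.9 or later (not the Python 2.7 this question is tagged with), the standard library gained ET.indent(), so the minidom round trip may not be needed. The following is only a minimal sketch under that assumption; the tree is rebuilt from the desired output in the question, and the output path is a placeholder in the temp directory.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Rebuild the tree from the question's desired output (illustrative only).
root = ET.Element("main")
sub = ET.SubElement(root, "sub")
ET.SubElement(sub, "name").text = "Ana"
ET.SubElement(sub, "detail")
ET.SubElement(sub, "type").text = "smart"

# ET.indent() exists only in Python 3.9+; it adds indentation whitespace in place.
ET.indent(root, space="    ")

# With a byte encoding and xml_declaration=True, tostring() returns bytes
# that begin with the XML declaration.
xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# Placeholder path; the bytes are already encoded, so the file is opened in binary mode.
xml_path = os.path.join(tempfile.gettempdir(), "test.xml")
with open(xml_path, "wb") as f:
    f.write(xml_bytes)
```

Because the result is already a byte string, it is written as-is, which keeps the file contents in line with the declared encoding.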

Thracophrygian answered 27/11, 2017 at 22:48 Comment(5)

With your solution I got the desired result, although for me it works only if I use 'utf8' instead of 'utf-8'. I would also appreciate it if you could show the lxml method as well, or give me a link to it. – Revisionist 28/11, 2017 at 7:21

@Revisionist I added an example using lxml. Concerning the spelling of the encoding, I can't see a difference for using "utf-8" or "utf8". However, toprettyxml() is confused when the input string contains newlines and inserts blank lines. – Thracophrygian 28/11, 2017 at 9:13

I have another problem now. I decided to use the lxml module eventually. I have two variants: toXml = ET.ElementTree(root); toXml.write(xmlPath, xml_declaration=True, encoding='utf-8', pretty_print=True), or your method of opening a file in 'wb' and writing the string there. The problem is that when I open the xml in Notepad (basic), the text is not prettified. In a browser or in Notepad++ it looks OK, but I need the pretty print in Notepad. I noticed that if I write the string into a file in "w" mode (not "wb"), the text looks OK in Notepad as well. Is it so important to use the "wb" mode? – Revisionist 3/12, 2017 at 16:0

Well, I guess the problem is with the newlines then: some Windows programs refuse to recognise Unix newlines (LF "line feed", Python's "\n"), and require two characters per line break (CR+LF, "\r\n"). In text mode on Windows, Python substitutes "\n" with "\r\n" when writing; that's why it happens with "w" mode. However, you might run into encoding problems eventually if you write bytestrings in text mode, and it won't work in Python 3 (you'll have to switch too, at some point). I'd suggest you do xmlstr.replace(b'\n', b'\r\n') on the serialised document before writing. – Thracophrygian 3/12, 2017 at 21:9

Thank you so much! Writing in "wb" mode and doing xmlstr.replace(b'\n', b'\r\n') makes, indeed, the xml pretty when opened in Notepad! – Revisionist 4/12, 2017 at 7:58
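The newline fix discussed in these comments could look roughly like the sketch below. It uses the standard-library minidom route rather than lxml so that it runs without extra packages; the helper name to_crlf, the sample tree, and the output path are made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from xml.dom import minidom


def to_crlf(data: bytes) -> bytes:
    """Convert line endings in an encoded document to CR+LF (hypothetical helper)."""
    # Normalise existing CR+LF pairs first so they are not doubled.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


# Small sample tree, assumed for illustration.
root = ET.Element("main")
ET.SubElement(root, "name").text = "Ana"

# toprettyxml(encoding=...) returns bytes that include the declaration.
xml_bytes = minidom.parseString(ET.tostring(root)).toprettyxml(
    indent="    ", encoding="utf-8"
)
crlf_bytes = to_crlf(xml_bytes)

# Binary mode writes the bytes unchanged, so Python does no newline translation.
xml_path = os.path.join(tempfile.gettempdir(), "test_crlf.xml")
with open(xml_path, "wb") as f:
    f.write(crlf_bytes)
```

Doing the conversion explicitly keeps the output the same on every platform, whereas text mode ("w") translates newlines only on Windows.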

from xml.dom import minidom
xmlstr = minidom.parseString(ET.tostring(root)).toprettyxml(indent="   ", encoding='UTF-8')
# toprettyxml(encoding=...) already returns an encoded byte string,
# so write it unchanged in binary mode
with open(xmlPath, "wb") as f:
    f.write(xmlstr)
    f.close()

This will probably resolve your issue without using external libraries like lxml.

Sport answered 6/6, 2018 at 5:18 Comment(1)

isn't the point of the with statement that you don't need the f.close() ? – Exuviate 27/5, 2022 at 14:7

After some struggle and reading tons of ugly code, I came up with this simple yet effective solution for writing indented XML files using the E-Factory of the lxml library.

This solution is a collection of the other solutions provided, but implemented using E-Factory for those who find it more readable and Pythonic.

from lxml import etree, builder
E = builder.ElementMaker()

the_doc = E.root(
    E.data(
        E.field1('Text...', name='field.one.name', id="field-id"),
        E.field2('Text...', name='field.two.name', id="field-id"),
        E.field3(
            E.subfield1('Text...', name='subfield.one.name', id="field-id"),
            E.subfield2('Text...', name='subfield.two.name', id="field-id"),
        ),
    ),
)

# Handling the Pretty print 
pprinted_xml = etree.tostring(the_doc, encoding='UTF-8', xml_declaration=True,
                         pretty_print=True)
# Creating the XML file
with open('test.xml', 'wb') as f:
    f.write(pprinted_xml)

Result

<?xml version='1.0' encoding='UTF-8'?>
<root>
  <data>
    <field1 name="field.one.name" id="field-id">Text...</field1>
    <field2 name="field.two.name" id="field-id">Text...</field2>
    <field3>
      <subfield1 name="subfield.one.name" id="field-id">Text...</subfield1>
      <subfield2 name="subfield.two.name" id="field-id">Text...</subfield2>
    </field3>
  </data>
</root>

Polyamide answered 9/9, 2021 at 11:28 Comment(2)

What does the E-Factory have to do with pretty-printing? – Thracophrygian 9/9, 2021 at 11:49

@Thracophrygian no one said that you should use E-Factory with pretty print. This snippet is just a recipe for writing a pretty-printed XML file while keeping the version and encoding declaration. However, I find E-Factory better and more readable (because of its nested structure), especially since the asker says in the question: "but I don't find this to be the most pythonesque method". – Catamite 9/9, 2021 at 11:57
