Write .xml in Python with pretty print and encoding declaration

I have to create an .xml file that has pretty print and also the encoding declaration. It should look like this:

<?xml version='1.0' encoding='utf-8'?>
<main>
    <sub>
        <name>Ana</name>
        <detail />
        <type>smart</type>
    </sub>
</main>

I know how to get the pretty print and the declaration, but not at the same time. To obtain the UTF-8 declaration, but no pretty print, I use the code below:

f = open(xmlPath, "w")
et.write(f, encoding='utf-8', xml_declaration=True) 
f.close()

But if I want to get the pretty print, I have to convert the xml tree into string, and I will lose the declaration. I use this code:

from xml.dom import minidom
xmlstr = minidom.parseString(ET.tostring(root)).toprettyxml(indent="   ")
with open(xmlPath, "w") as f:
    f.write(xmlstr.encode('utf-8'))
    f.close()

With this last code, I get the pretty print, only that the first row is:

<?xml version="1.0" ?>

I might just as well replace this with

<?xml version='1.0' encoding='utf-8'?>

but I don't find this to be the most pythonesque method.

I use the xml module and I prefer not to install extra modules because the script has to be run from various computers with standard Python. But if it's not possible, I will install other modules.

Later Edit:

In the end, with Lenz's help, I use this:

#ET=lxml.etree
xmlPath=os.path.join(output_folderXML ,"test.xml")
xmlstr= ET.tostring(root, encoding='UTF-8', xml_declaration=True, pretty_print=True)
with open(xmlPath, "w") as f:
    f.write(xmlstr)
    f.close()

I need to know whether it is safe to write the result of the "tostring" method to the .xml file in "w" mode rather than "wb". As I said in one of the comments below, with "wb" I don't get the pretty print when I open the xml file in Notepad, but with "w" I do. I have also checked the xml file written in "w" mode, and special characters like "ü" are there. I only need a competent opinion that what I do is technically OK.

Revisionist answered 27/11, 2017 at 20:13 Comment(0)

The most elegant solution is certainly using the third-party library lxml, which is being used a lot – for good reasons. It offers both a pretty_print and an xml_declaration parameter in the tostring() method, so you get both. And the API is quite close to that of the std-lib ElementTree, which you seem to be using now. Here's an example:

>>> from lxml import etree
>>> doc = etree.parse(xmlPath)
>>> print(etree.tostring(doc, encoding='UTF-8', xml_declaration=True,
...                      pretty_print=True).decode('UTF-8'))
<?xml version='1.0' encoding='UTF-8'?>
<main>
  <sub>
    <name>Ana</name>
    <detail/>
    <type>smart</type>
  </sub>
</main>

However, I understand your desire to use the "included batteries" only. As far as I can see, xml.etree.ElementTree has no means of changing the indentation automatically. But the minidom work-around has a solution to getting both pretty-printing and a full declaration: use the encoding parameter of the toprettyxml() method!

>>> doc = minidom.parseString(ET.tostring(root))
>>> print(doc.toprettyxml(encoding='utf8').decode('utf8'))
<?xml version="1.0" encoding="utf8"?>
<main>
    <sub>
        <name>Ana</name>
        <detail/>
        <type>smart</type>
    </sub>
</main>

(Be aware that the returned string is already encoded and that you should write it to a file opened in binary mode ("wb") and without further encoding.)
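For illustration, here is a minimal end-to-end sketch of the minidom approach in Python 3, using only the standard library. The element names come from the question; the helper name `to_pretty_bytes` and the output path are hypothetical and only serve the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from xml.dom import minidom


def to_pretty_bytes(root, encoding="utf-8"):
    """Serialise an ElementTree element to indented, encoded bytes."""
    # ET.tostring() returns bytes by default; minidom re-parses them.
    raw = ET.tostring(root)
    # Passing `encoding` makes toprettyxml() return bytes and puts the
    # encoding into the XML declaration.
    return minidom.parseString(raw).toprettyxml(indent="    ", encoding=encoding)


# Build the document from the question.
root = ET.Element("main")
sub = ET.SubElement(root, "sub")
ET.SubElement(sub, "name").text = "Ana"
ET.SubElement(sub, "detail")
ET.SubElement(sub, "type").text = "smart"

pretty = to_pretty_bytes(root)

# The result is already encoded, so the file is opened in binary mode.
xml_path = os.path.join(tempfile.gettempdir(), "test.xml")  # hypothetical path
with open(xml_path, "wb") as f:
    f.write(pretty)
```

Because the data is written as bytes, no platform-dependent newline translation or implicit re-encoding takes place.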

Thracophrygian answered 27/11, 2017 at 22:48 Comment(5)
With your solution I got the desired result, although for me it works only if I use 'utf8' instead of 'utf-8'. I would also appreciate it if you could add the lxml method or give me a link to it. - Revisionist
@Revisionist I added an example using lxml. Concerning the spelling of the encoding, I can't see a difference for using "utf-8" or "utf8". However, toprettyxml() is confused when the input string contains newlines and inserts blank lines. - Thracophrygian
I have another problem now. I eventually decided to use the lxml module. I have two variants: toXml = ET.ElementTree(root); toXml.write(xmlPath, xml_declaration=True, encoding='utf-8', pretty_print=True), or your method of opening a file in 'wb' mode and writing the string there. The problem is that when I open the xml in (basic) Notepad, the text is not prettified. In a browser or in Notepad++ it looks OK, but I need the pretty print in Notepad. I noticed that if I write the string to a file in "w" mode (not "wb"), the text looks OK in Notepad as well. Is it so important to use the "wb" mode? - Revisionist
Well, I guess the problem is with the newlines then: some Windows programs refuse to recognise Unix newlines (LF "line feed", Python's "\n"), and require two characters per line break (CR+LF, "\r\n"). In text mode on Windows, Python substitutes "\n" with "\r\n" when writing; that's why it happens with "w" mode. However, you might run into encoding problems eventually if you write bytestrings in text mode, and it won't work in Python 3 (you'll have to switch too, at some point). I'd suggest you do xmlstr.replace(b'\n', b'\r\n') on the serialised document before writing. - Thracophrygian
Thank you so much! Writing in "wb" mode and doing xmlstr.replace(b'\n', b'\r\n') does indeed make the xml pretty when opened in Notepad! - Revisionist
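The newline fix from the comments above can be sketched as follows, assuming the serialised document is a bytes object. The `xmlstr` value is a hand-written stand-in for the output of tostring(), and the helper name `to_crlf` and the output path are hypothetical.

```python
import os
import tempfile


def to_crlf(data):
    """Convert LF line breaks to CRLF in a bytes object.

    Existing CRLF pairs are normalised to LF first so they are not doubled.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


# Stand-in for the bytes returned by tostring(..., pretty_print=True);
# b"\xc3\xbc" is the UTF-8 encoding of "ü".
xmlstr = (
    b"<?xml version='1.0' encoding='UTF-8'?>\n"
    b"<main>\n"
    b"  <name>J\xc3\xbcrgen</name>\n"
    b"</main>\n"
)

crlf = to_crlf(xmlstr)

# Binary mode writes the bytes unchanged on every platform.
xml_path = os.path.join(tempfile.gettempdir(), "test_crlf.xml")  # hypothetical path
with open(xml_path, "wb") as f:
    f.write(crlf)
```

In Python 3, an alternative would be to decode the bytes and open the file in text mode with an explicit encoding and newline="\r\n", which leaves the newline translation to open().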
from xml.dom import minidom
xmlstr = minidom.parseString(ET.tostring(root)).toprettyxml(indent="   ", encoding='UTF-8')
with open(xmlPath, "w") as f:
    f.write(str(xmlstr.decode('UTF-8')))
    f.close()

This will probably resolve your issue without using external libraries like lxml.
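On Python 3.9 and later, the standard library can also indent a tree in place with xml.etree.ElementTree.indent(), which avoids the round trip through minidom. A minimal sketch, assuming that Python version is available; the element names are taken from the question and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Build the document from the question.
root = ET.Element("main")
sub = ET.SubElement(root, "sub")
ET.SubElement(sub, "name").text = "Ana"
ET.SubElement(sub, "detail")
ET.SubElement(sub, "type").text = "smart"

# indent() (Python 3.9+) inserts whitespace into the tree in place.
ET.indent(root, space="    ")

# With an explicit encoding, tostring() returns bytes; xml_declaration=True
# (Python 3.8+) prepends the declaration.
data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

As with the other approaches, `data` is a bytes object and would be written to a file opened in "wb" mode.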

Sport answered 6/6, 2018 at 5:18 Comment(1)
Isn't the point of the with statement that you don't need the f.close()? - Exuviate

After some struggle and reading tons of ugly code, I came up with this simple yet effective solution for writing indented XML files using the E-factory of the lxml library.

This solution combines the other solutions provided here, but is implemented using the E-factory for those who find it more readable and Pythonic.

from lxml import etree, builder
E = builder.ElementMaker()

the_doc = E.root(
        E.data(
            E.field1('Text...', name='field.one.name', id="field-id"),
            E.field2('Text...', name='field.two.name', id="field-id"),
            E.field3(
                E.subfield1('Text...', name='subfield.one.name', id="field-id"),
                E.subfield2('Text...', name='subfield.two.name', id="field-id"),
            )
            )
        )

# Handling the Pretty print 
pprinted_xml = etree.tostring(the_doc, encoding='UTF-8', xml_declaration=True,
                         pretty_print=True)
# Creating the XML file
with open('test.xml', 'wb') as f:
    f.write(pprinted_xml)

Result

<?xml version='1.0' encoding='UTF-8'?>
<root>
  <data>
    <field1 name="field.one.name" id="field-id">Text...</field1>
    <field2 name="field.two.name" id="field-id">Text...</field2>
    <field3>
      <subfield1 name="subfield.one.name" id="field-id">Text...</subfield1>
      <subfield2 name="subfield.two.name" id="field-id">Text...</subfield2>
    </field3>
  </data>
</root>
Polyamide answered 9/9, 2021 at 11:28 Comment(2)
What does the E-Factory have to do with pretty-printing? - Thracophrygian
@Thracophrygian No one said that you should use the E-Factory with pretty print. This snippet is just a recipe for writing a pretty-printed xml file while keeping the version and encoding declaration. However, I found using the E-Factory better and more readable (because of its nesting nature), especially because the user says in the question: "but I don't find this to be the most pythonesque method." - Catamite
