How do I get Python's ElementTree to pretty print to an XML file?
Asked Answered
E

8

86

Background

I am using SQLite to access a database and retrieve the desired information. I'm using ElementTree in Python version 2.6 to create an XML file with that information.

Code

import sqlite3
import xml.etree.ElementTree as ET

# NOTE: Omitted code where I acccess the database,
# pull data, and add elements to the tree

tree = ET.ElementTree(root)

# Pretty printing to Python shell for testing purposes
from xml.dom import minidom
print minidom.parseString(ET.tostring(root)).toprettyxml(indent = "   ")

#######  Here lies my problem  #######
tree.write("New_Database.xml")

Attempts

I've tried using tree.write("New_Database.xml", "utf-8") in place of the last line of code above, but it did not edit the XML's layout at all - it's still a jumbled mess.

I also decided to fiddle around and tried doing:
tree = minidom.parseString(ET.tostring(root)).toprettyxml(indent = " ")
instead of printing this to the Python shell, which gives the error AttributeError: 'unicode' object has no attribute 'write'.

Questions

When I write my tree to an XML file on the last line, is there a way to pretty print to the XML file as it does to the Python shell?

Can I use toprettyxml() here or is there a different way to do this?

Ebberta answered 2/3, 2015 at 15:48 Comment(1)
Related: Use xml.etree.elementtree to print nicely formatted xml filesCastiron
D
88

Whatever your XML string is, you can write it to the file of your choice by opening a file for writing and writing the string to the file.

from xml.dom import minidom

xmlstr = minidom.parseString(ET.tostring(root)).toprettyxml(indent="   ")
with open("New_Database.xml", "w") as f:
    f.write(xmlstr)

There is one possible complication, especially in Python 2, which is both less strict and less sophisticated about Unicode characters in strings. If your toprettyxml method hands back a Unicode string (u"something"), then you may want to cast it to a suitable file encoding, such as UTF-8. E.g. replace the one write line with:

f.write(xmlstr.encode('utf-8'))
Dissolve answered 2/3, 2015 at 15:57 Comment(8)
This answer would be clearer if you included the import xml.dom.minidom as minidom statement that appears to be required.Clotilde
@KenPronovici Possibly. That import appears in the original question, but I've added it here so there's no confusion.Dissolve
This answer is repeated so often on any kind of questions, but it is anything but a good answer: You fully need to convert the whole XML tree to a string, reparse it, to again get it printed, this time just differently. This is not a good approach. Use lxml instead and serialize directly using the builtin method provided by lxml, this way eliminating any interemediate printing followed by reparsing.Lucindalucine
This is an answer about how serialized XML gets written to file, not an endorsement of the OP's serialization strategy, which is undoubtedly Byzantine. I love lxml, but being based on C, it's not always available.Dissolve
In case one wants to use lxml might look at my answer below.Smaragd
This also uses both minidom and elementtree. It would better if it were just one or the other.Anorexia
Change .toprettyxml(indent=" ") to .toprettyxml(indent=" ", newl="\r") to remove all of the blank lines.Polychaete
If you don't want the XML version tag that minidom adds, you can change it to f.write(xmlstr.split('\n', 1)[1])Butterfish
E
112

I simply solved it with the indent() function:

xml.etree.ElementTree.indent(tree, space=" ", level=0) Appends whitespace to the subtree to indent the tree visually. This can be used to generate pretty-printed XML output. tree can be an Element or ElementTree. space is the whitespace string that will be inserted for each indentation level, two space characters by default. For indenting partial subtrees inside of an already indented tree, pass the initial indentation level as level.

tree = ET.ElementTree(root)
ET.indent(tree, space="\t", level=0)
tree.write(file_name, encoding="utf-8")

Note, the indent() function was added in Python 3.9.

Exophthalmos answered 2/8, 2021 at 7:43 Comment(5)
It should be mentioned that the indent() function was added in Python 3.9.Gray
You are the person. The very person. This is overwhelmingly the best answer.Solutrean
Note that @Tatarize’s answer has effectively a polyfill for this that works on older Python versions.Fixing
I am also overwhelmed with appreciation for this answer.Irish
use space=" " if you want 4 spaces (there are 4 spaces between the quotes)Unsuspecting
D
88

Whatever your XML string is, you can write it to the file of your choice by opening a file for writing and writing the string to the file.

from xml.dom import minidom

xmlstr = minidom.parseString(ET.tostring(root)).toprettyxml(indent="   ")
with open("New_Database.xml", "w") as f:
    f.write(xmlstr)

There is one possible complication, especially in Python 2, which is both less strict and less sophisticated about Unicode characters in strings. If your toprettyxml method hands back a Unicode string (u"something"), then you may want to cast it to a suitable file encoding, such as UTF-8. E.g. replace the one write line with:

f.write(xmlstr.encode('utf-8'))
Dissolve answered 2/3, 2015 at 15:57 Comment(8)
This answer would be clearer if you included the import xml.dom.minidom as minidom statement that appears to be required.Clotilde
@KenPronovici Possibly. That import appears in the original question, but I've added it here so there's no confusion.Dissolve
This answer is repeated so often on any kind of questions, but it is anything but a good answer: You fully need to convert the whole XML tree to a string, reparse it, to again get it printed, this time just differently. This is not a good approach. Use lxml instead and serialize directly using the builtin method provided by lxml, this way eliminating any interemediate printing followed by reparsing.Lucindalucine
This is an answer about how serialized XML gets written to file, not an endorsement of the OP's serialization strategy, which is undoubtedly Byzantine. I love lxml, but being based on C, it's not always available.Dissolve
In case one wants to use lxml might look at my answer below.Smaragd
This also uses both minidom and elementtree. It would better if it were just one or the other.Anorexia
Change .toprettyxml(indent=" ") to .toprettyxml(indent=" ", newl="\r") to remove all of the blank lines.Polychaete
If you don't want the XML version tag that minidom adds, you can change it to f.write(xmlstr.split('\n', 1)[1])Butterfish
B
17

Riffing on Ben Anderson answer as a function.

def _pretty_print(current, parent=None, index=-1, depth=0):
    for i, node in enumerate(current):
        _pretty_print(node, current, i, depth + 1)
    if parent is not None:
        if index == 0:
            parent.text = '\n' + ('\t' * depth)
        else:
            parent[index - 1].tail = '\n' + ('\t' * depth)
        if index == len(parent) - 1:
            current.tail = '\n' + ('\t' * (depth - 1))

So running the test on unpretty data:

import xml.etree.ElementTree as ET
root = ET.fromstring('''<?xml version='1.0' encoding='utf-8'?>
<root>
    <data version="1"><data>76939</data>
</data><data version="2">
        <data>266720</data><newdata>3569</newdata>
    </data> <!--root[-1].tail-->
    <data version="3"> <!--addElement's text-->
<data>5431</data> <!--newData's tail-->
    </data> <!--addElement's tail-->
</root>
''')
_pretty_print(root)

tree = ET.ElementTree(root)
tree.write("pretty.xml")
with open("pretty.xml", 'r') as f:
    print(f.read())

We get:

<root>
    <data version="1">
        <data>76939</data>
    </data>
    <data version="2">
        <data>266720</data>
        <newdata>3569</newdata>
    </data>
    <data version="3">
        <data>5431</data>
    </data>
</root>
Bothersome answered 20/1, 2021 at 11:7 Comment(3)
This solution has a couple of nice traits compared to other answers. It does not require additional libraries; it works on pre-3.9 Python, and it is very explicit about what whitespace it adds where to the tree (which really helps in understanding the issue at hand). Oh, and it generated byte-for-byte identical XML to my hand crafted reference file :-)Staford
Doesn't work correctly if there is an empty tagDextro
I assume if current is None would cover that possibility.Bothersome
I
15

I found a way using straight ElementTree, but it is rather complex.

ElementTree has functions that edit the text and tail of elements, for example, element.text="text" and element.tail="tail". You have to use these in a specific way to get things to line up, so make sure you know your escape characters.

As a basic example:

I have the following file:

<?xml version='1.0' encoding='utf-8'?>
<root>
    <data version="1">
        <data>76939</data>
    </data>
    <data version="2">
        <data>266720</data>
        <newdata>3569</newdata>
    </data>
</root>

To place a third element in and keep it pretty, you need the following code:

addElement = ET.Element("data")             # Make a new element
addElement.set("version", "3")              # Set the element's attribute
addElement.tail = "\n"                      # Edit the element's tail
addElement.text = "\n\t\t"                  # Edit the element's text
newData = ET.SubElement(addElement, "data") # Make a subelement and attach it to our element
newData.tail = "\n\t"                       # Edit the subelement's tail
newData.text = "5431"                       # Edit the subelement's text
root[-1].tail = "\n\t"                      # Edit the previous element's tail, so that our new element is properly placed
root.append(addElement)                     # Add the element to the tree.

To indent the internal tags (like the internal data tag), you have to add it to the text of the parent element. If you want to indent anything after an element (usually after subelements), you put it in the tail.

This code give the following result when you write it to a file:

<?xml version='1.0' encoding='utf-8'?>
<root>
    <data version="1">
        <data>76939</data>
    </data>
    <data version="2">
        <data>266720</data>
        <newdata>3569</newdata>
    </data> <!--root[-1].tail-->
    <data version="3"> <!--addElement's text-->
        <data>5431</data> <!--newData's tail-->
    </data> <!--addElement's tail-->
</root>

As another note, if you wish to make the program uniformally use \t, you may want to parse the file as a string first, and replace all of the spaces for indentations with \t.

This code was made in Python3.7, but still works in Python2.7.

Imelda answered 13/6, 2019 at 21:28 Comment(3)
It would be nice if you would not have to indent it manually.Isogonic
Bravo! This is dedication!Geyserite
@Isogonic I posted an answer using the same method as a function call for the tree.Bothersome
R
10

Install bs4

pip install bs4

Use this code to pretty print:

from bs4 import BeautifulSoup

x = your xml

print(BeautifulSoup(x, "xml").prettify())
Ruisdael answered 18/6, 2016 at 3:33 Comment(3)
This is a good solution for when we don't want to write the XML to a file.Hackney
I get an error when I try this "Couldn't find a tree builder with the features you requested: xml. Do you need to install a parser library?" I have valid XML in string format. To I need something more?Kayleigh
@Tim, you need to install a parser library, e.g. lxml, html5lib, with the usual pip, brew, conda approach you use.Rapacious
S
8

If one wants to use lxml, it could be done in the following way:

from lxml import etree

xml_object = etree.tostring(root,
                            pretty_print=True,
                            xml_declaration=True,
                            encoding='UTF-8')

with open("xmlfile.xml", "wb") as writter:
    writter.write(xml_object)`

If you see xml namespaces e.g. py:pytype="TREE", one might want to add before the creation of xml_object

etree.cleanup_namespaces(root) 

This should be sufficient for any adaptation in your code.

Smaragd answered 23/4, 2018 at 10:22 Comment(4)
Tried this, but the root has to be a part of lxml and not ETtreeSerg
@ManabuTokunaga, I am not entirely sure what do you mean. I believe I tested it with both objectify and etree. I will double check when I have a chance but, it will be good to clarify how you create a root object straight from lxml.Smaragd
Let me see if I can generate an isolated case. But point was that I had a root based on import xml.etree.ElementTree as ETree and I had some error message when I tried your suggestion.Serg
@ManabuTokunaga is correct - the ETree root is of type xml.etree.ElementTree.Element but the lxml root is of type lxml.etree._Element - totally different types. Also with Python 3.8 and using lxml I had to add: xmlstr = xmlstr.decode("utf-8") after tostringConsult
D
2

One liner(*) to read, parse (once) and pretty print XML from file named fname:

from xml.dom import minidom
print(minidom.parseString(open(fname).read()).toprettyxml(indent="  "))

(* not counting import)

Demetriusdemeyer answered 11/5, 2022 at 15:22 Comment(4)
why is this not right answer?Phenolphthalein
Writing another file may be unnecessary and comment does not show how to get the string from the ElementTree. Also its superfluous to use the (broken) minidom parser. See for example this bug bugs.python.org/issue23847Merimerida
Yeah, aware of the minidom issues, but not many options when running on systems without modern pythons or ability to install and use libraries. @Merimerida not sure what "Writing another file may be unnecessary" means, this code only prints to stdoutDemetriusdemeyer
"without modern pythons" Would be nice to attach the version compatibility then, as ETree does not have broken pretty print. "not sure what "Writing another file may be unnecessary" means, this code only prints to stdout" My point is that writing to a variable is the same as writing to a file unless utf8/utf16 encoding is broken.Merimerida
G
2

Similar to Rafal.Py's solution but doesn't modify the input and returns the XML as formatted string:

def prettyPrint(element):
    encoding = 'UTF-8'
    # Create a copy of the input element: Convert to string, then parse again
    copy = ET.fromstring(ET.tostring(element))
    # Format copy. This needs Python 3.9+
    ET.indent(copy, space="    ", level=0)
    # tostring() returns a binary, so we need to decode it to get a string
    return ET.tostring(copy, encoding=encoding).decode(encoding)

If you need a file, replace the last line with with copy.write(...) to avoid the extra overhead.

Galton answered 24/7, 2022 at 19:24 Comment(5)
In what way does your answer improve upon Rafal's ET.indent() one, which was posted one year earlier?Sortie
@Sortie His answer didn't work with Python 3.9 anymore.Galton
Really? How is this possible when he posted an answer that uses ET.indent() which was introduced in Python 3.9 ... I just tested his original edit under Python 3.11 and it works fine.Sortie
@Sortie Okay, I don't remember exactly but it took my quite some time to get this code working. My version doesn't modify the input. This becomes important when you try to log subtrees. Also, he just writes to a file. I struggled a lot to get the new XML as string.Galton
@AaronDigulla you should add your comment "doesn't modify the input" and "get XML as a string" to your answerDemetriusdemeyer

© 2022 - 2024 — McMap. All rights reserved.