Python pretty XML printer with lxml
R

6

37

After reading from an existing file with 'ugly' XML and doing some modifications, pretty printing doesn't work. I've tried etree.write(FILE_NAME, pretty_print=True).

I have the following XML:

<testsuites tests="14" failures="0" disabled="0" errors="0" time="0.306" name="AllTests">
    <testsuite name="AIR" tests="14" failures="0" disabled="0" errors="0" time="0.306">
....

And I use it like this:

tree = etree.parse('original.xml')
root = tree.getroot()

...    
# modifications
...

with open(FILE_NAME, "w") as f:
    tree.write(f, pretty_print=True)
Rough answered 23/2, 2011 at 4:14 Comment(1)
There is a built-in indent() function since lxml 4.5.0. https://mcmap.net/q/425812/-changing-the-default-indentation-of-etree-tostring-in-lxml - Rodomontade
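As a rough illustration of the built-in function mentioned in this comment, here is a minimal sketch. It assumes lxml 4.5.0 or newer; the sample document and the four-space indent are chosen only for the example.

```python
from lxml import etree

# Sample document invented for this sketch.
root = etree.fromstring("<xml><body><leaf1/><leaf2/></body></xml>")

# indent() rewrites whitespace-only text/tail values in place.
etree.indent(root, space="    ")

result = etree.tostring(root, encoding="unicode")
print(result)
```

Because indent() overwrites whitespace-only text and tail values, it should also work on a tree that was parsed without remove_blank_text=True.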
P
80

For me, this issue was not solved until I noticed this little tidbit here:

http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output

Short version:

Read in the file with this command:

>>> parser = etree.XMLParser(remove_blank_text=True)
>>> tree = etree.parse(filename, parser)

That will "reset" the already existing indentation, allowing the output to generate its own indentation correctly. Then pretty_print as normal:

>>> tree.write(<output_file_name>, pretty_print=True)
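Put together, the two steps above might look like the following self-contained sketch. The "ugly" input string is made up for the example (loosely modelled on the XML in the question), and tostring() is used instead of writing a file so the result can be printed directly.

```python
from lxml import etree

# Made-up, inconsistently indented input for this sketch.
ugly = (
    b'<testsuites tests="14">\n'
    b'      <testsuite name="AIR" tests="14"><testcase name="a"/>\n'
    b'</testsuite></testsuites>'
)

# Drop the whitespace-only text nodes while parsing ...
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(ugly, parser)

# ... so that pretty_print is free to insert its own indentation.
pretty = etree.tostring(root, pretty_print=True, encoding="unicode")
print(pretty)
```

Without the custom parser, the existing whitespace between the tags is kept as text, and lxml leaves such elements alone when pretty printing.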
Plasticity answered 8/3, 2012 at 3:24 Comment(0)
O
18

Well, according to the API docs, there is no method "write" in the lxml etree module. You have a couple of options for getting a pretty-printed XML string into a file. You can use the tostring method like so:

# tostring() returns bytes by default, so open the file in binary mode
with open('doc.xml', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))

Or, if your input source is less than perfect and/or you want more knobs and buttons to configure your output, you could use one of the Python wrappers for the tidy lib.

http://utidylib.berlios.de/

import tidy
f.write(tidy.parseString(your_xml_str, **{'output_xml':1, 'indent':1, 'input_xml':1}))

http://countergram.com/open-source/pytidylib

from tidylib import tidy_document
document, errors = tidy_document(your_xml_str, options={'output_xml':1, 'indent':1, 'input_xml':1})
f.write(document)
Oller answered 23/2, 2011 at 5:54 Comment(1)
That's because the write method is on the _ElementTree class, here: lxml.de/api/lxml.etree._ElementTree-class.html#write - Moncear
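To illustrate the point made in this comment, here is a small sketch that calls write() on a tree object rather than on the etree module. The sample document and the temporary output path are assumptions made for the example.

```python
import os
import tempfile

from lxml import etree

root = etree.fromstring("<root><child/></root>")
tree = etree.ElementTree(root)  # wrap the element in an _ElementTree

# Hypothetical output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "doc.xml")

# write() is a method of the tree object, not of the etree module.
tree.write(out_path, pretty_print=True, xml_declaration=True, encoding="utf-8")

with open(out_path, "rb") as f:
    written = f.read()
print(written.decode("utf-8"))
```

etree.parse() already returns such a tree object, which is why tree.write(...) works in the question's code.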
B
7

Here is an answer that is fixed to work with Python 3:

from lxml import etree
from sys import stdout
from io import BytesIO

parser = etree.XMLParser(remove_blank_text=True)
file_obj = BytesIO(text)
tree = etree.parse(file_obj, parser)
tree.write(stdout.buffer, pretty_print=True)

Here, text is the XML document as a bytes object.
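The bytes-versus-text distinction behind this answer can be seen in a short sketch. The tiny sample document is invented for the example; by default tostring() returns bytes, while encoding="unicode" asks for a str.

```python
from lxml import etree

root = etree.fromstring(b"<a><b/></a>")  # invented sample input

as_bytes = etree.tostring(root, pretty_print=True)                     # bytes
as_text = etree.tostring(root, pretty_print=True, encoding="unicode")  # str

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```

Bytes output belongs in a binary stream such as stdout.buffer or a file opened with "wb"; the str variant can go to a text-mode file or to print().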

Batavia answered 16/7, 2018 at 22:46 Comment(0)
A
6
# file() exists only in Python 2; open() works in both Python 2 and 3
with open('out.txt', 'w') as fp:
    print(e.tree.tostring(...), file=fp)
Amoebocyte answered 23/2, 2011 at 4:19 Comment(1)
What is e.tree? - Hanhhank
V
0

I am not sure why other answers did not mention this. If you want to obtain the root of the xml there is a method called getroot(). I hope I answered your question (though a little late).

tree = et.parse(xmlFile)
root = tree.getroot()
Video answered 3/1, 2013 at 3:48 Comment(0)
D
0

Of course, pretty printing with lxml.etree is possible.

In my case, the old trick with remove_blank_text=True and pretty_print=True did not work as I expected (it was too delicate), so I decided to write my own.

Here it is: a modern, forceful, pure-Python way to correct the indentation of an lxml.etree.Element tree. It gives a nicely prettified XML string:

from typing import Optional

import lxml.etree


def indent_lxml(element: lxml.etree.Element, level: int = 0, is_last_child: bool = True) -> None:
    space = "    "
    indent_str = "\n" + level * space

    element.text = strip_or_null(element.text)
    if element.text:
        element.text = f"{indent_str}{space}{element.text}"

    num_children = len(element)
    if num_children:
        element.text = f"{element.text or ''}{indent_str}{space}"

        for index, child in enumerate(element.iterchildren()):
            is_last = index == num_children - 1
            indent_lxml(child, level + 1, is_last)

    elif element.text:
        element.text += indent_str

    tail_level = max(0, level - 1) if is_last_child else level
    tail_indent = "\n" + tail_level * space
    tail = strip_or_null(element.tail)
    element.tail = f"{indent_str}{tail}{tail_indent}" if tail else tail_indent


def strip_or_null(text: Optional[str]) -> Optional[str]:
    if text is not None:
        return text.strip() or None

It is reasonably fast, because it allocates no additional structures in memory and visits each node only once while traversing the tree, which gives O(N) time complexity, the best possible.

It rearranges all the existing indentation "in place" in the tree (the DOM) by correcting the contents of the Element.text and Element.tail attributes (it affects whitespace only).
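For readers unfamiliar with where lxml keeps indentation, the following sketch shows the whitespace stored in .text and .tail. The sample document is invented for the example and is independent of the indent_lxml function above.

```python
from lxml import etree

# Invented sample with two-space indentation already present.
root = etree.fromstring("<a>\n  <b/>\n  <c/>\n</a>")

# Whitespace before the first child lives in the parent's .text ...
parent_text = root.text
# ... and whitespace after each child lives in that child's .tail.
first_tail = root[0].tail
last_tail = root[1].tail

print(repr(parent_text), repr(first_tail), repr(last_tail))
```

Re-indenting a tree therefore comes down to rewriting these whitespace-only strings, which is what both this answer's function and the built-in indent() do.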

Naturally, it also can be used with HTML parsed by lxml.

To use it, do something like this:

root = lxml.etree.parse("path/to/the_file.xml").getroot()
# or
root = lxml.etree.fromstring("<xml><body><leaf1/><leaf2/></body></xml>")

indent_lxml(root)  # corrects indentation "in place"

result = lxml.etree.tostring(root, encoding="unicode")
print(result)

Which prints:

<xml>
    <body>
        <leaf1/>
        <leaf2/>
    </body>
</xml>
Doornail answered 16/7, 2022 at 20:20 Comment(2)
Does this solution provide anything more than the built-in indent() function? lxml.de/apidoc/lxml.etree.html#lxml.etree.indent - Rodomontade
Yes, the output of this one is prettier. IMO the result of indent() is disappointing. - Doornail