Problem with newlines when I use toprettyxml()
Asked Answered
P

8

18

I'm currently using the toprettyxml() function of the xml.dom module in a Python script and I'm having some trouble with the newlines. If don't use the newl parameter or if I use toprettyxml(newl='\n') it displays several newlines instead of only one.

For instance

f = open(filename, 'w')
f.write(dom1.toprettyxml(encoding='UTF-8'))
f.close()

displayed:

<params>


    <param name="Level" value="#LEVEL#"/>


    <param name="Code" value="281"/>


</params>

Does anyone know where the problem comes from and how I can use it? FYI I'm using Python 2.6.1

Prefabricate answered 2/11, 2009 at 16:40 Comment(0)
C
14

toprettyxml() is quite awful. It is not a matter of Windows and '\r\n'. Trying any string as the newlparameter shows that too many lines are being added. Not only that, but other blanks (that may cause you problems when a machine reads the xml) are also added.

Some workarounds available at
http://ronrothman.com/public/leftbraned/xml-dom-minidom-toprettyxml-and-silly-whitespace

Clot answered 16/3, 2010 at 14:40 Comment(4)
thanks a lot Xv ! Indeed, now, I'm trying to use toprettyxml() as few as possible but it's good to know there is a workaround for this annoying issue. And the post is very clearPrefabricate
from xml.dom.ext import PrettyPrint from StringIO import StringIO def toprettyxml_fixed (node, encoding='utf-8'): tmpStream = StringIO() PrettyPrint(node, stream=tmpStream, encoding=encoding) return tmpStream.getvalue()Seko
It would be helpful to link the code, in case your website goes down etc.Seko
xml.dom.ext was not added to Python libstd. It is part of custom distributions, probably to fix this awfulness.Seko
P
20

I found another great solution :

f = open(filename, 'w')
dom_string = dom1.toprettyxml(encoding='UTF-8')
dom_string = os.linesep.join([s for s in dom_string.splitlines() if s.strip()])
f.write(dom_string)
f.close()

Above solution basically removes the unwanted newlines from the dom_string which are generated by toprettyxml().

Inputs taken from -> What's a quick one-liner to remove empty lines from a python string?

Puerperal answered 11/10, 2016 at 18:40 Comment(4)
For python3, it needs to be dom_string = b'\n'.join([s for s in dom_string.splitlines() if s.strip()])Mcabee
This is the only solution I've found that doesn't require a third-party library, nice work.Waal
this particular code does not work, since excplicit encoding argument makes the function to return bytes, not stringDesalinate
Here it is an updated version that work with Python 3.10 '\n'.join([s for s in dom_string.decode().splitlines() if s.strip()]) The answer should be edited.Irrefutable
C
14

toprettyxml() is quite awful. It is not a matter of Windows and '\r\n'. Trying any string as the newlparameter shows that too many lines are being added. Not only that, but other blanks (that may cause you problems when a machine reads the xml) are also added.

Some workarounds available at
http://ronrothman.com/public/leftbraned/xml-dom-minidom-toprettyxml-and-silly-whitespace

Clot answered 16/3, 2010 at 14:40 Comment(4)
thanks a lot Xv ! Indeed, now, I'm trying to use toprettyxml() as few as possible but it's good to know there is a workaround for this annoying issue. And the post is very clearPrefabricate
from xml.dom.ext import PrettyPrint from StringIO import StringIO def toprettyxml_fixed (node, encoding='utf-8'): tmpStream = StringIO() PrettyPrint(node, stream=tmpStream, encoding=encoding) return tmpStream.getvalue()Seko
It would be helpful to link the code, in case your website goes down etc.Seko
xml.dom.ext was not added to Python libstd. It is part of custom distributions, probably to fix this awfulness.Seko
K
5

toprettyxml(newl='') works for me on Windows.

Karriekarry answered 22/4, 2014 at 15:42 Comment(2)
Work on Ubuntu 16.04 (bash) tooSweat
This works on every line except the first one, it seems to strip the newline and combine line 1 and 2...Waal
N
4

This is a pretty old question but I guess I know what the problem is:

Minidoms pretty print has a pretty straight forward method. It just adds the characters that you specified as arguments. That means, it will duplicate the characters if they already exist.

E.g. if you parse an XML file that looks like this:

<parent>
   <child>
      Some text
   </child>
</parent>

there are already newline characters and indentions within the dom. Those are taken as text nodes by minidom and are still there when you parse it it into a dom object.

If you now proceed to convert the dom object into an XML string, those text nodes will still be there. Meaning new line characters and indent tabs are still remaining. Using pretty print now, will just add more new lines and more tabs. That's why in this case not using pretty print at all or specifying newl='' will result in the wanted output.

However, you generate the dom in your script, the text nodes will not be there, therefore pretty printing with newl='\r\n' and/or addindent='\t' will turn out quite pretty.

TL;DR Indents and newlines remain from parsing and pretty print just adds more

Neutron answered 20/9, 2017 at 9:55 Comment(0)
C
2

If you don't mind installing new packages, try beautifulsoup. I had very good experiences with its xml prettyfier.

Corpsman answered 30/12, 2009 at 23:20 Comment(0)
B
0

Following function worked for my problem. I had to use python 2.7 and i was not allowed to install any 3rd party additional package.

The crux of implementation is as follows:

  1. Use dom.toprettyxml()
  2. Remove all white spaces
  3. Add new lines and tabs as per your requirement.

~

import os
import re
import xml.dom.minidom
import sys

class XmlTag:
    opening = 0
    closing = 1
    self_closing = 2
    closing_tag = "</"
    self_closing_tag = "/>"
    opening_tag = "<"

def to_pretty_xml(xml_file_path):
    pretty_xml = ""
    space_or_tab_count = "  " # Add spaces or use \t
    tab_count = 0
    last_tag = -1

    dom = xml.dom.minidom.parse(xml_file_path)

    # get pretty-printed version of input file
    string_xml = dom.toprettyxml(' ', os.linesep)

    # remove version tag
    string_xml = string_xml.replace("<?xml version=\"1.0\" ?>", '')

    # remove empty lines and spaces
    string_xml = "".join(string_xml.split())

    # move each tag to new line
    string_xml = string_xml.replace('>', '>\n')

    for line in string_xml.split('\n'):
        if line.__contains__(XmlTag.closing_tag):

            # For consecutive closing tags decrease the indentation
            if last_tag == XmlTag.closing:
                tab_count = tab_count - 1

            # Move closing element to next line
            if last_tag == XmlTag.closing or last_tag == XmlTag.self_closing:
                pretty_xml = pretty_xml + '\n' + (space_or_tab_count * tab_count)

            pretty_xml = pretty_xml + line
            last_tag = XmlTag.closing

        elif line.__contains__(XmlTag.self_closing_tag):

            # Print self closing on next line with one indentation from parent node
            pretty_xml = pretty_xml + '\n' + (space_or_tab_count * (tab_count+1)) + line
            last_tag = XmlTag.self_closing

        elif line.__contains__(XmlTag.opening_tag):

            # For consecutive opening tags increase the indentation
            if last_tag == XmlTag.opening:
                tab_count = tab_count + 1

            # Move opening element to next line
            if last_tag == XmlTag.opening or last_tag == XmlTag.closing:
                pretty_xml = pretty_xml + '\n' + (space_or_tab_count * tab_count)

            pretty_xml = pretty_xml + line
            last_tag = XmlTag.opening

    return pretty_xml

pretty_xml = to_pretty_xml("simple.xml")

with open("pretty.xml", 'w') as f:
    f.write(pretty_xml)
Baronetage answered 11/3, 2019 at 11:33 Comment(0)
A
0

This gives me nice XML on Python 3.6, haven't tried on Windows:

dom = xml.dom.minidom.parseString(xml_string)

pretty_xml_as_string = dom.toprettyxml(newl='').replace("\n\n", "\n")
Arianearianie answered 12/11, 2020 at 5:40 Comment(0)
A
-1

Are you viewing the resulting file on Windows? If so, try using toprettyxml(newl='\r\n').

Apocryphal answered 2/11, 2009 at 17:57 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.