Remove whitespaces in XML string
Asked Answered
T

8

31

How can I remove the whitespaces and line breaks in an XML string in Python 2.6? I tried the following packages:

etree: This snippet keeps the original whitespaces:

xmlStr = '''<root>
    <head></head>
    <content></content>
</root>'''

xmlElement = xml.etree.ElementTree.XML(xmlStr)
xmlStr = xml.etree.ElementTree.tostring(xmlElement, 'UTF-8')
print xmlStr

I can not use Python 2.7 which would provide the method parameter.

minidom: just the same:

xmlDocument = xml.dom.minidom.parseString(xmlStr)
xmlStr = xmlDocument.toprettyxml(indent='', newl='', encoding='UTF-8')
Triglyceride answered 22/7, 2010 at 15:34 Comment(1)
This may help using lxml to remove all blank lines and white-spaces from text node https://mcmap.net/q/470445/-python-how-to-strip-white-spaces-from-xml-text-nodesClamshell
F
50

The easiest solution is probably using lxml, where you can set a parser option to ignore white space between elements:

>>> from lxml import etree
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> xml_str = '''<root>
>>>     <head></head>
>>>     <content></content>
>>> </root>'''
>>> elem = etree.XML(xml_str, parser=parser)
>>> print etree.tostring(elem)
<root><head/><content/></root>

This will probably be enough for your needs, but some warnings to be on the safe side:

This will just remove whitespace nodes between elements, and try not to remove whitespace nodes inside elements with mixed content:

>>> elem = etree.XML('<p> spam <a>ham</a> <a>eggs</a></p>', parser=parser)
>>> print etree.tostring(elem)
<p> spam <a>ham</a> <a>eggs</a></p>

Leading or trailing whitespace from textnodes will not be removed. It will however still in some circumstances remove whitespace nodes from mixed content: if the parser has not encountered non-whitespace nodes at that level yet.

>>> elem = etree.XML('<p><a> ham</a> <a>eggs</a></p>', parser=parser)
>>> print etree.tostring(elem)
<p><a> ham</a><a>eggs</a></p>

If you don't want that, you can use xml:space="preserve", which will be respected. Another option would be using a dtd and use etree.XMLParser(load_dtd=True), where the parser will use the dtd to determine which whitespace nodes are significant or not.

Other than that, you will have to write your own code to remove the whitespace you don't want (iterating descendants, and where appropriate, set .text and .tail properties that contain only whitespace to None or empty string)

Folium answered 23/7, 2010 at 9:39 Comment(3)
I have found that, as pointed out by @Steven, some elements containing only whitespaces are not cleaned. I have used a regex to do so after the call to etree.tostring: re.sub(r'>\s+<', '><', xml_str)Plumb
Please replace etree.XML(xml_str, parser=p) with etree.XML(xml_str, parser=parser) in first snippet.Countersink
If you can use Python 3.9, there's a new function xml.etree.ElementTree.indent that can also help address this without the need for any lxml dependencies.Terrorize
Q
33

Here's something quick I came up with because I didn't want to use lxml:

from xml.dom import minidom
from xml.dom.minidom import Node

def remove_blanks(node):
    for x in node.childNodes:
        if x.nodeType == Node.TEXT_NODE:
            if x.nodeValue:
                x.nodeValue = x.nodeValue.strip()
        elif x.nodeType == Node.ELEMENT_NODE:
            remove_blanks(x)

xml = minidom.parse('file.xml')
remove_blanks(xml)
xml.normalize()
with file('file.xml', 'w') as result:
    result.write(xml.toprettyxml(indent = '  '))

Which I really only needed to re-indent the XML file with otherwise broken indentation. It doesn't respect the preserve directive, but, honestly, so do so many other software dealing with XMLs, that it's rather a funny requirement :) Also, you'd be able to easily add that sort of functionality to the code above (just check for space attribute, and don't recure if its value is 'preserve'.)

Quelpart answered 4/6, 2013 at 13:23 Comment(3)
Thanks for this - I didn't want to add lxml to my project and this worked perfectly for my needs.Baun
Awesome ! This really helpedAnalcite
This is a great answer with no external dependencies!Kirsti
D
7

Whitespace is significant within an XML document. Using whitespace for indentation is a poor use of XML, as it introduces significant data where there really is none -- and sadly, this is the norm. Any programmatic approach you take to stripping out whitespace will be, at best, a guess - you need better knowledge of what the XML is conveying to properly remove whitespace, without stepping on some piece of data's toes.

Duality answered 22/7, 2010 at 22:44 Comment(1)
Sometimes it is, sometimes it's not. Specifically, if there's a DTD or Schema it's probably not significant. See discussion at oracle.com/technetwork/articles/wang-whitespace-092897.htmlLenette
H
2
xmlStr = xmlDocument.toprettyxml(indent='\t', newl='\n', encoding='UTF-8')
fix = re.compile(r'((?<=>)(\n[\t]*)(?=[^<\t]))|(?<=[^>\t])(\n[\t]*)(?=<)')
newXmlStr = re.sub(fix, '', xmlStr )

from this source

Huei answered 30/4, 2015 at 18:12 Comment(1)
what is re? For me re is not definedSoulless
P
2

The only thing that bothers me about xml.dom.minidom's toprettyxml() is that it adds blank lines. I don't seem to get the split components, so I just wrote a simple function to remove the blank lines:

#!/usr/bin/env python

import xml.dom.minidom

# toprettyxml() without the blank lines
def prettyPrint(x):
    for line in x.toprettyxml().split('\n'):
        if not line.strip() == '':
            print line

xml_string = "<monty>\n<example>something</example>\n<python>parrot</python>\n</monty>"

# parse XML
x = xml.dom.minidom.parseString(xml_string)

# clean
prettyPrint(x)

And this is what the code outputs:

<?xml version="1.0" ?>
<monty>
        <example>something</example>
        <python>parrot</python>
</monty>

If I use toprettyxml() by itself, i.e. print(toprettyxml(x)), it adds unnecessary blank lines:

<?xml version="1.0" ?>
<monty>


        <example>something</example>


        <python>parrot</python>


</monty>
Parenthood answered 31/8, 2015 at 9:41 Comment(0)
V
-1

A little clumsy solution without lxml :-)

data = """<root>

    <head></head>    <content></content>

</root>"""

data3 = []
data2 = data.split('\n')
for x in data2:
    y = x.strip()
    if y: data3.append(y)
data4 = ''.join(data3)
data5 = data4.replace("  ","").replace("> <","><")

print data5

Output: <root><head></head><content></content></root>
Venus answered 9/11, 2012 at 20:54 Comment(0)
H
-1

If whitespace in "non-leaf" nodes is what we're trying to remove then the following function will do it (recursively if specified):

from xml.dom import Node

def stripNode(node, recurse=False):
    nodesToRemove = []
    nodeToBeStripped = False

    for childNode in node.childNodes:
        # list empty text nodes (to remove if any should be)
        if (childNode.nodeType == Node.TEXT_NODE and childNode.nodeValue.strip() == ""):
            nodesToRemove.append(childNode)

        # only remove empty text nodes if not a leaf node (i.e. a child element exists)
        if childNode.nodeType == Node.ELEMENT_NODE:
            nodeToBeStripped = True

    # remove flagged text nodes
    if nodeToBeStripped:
        for childNode in nodesToRemove:
            node.removeChild(childNode)

    # recurse if specified
    if recurse:
        for childNode in node.childNodes:
            stripNode(childNode, True)

However, Thanatos is correct. Whitespace can represent data in XML so use with caution.

Hanse answered 22/1, 2013 at 3:42 Comment(0)
M
-3
xmlStr = ' '.join(xmlStr.split()))

This puts all text in one line replacing multiple white space with single blank.

xmlStr = ''.join(xmlStr.split()))

This would remove completely space including the spaces inside the text and can not be used.

The first form could be used with risk (but that you request), for the input you gave:

xmlStr = '''<root>
    <head></head>
    <content></content>
</root>'''
xmlStr = ' '.join(xmlStr.split())
print xmlStr
""" Output:
<root> <head></head> <content></content> </root>
"""

This would be valid xml. It would need to be though checked with some kind of xml checker maybe. Are you by the way sure you want XML? Have you read the article: Python Is Not Java

Musil answered 22/7, 2010 at 15:45 Comment(2)
-1 Your suggestion will trash anything like """<v xml:space="preserve">\t\tfoo</v>"""Subzero
I'm going to have to agree with John. This doesn't preserve the XML syntax at all.Brundage

© 2022 - 2024 — McMap. All rights reserved.