How to output XML from BeautifulSoup without extraneous newlines?

I'm using Python and BeautifulSoup to parse and access elements from an XML document. I modify the values of a couple of the elements and then write the XML back into the file. The trouble is that the updated XML file contains newlines at the start and end of each XML element's text values, resulting in a file that looks like this:

<annotation>
 <folder>
  Definitiva
 </folder>
 <filename>
  armas_229.jpg
 </filename>
 <path>
  /tmp/tmpygedczp5/handgun/images/armas_229.jpg
 </path>
 <size>
  <width>
   1800
  </width>
  <height>
   1426
  </height>
  <depth>
   3
  </depth>
 </size>
 <segmented>
  0
 </segmented>
 <object>
  <name>
   handgun
  </name>
  <pose>
   Unspecified
  </pose>
  <truncated>
   0
  </truncated>
  <difficult>
   0
  </difficult>
  <bndbox>
   <xmin>
    1001
   </xmin>
   <ymin>
    549
   </ymin>
   <xmax>
    1453
   </xmax>
   <ymax>
    1147
   </ymax>
  </bndbox>
 </object>
</annotation>

Instead I'd rather have the output file look like this:

<annotation>
 <folder>Definitiva</folder>
 <filename>armas_229.jpg</filename>
 <path>/tmp/tmpygedczp5/handgun/images/armas_229.jpg</path>
 <size>
  <width>1800</width>
  <height>1426</height>
  <depth>3</depth>
 </size>
 <segmented>0</segmented>
 <object>
  <name>handgun</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <bndbox>
   <xmin>1001</xmin>
   <ymin>549</ymin>
   <xmax>1453</xmax>
   <ymax>1147</ymax>
  </bndbox>
 </object>
</annotation>

I open the file and get the "soup" like so:

    with open(pascal_xml_file_path) as pascal_file:
        pascal_contents = pascal_file.read()
    soup = BeautifulSoup(pascal_contents, "xml")

After I've completed modifying a couple of the document's values I rewrite the document back into the file using BeautifulSoup.prettify like so:

    with open(pascal_xml_file_path, "w") as pascal_file:
        pascal_file.write(soup.prettify())

My assumption is that the BeautifulSoup.prettify is adding these superfluous/gratuitous newlines by default, and there doesn't appear to be a good way to modify this behavior. Have I missed something in the BeautifulSoup documentation, or am I truly unable to modify this behavior and need to use another approach for outputting the XML to file? Maybe I'm just better off rewriting this using xml.etree.ElementTree instead?

Austenite answered 11/10, 2019 at 16:2 Comment(0)

It turns out to be straightforward to get the indentation I want if I use xml.etree.ElementTree instead of BeautifulSoup. For example, the code below reads an XML file, strips leading and trailing whitespace from each element's text, and then writes the tree back out as an XML file.

import argparse
from xml.etree import ElementTree


# ------------------------------------------------------------------------------
def reformat(
        input_xml: str,
        output_xml: str,
):
    tree = ElementTree.parse(input_xml)

    # remove extraneous newlines and whitespace from text elements
    # (iter() replaces getiterator(), which was removed in Python 3.9)
    for element in tree.iter():
        if element.text:
            element.text = element.text.strip()

    # write the updated XML into the annotations output directory
    tree.write(output_xml)


# ------------------------------------------------------------------------------
if __name__ == "__main__":

    # parse the command line arguments
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument(
        "--in",
        required=True,
        type=str,
        help="file path of original XML",
    )
    args_parser.add_argument(
        "--out",
        required=True,
        type=str,
        help="file path of reformatted XML",
    )
    args = vars(args_parser.parse_args())

    reformat(
        args["in"],
        args["out"],
    )
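
On Python 3.9 or later, the standard library can also rebuild the indentation after the text has been stripped. The following is a minimal sketch and not part of the original answer: the helper name strip_and_indent and the one-space indent are assumptions chosen to match the desired output in the question, and ElementTree.indent only exists from Python 3.9 onward.

```python
import xml.etree.ElementTree as ET


def strip_and_indent(xml_text: str, space: str = " ") -> str:
    """Strip whitespace around every text value, then re-indent the tree."""
    root = ET.fromstring(xml_text)

    for element in root.iter():
        # Text directly inside the element, e.g. "\n  Definitiva\n "
        if element.text:
            element.text = element.text.strip()
        # Whitespace that follows the element's closing tag
        if element.tail:
            element.tail = element.tail.strip()

    # indent() only rewrites text/tail values that are empty or whitespace,
    # so the stripped leaf values stay on one line.
    ET.indent(root, space=space)
    return ET.tostring(root, encoding="unicode")


sample = "<a>\n <b>\n  x\n </b>\n <c>\n  <d>\n   1\n  </d>\n </c>\n</a>"
print(strip_and_indent(sample))
```

Stripping the tails as well means the result no longer depends on how the input happened to be indented.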
Austenite answered 25/10, 2019 at 21:31 Comment(0)

My assumption is that the BeautifulSoup.prettify is adding these superfluous/gratuitous newlines by default, and there doesn't appear to be a good way to modify this behavior.

Yes, that assumption is correct.

The newlines are added in two methods of the bs4.Tag class: decode and decode_contents.

Ref: Source file on github

If you just need a temporary fix, you can monkey patch these two methods.

Here is my implementation:

from bs4 import Tag, NavigableString, BeautifulSoup
from bs4.element import AttributeValueWithCharsetSubstitution, EntitySubstitution


def decode(
    self, indent_level=None,
    eventual_encoding="utf-8", formatter="minimal"
):
    if not callable(formatter):
        formatter = self._formatter_for_name(formatter)

    attrs = []
    if self.attrs:
        for key, val in sorted(self.attrs.items()):
            if val is None:
                decoded = key
            else:
                if isinstance(val, list) or isinstance(val, tuple):
                    val = ' '.join(val)
                elif not isinstance(val, str):
                    val = str(val)
                elif (
                    isinstance(val, AttributeValueWithCharsetSubstitution)
                    and eventual_encoding is not None
                ):
                    val = val.encode(eventual_encoding)

                text = self.format_string(val, formatter)
                decoded = (
                    str(key) + '='
                    + EntitySubstitution.quoted_attribute_value(text))
            attrs.append(decoded)
    close = ''
    closeTag = ''
    prefix = ''
    if self.prefix:
        prefix = self.prefix + ":"

    if self.is_empty_element:
        close = '/'
    else:
        closeTag = '</%s%s>' % (prefix, self.name)

    pretty_print = self._should_pretty_print(indent_level)
    space = ''
    indent_space = ''
    if indent_level is not None:
        indent_space = (' ' * (indent_level - 1))
    if pretty_print:
        space = indent_space
        indent_contents = indent_level + 1
    else:
        indent_contents = None
    contents = self.decode_contents(
        indent_contents, eventual_encoding, formatter)

    if self.hidden:
        # This is the 'document root' object.
        s = contents
    else:
        s = []
        attribute_string = ''
        if attrs:
            attribute_string = ' ' + ' '.join(attrs)
        if indent_level is not None:
            # Even if this particular tag is not pretty-printed,
            # we should indent up to the start of the tag.
            s.append(indent_space)
        s.append('<%s%s%s%s>' % (
                prefix, self.name, attribute_string, close))
        has_tag_child = False
        if pretty_print:
            for item in self.children:
                if isinstance(item, Tag):
                    has_tag_child = True
                    break
            if has_tag_child:
                s.append("\n")
        s.append(contents)
        if not has_tag_child:
            s[-1] = s[-1].strip()
        if pretty_print and contents and contents[-1] != "\n":
            s.append("")
        if pretty_print and closeTag:
            if has_tag_child:
                s.append(space)
        s.append(closeTag)
        if indent_level is not None and closeTag and self.next_sibling:
            # Even if this particular tag is not pretty-printed,
            # we're now done with the tag, and we should add a
            # newline if appropriate.
            s.append("\n")
        s = ''.join(s)
    return s


def decode_contents(
    self,
    indent_level=None,
    eventual_encoding="utf-8",
    formatter="minimal"
):
    # First off, turn a string formatter into a function. This
    # will stop the lookup from happening over and over again.
    if not callable(formatter):
        formatter = self._formatter_for_name(formatter)

    pretty_print = (indent_level is not None)
    s = []
    for c in self:
        text = None
        if isinstance(c, NavigableString):
            text = c.output_ready(formatter)
        elif isinstance(c, Tag):
            s.append(
                c.decode(indent_level, eventual_encoding, formatter)
            )
        if text and indent_level and not self.name == 'pre':
            text = text.strip()
        if text:
            if pretty_print and not self.name == 'pre':
                s.append(" " * (indent_level - 1))
            s.append(text)
            if pretty_print and not self.name == 'pre':
                s.append("")
    return ''.join(s)


Tag.decode = decode
Tag.decode_contents = decode_contents

After this, when I ran print(soup.prettify()), the output was

<annotation>
 <folder>Definitiva</folder>
 <filename>armas_229.jpg</filename>
 <path>/tmp/tmpygedczp5/handgun/images/armas_229.jpg</path>
 <size>
  <width>1800</width>
  <height>1426</height>
  <depth>3</depth>
 </size>
 <segmented>0</segmented>
 <object>
  <name>handgun</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <bndbox>
   <xmin>1001</xmin>
   <ymin>549</ymin>
   <xmax>1453</xmax>
   <ymax>1147</ymax>
  </bndbox>
 </object>
</annotation>

I made a lot of assumptions while doing this. Just wanted to show that it is possible.
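
If patching bs4 internals feels too fragile, a different technique is to leave prettify() untouched and post-process the string it returns. This is only a sketch under stated assumptions: every leaf text value fits on a single line and contains no markup, and collapse_leaf_text is a hypothetical helper, not part of BeautifulSoup.

```python
import re

# A ">" followed by a newline, one line of plain text, another newline,
# and then a closing tag: the shape prettify() gives to every leaf element.
_LEAF_TEXT = re.compile(r">\n\s*([^<>\n]+?)\n\s*</")


def collapse_leaf_text(pretty_xml: str) -> str:
    """Pull single-line text values back onto the line of their tags."""
    return _LEAF_TEXT.sub(r">\1</", pretty_xml)


pretty = "<size>\n <width>\n  1800\n </width>\n <depth>\n  3\n </depth>\n</size>"
print(collapse_leaf_text(pretty))
```

With this in place, the write step in the question would become pascal_file.write(collapse_leaf_text(soup.prettify())). Multi-line text values are left as they are.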

Bombacaceous answered 11/10, 2019 at 19:36 Comment(1)
Very nice work, thanks @Bombacaceous Bennichan. I may rewrite my code using ElementTree as I'm hoping to not go as "off the reservation" as this and I don't think BeautifulSoup really buys me much that ElementTree doesn't offer (I used BS since it was used in a tutorial I was following for this task). Nice hack though, kudos! - Austenite

Consider XSLT with Python's third-party module, lxml (which you possibly already have with BeautifulSoup integration). Specifically, call the identity transform to copy XML as is and then run the normalize-space() template on all text nodes.

XSLT (save as .xsl, a special .xml file or embedded string)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="@*|node()">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>

    <!-- RUN normalize-space() ON ALL TEXT NODES -->
    <xsl:template match="text()">
        <xsl:copy-of select="normalize-space()"/>
    </xsl:template>            
</xsl:stylesheet>

Python

import lxml.etree as et

# LOAD FROM STRING OR PARSE FROM FILE
str_xml = '''...'''    
str_xsl = '''...'''

doc = et.fromstring(str_xml)
style = et.fromstring(str_xsl)

# INITIALIZE TRANSFORMER AND RUN 
transformer = et.XSLT(style)
result = transformer(doc)

# PRINT TO SCREEN
print(result)

# SAVE TO DISK
with open('Output.xml', 'wb') as f:
    f.write(result)

Rextester demo
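
Note that normalize-space() does more than Python's strip(): besides trimming both ends, it collapses every internal run of whitespace into a single space. A rough Python equivalent, as a sketch only (the name normalize_space is illustrative; XPath 1.0 treats space, tab, carriage return and line feed as whitespace):

```python
import re

# The four characters XPath 1.0 counts as whitespace
_XPATH_WS = " \t\r\n"


def normalize_space(text: str) -> str:
    """Trim both ends, then collapse internal whitespace runs to one space."""
    trimmed = text.strip(_XPATH_WS)
    return re.sub(r"[ \t\r\n]+", " ", trimmed)


print(repr(normalize_space("\n  armas_229.jpg\n ")))
print(repr(normalize_space(" a \n  b ")))
```

For the values in this question the two behave the same, but text that contains meaningful internal line breaks would be flattened by the XSLT approach.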

Megaron answered 11/10, 2019 at 18:36 Comment(5)
Thank you, @Parfait, this is helpful. I was hoping for something that just works out of the box instead of taking an approach like this, as this seems like almost as much work as rewriting my code to use ElementTree rather than BeautifulSoup. That assumes that the output from ElementTree will be formatted reasonably concisely, unlike the default BeautifulSoup output. I guess there's only one way to find out (i.e. stop being lazy and do the rewrite using ElementTree). In any event, thanks again for your help. - Austenite
Hmm... saving a separate .xsl alongside the .xml (or embedding either as a string, as the demo shows) is too much work? Without any loops or rewriting the tree in Python? But understood. I don't know your full process; I just came to answer the titled question. I use lxml for all my XML needs (even HTML), as it is a fully compliant XPath 1.0 and XSLT 1.0 library. Good luck! Maybe this can help future readers. - Megaron
It turned out to be trivial to rewrite my code using ElementTree and the XML formatting looks as expected/desired when I use ElementTree.write(file_path). This seems to be the easiest/simplest, for me anyway... - Austenite
Great to hear! Glad it worked out. Feel free to answer your own question. Hopefully it does not involve loops. - Megaron
After a little more testing it turns out I was mistaken and it's not quite as simple as this. Before I rewrite the XML I need to update all elements with text values using strip() to remove extraneous whitespace and newlines. Nevertheless xml.etree.ElementTree seems to be easier to use for this and so I've started using that instead of BeautifulSoup for this sort of thing. - Austenite

I wrote some code that does the prettification without any extra library.

Prettification logic

# Recursive helper (do not call this directly)
def _get_prettified(tag, curr_indent, indent):
    out = ''
    for x in tag.find_all(recursive=False):
        if len(x.find_all()) == 0:
            # x.string is None for an empty tag, so fall back to ''
            content = (x.string or '').strip(' \n')
        else:
            content = '\n' + _get_prettified(x, curr_indent + ' ' * indent, indent) + curr_indent

        attrs = ' '.join([f'{k}="{v}"' for k, v in x.attrs.items()])
        out += curr_indent + ('<%s %s>' % (x.name, attrs) if len(attrs) > 0 else '<%s>' % x.name) + content + '</%s>\n' % x.name

    return out

# Call this function
def get_prettified(tag, indent):
    return _get_prettified(tag, '', indent)
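
The function above writes text and attribute values verbatim, so a value containing & or < can produce malformed XML. One possible way to harden the leaf case, sketched with the standard library's xml.sax.saxutils; render_leaf is a hypothetical helper and not part of the original answer, and calling str() on each attribute value is a simplification.

```python
from xml.sax.saxutils import escape, quoteattr


def render_leaf(name, attrs, text):
    """Return one leaf element as a string with text and attributes escaped."""
    # quoteattr() escapes the value and wraps it in suitable quotes
    attr_str = ''.join(
        f' {key}={quoteattr(str(value))}' for key, value in attrs.items()
    )
    # escape() replaces &, < and > in the text value
    return f'<{name}{attr_str}>{escape(text.strip())}</{name}>'


print(render_leaf('name', {}, '\n  handgun\n '))
print(render_leaf('pose', {'id': '1'}, ' a & b < c '))
```

The same two calls could replace the f-string and the plain concatenation inside _get_prettified.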

Your input

source = """<annotation>
 <folder>
  Definitiva
 </folder>
 <filename>
  armas_229.jpg
 </filename>
 <path>
  /tmp/tmpygedczp5/handgun/images/armas_229.jpg
 </path>
 <size>
  <width>
   1800
  </width>
  <height>
   1426
  </height>
  <depth>
   3
  </depth>
 </size>
 <segmented>
  0
 </segmented>
 <object>
  <name>
   handgun
  </name>
  <pose>
   Unspecified
  </pose>
  <truncated>
   0
  </truncated>
  <difficult>
   0
  </difficult>
  <bndbox>
   <xmin>
    1001
   </xmin>
   <ymin>
    549
   </ymin>
   <xmax>
    1453
   </xmax>
   <ymax>
    1147
   </ymax>
  </bndbox>
 </object>
</annotation>"""

Output

from bs4 import BeautifulSoup

bs = BeautifulSoup(source, 'html.parser')
output = get_prettified(bs, indent=2)
print(output)

# Prints the following
<annotation>
  <folder>Definitiva</folder>
  <filename>armas_229.jpg</filename>
  <path>/tmp/tmpygedczp5/handgun/images/armas_229.jpg</path>
  <size>
    <width>1800</width>
    <height>1426</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>handgun</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>1001</xmin>
      <ymin>549</ymin>
      <xmax>1453</xmax>
      <ymax>1147</ymax>
    </bndbox>
  </object>
</annotation>

Run your code here: https://replit.com/@bikcrum/BeautifulSoup-Prettifier

Bioluminescence answered 4/4, 2021 at 7:55 Comment(1)
I tried your code. It works fine with the HTML parser, bs = BeautifulSoup(source, 'html.parser'), but it converts all tags to lowercase. When I used bs = BeautifulSoup(source, 'xml') it gave the correct tag case, but it created gaps between the lines. - Brittani
