Custom indent width for BeautifulSoup .prettify()

Is there any way to define a custom indent width for the .prettify() function? From what I can tell from its source:

def prettify(self, encoding=None, formatter="minimal"):
    if encoding is None:
        return self.decode(True, formatter=formatter)
    else:
        return self.encode(encoding, True, formatter=formatter)

There is no way to specify indent width. I think it's because of this line in the decode_contents() function -

s.append(" " * (indent_level - 1))

Which has a fixed length of 1 space! (WHY!!) I tried specifying indent_level=4, that just results in this -

    <section>
     <article>
      <h1>
      </h1>
      <p>
      </p>
     </article>
    </section>

Which looks just plain stupid. :|

Now, I can hack this away, but I just want to be sure if there is anything I'm missing. Because this should be a basic feature. :-/

If you know a better way of prettifying HTML, let me know.

Anthelion answered 19/3, 2013 at 20:7 Comment(3)
In answer to your side question ("WHY!"): HTML and XML tend to be very, very deeply nested, and I'm guessing the Crummy guys like 80-column windows. But you might want to post to the mailing list/group and/or file a bug requesting this feature. Since the patch is pretty simple, and ramabodhi already pretty much wrote it for you, you should include it with your email/bug report. - Reportage
It looks like someone submitted a similar patch against 3.2 to the mailing list a couple years ago. See here. - Reportage
"1-space indent looks just plain stupid. :|" - Thank you. This is exactly what I was thinking when I was searching for this issue. - Apoloniaapolune
R
27

I actually dealt with this myself, in the hackiest way possible: by post-processing the result.

import re

# Match leading spaces only; \s would also swallow newlines on blank lines.
r = re.compile(r'^( *)', re.MULTILINE)
def prettify_2space(s, encoding=None, formatter="minimal"):
    return r.sub(r'\1\1', s.prettify(encoding, formatter))

Actually, I monkeypatched prettify_2space in place of prettify in the class. That's not essential to the solution, but let's do it anyway, and make the indent width a parameter instead of hardcoding it to 2:

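import re
import bs4

orig_prettify = bs4.BeautifulSoup.prettify
# Match leading spaces only; \s would also swallow newlines on blank lines.
r = re.compile(r'^( *)', re.MULTILINE)
def prettify(self, encoding=None, formatter="minimal", indent_width=4):
    # Assumes encoding is None so that prettify returns str, not bytes.
    return r.sub(r'\1' * indent_width, orig_prettify(self, encoding, formatter))
bs4.BeautifulSoup.prettify = prettify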

So:

x = '''<section><article><h1></h1><p></p></article></section>'''
soup = bs4.BeautifulSoup(x)
print(soup.prettify(indent_width=3))

… gives:

<html>
   <body>
      <section>
         <article>
            <h1>
            </h1>
            <p>
            </p>
         </article>
      </section>
   </body>
</html>

Obviously if you want to patch Tag.prettify as well as BeautifulSoup.prettify, you have to do the same thing there. (You might want to create a generic wrapper that you can apply to both, instead of repeating yourself.) And if there are any other prettify methods, same deal.
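
The generic wrapper mentioned above could look roughly like the sketch below. The names with_indent_width and FakeTag are made up for illustration; FakeTag only stands in for a class whose prettify returns text with one space per nesting level, so the snippet runs without bs4 installed. It assumes the wrapped method returns str (no encoding argument), and it still does not handle <pre>.

```python
import functools
import re

# Leading spaces only, so blank lines and newlines are never touched.
_LEADING_SPACES = re.compile(r'^( +)', re.MULTILINE)

def with_indent_width(prettify_method):
    """Wrap a prettify-style method so it accepts an indent_width keyword."""
    @functools.wraps(prettify_method)
    def wrapper(self, *args, indent_width=1, **kwargs):
        text = prettify_method(self, *args, **kwargs)
        # Assumption: text is str and uses one space per nesting level.
        return _LEADING_SPACES.sub(lambda m: m.group(1) * indent_width, text)
    return wrapper

class FakeTag:
    """Hypothetical stand-in for a class with a 1-space prettify()."""
    def prettify(self):
        return "<a>\n <b>\n  <c>\n  </c>\n </b>\n</a>"

# Apply the same wrapper to any class that has a prettify method.
FakeTag.prettify = with_indent_width(FakeTag.prettify)
print(FakeTag().prettify(indent_width=4))
```

With the real library you would apply the same wrapper to each prettify method you want to patch instead of to FakeTag.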

Reportage answered 20/3, 2013 at 1:6 Comment(2)
breaks on <pre> - Ergotism
return re.sub(r'\n(\s+)', r'\n{}'.format(r"\1"*indent_width), orig_prettify(...)) with Python 3.9 - Brainless
S
13

Beautiful Soup has output formatters. bs4.formatter.HTMLFormatter lets you specify the indent width.

>>> import bs4
>>> s = '<section><article><h1></h1><p></p></article></section>'
>>> formatter = bs4.formatter.HTMLFormatter(indent=2)
>>> print(bs4.BeautifulSoup(s, 'html.parser').prettify(formatter=formatter))
<section>
  <article>
    <h1>
    </h1>
    <p>
    </p>
  </article>
</section>

You can also use it from the command line with pyfil (e.g. to integrate with Geany's "Send Selection to" feature):

pyfil 'bs4.BeautifulSoup(stdin, "html.parser").prettify(formatter=bs4.formatter.HTMLFormatter(indent=2))'
Scandalize answered 24/6, 2022 at 16:3 Comment(3)
I get the following error: TypeError: __init__() got an unexpected keyword argument 'indent'. I've searched far and wide across the internet but nothing came of it. - Gothar
You must be using an old version of the package. indent was added in this commit. According to the changelog, it has been available since version 4.11.0. - Scandalize
Turns out pip install does not automatically upgrade to the latest version if the package is already installed. pip install -U is needed for this job. Thanks @Scandalize - Gothar
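
If you hit the same TypeError, a quick check like the one below may save some searching. It is only a sketch: supports_indent is a hypothetical helper, and the 4.11 threshold is taken from the changelog mentioned in the comment above. The bs4 import is guarded so the snippet also runs where the package is missing.

```python
import re

def supports_indent(version):
    """Return True if a version string looks like Beautiful Soup 4.11 or newer."""
    # Keep only the first two numeric components, e.g. "4.11.0" -> [4, 11].
    numbers = [int(n) for n in re.findall(r"\d+", version)[:2]]
    numbers += [0] * (2 - len(numbers))  # pad short strings such as "5"
    return tuple(numbers) >= (4, 11)

try:
    import bs4  # third-party; may be missing or outdated
except ImportError:
    bs4 = None

if bs4 is not None:
    # Print the installed version and whether indent= should be accepted.
    print(bs4.__version__, supports_indent(bs4.__version__))
```

If the check reports False, upgrading with pip install -U beautifulsoup4 should bring in a version that accepts indent.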
P
6

As far as I can tell, this feature is not built in, as there are a handful of solutions out there for this problem.

Assuming you are using BeautifulSoup 4, here are the solutions I came up with:

Hardcode it in. This requires minimal changes and is fine if you don't need the indent to differ between circumstances:

myTab = 4  # add this
if pretty_print:
    # space = (' ' * (indent_level - 1))
    space = (' ' * (indent_level - myTab))
    # indent_contents = indent_level + 1
    indent_contents = indent_level + myTab

One drawback of the previous solution is that text content won't be indented entirely consistently, though the result still looks reasonable. If you need a more flexible and consistent solution, you can modify the class itself.

Find the prettify function and modify it as such (it is located in the Tag class in element.py):

# Add the myTab keyword (or whatever you want to call it) to the function's parameters and set it to your preferred default.
def prettify(self, encoding=None, formatter="minimal", myTab=2): 
    Tag.myTab= myTab # add a reference to it in the Tag class
    if encoding is None:
        return self.decode(True, formatter=formatter)
    else:
        return self.encode(encoding, True, formatter=formatter)

And then scroll up to the decode method in the Tag class and make the following changes:

if pretty_print:
    # space = (' ' * (indent_level - 1))
    space = (' ' * (indent_level - Tag.myTab))
    # indent_contents = indent_level + 1
    indent_contents = indent_level + Tag.myTab

Then go to the decode_contents method in the Tag class and make these changes:

#s.append(" " * (indent_level - 1))
s.append(" " * (indent_level - Tag.myTab))

Now BeautifulSoup('<root><child><desc>Text</desc></child></root>').prettify(myTab=4) will return:

<root>
    <child>
        <desc>
            Text
        </desc>
    </child>
</root>

Note: there is no need to patch the BeautifulSoup class, since it inherits from the Tag class. Patching the Tag class is sufficient.

Panzer answered 20/3, 2013 at 0:59 Comment(2)
This should be very easy to convert into a patch against the bs4 source tree, which is handy. The OP can just make his own fork of the bzr tree and patch it, submit the patch upstream, etc. - Reportage
Thanks guys. I just couldn't believe only one person had a problem with this in all these years and proposed a patch, and it is still not merged. I have already modified the function to take a variable length (as I hate hard-coding things). It pretty much does what you have suggested. But the thing is, you need to provide something for indent_level because of this line: pretty_print = (indent_level is not None). As far as I can see, the default value of indent_level is None and there is no dynamic way to change it. <_< - Anthelion
M
3

Here's a way to increase indentation w/o meddling with original functions, etc. Create the following function:

# Add 'n' extra spaces per existing leading whitespace character (i.e. per indent level) to each line of 'text'
def add_indent(text,n):
  sp = " "*n
  lsep = chr(10) if text.find(chr(13)) == -1 else chr(13)+chr(10)
  lines = text.split(lsep)
  for i in range(len(lines)):
    spacediff = len(lines[i]) - len(lines[i].lstrip())
    if spacediff: lines[i] = sp*spacediff + lines[i] 
  return lsep.join(lines)

Then convert the text you obtained using the above function:

import bs4

x = '''<section><article><h1></h1><p></p></article></section>'''
soup = bs4.BeautifulSoup(x, 'html.parser')  # name the parser explicitly; omitting it gives me a warning
text = soup.prettify()
text = add_indent(text, 1)  # add 1 extra space per indent level
print(text)
'''
Output (html.parser does not add <html>/<body> wrappers):
<section>
  <article>
    <h1>
    </h1>
    <p>
    </p>
  </article>
</section>
'''
Monogenesis answered 27/7, 2020 at 6:8 Comment(3)
breaks on <pre> - Ergotism
@Mila Nautijus, <pre>??? - Monogenesis
yes, <pre>. whitespace must be preserved in pre tags - Ergotism
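
For what it's worth, the <pre> problem raised in these comments can be worked around by skipping lines that fall inside a <pre> element. The sketch below is one possible take, not anything from the library: reindent_outside_pre is a hypothetical helper, it assumes "\n" line endings and one space per nesting level, and it assumes the prettified text keeps <pre> content verbatim. The sample string is hand-written rather than real prettify() output.

```python
import re

# Matches an opening <pre ...> or closing </pre> tag, case-insensitively.
_PRE_TAG = re.compile(r'<(/?)pre(?=[\s>])', re.IGNORECASE)

def reindent_outside_pre(text, width):
    """Multiply leading spaces by width, leaving lines inside <pre> untouched."""
    out = []
    in_pre = False
    for line in text.split("\n"):
        if not in_pre:
            stripped = line.lstrip(" ")
            indent = len(line) - len(stripped)
            line = " " * (indent * width) + stripped
        # The last <pre> or </pre> tag on the line decides the state
        # for the lines that follow.
        for match in _PRE_TAG.finditer(line):
            in_pre = match.group(1) == ""
        out.append(line)
    return "\n".join(out)

sample = "<div>\n <pre>  a\n  b</pre>\n <p>\n </p>\n</div>"
print(reindent_outside_pre(sample, 4))
```

The line that opens the <pre> still gets its own indent widened, but the content lines after it are passed through unchanged until the closing tag has been seen.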
