How to Pretty Print HTML to a file, with indentation
Asked Answered
S

11

110

I am using lxml.html to generate some HTML. I want to pretty print (with indentation) my final result into an html file. How do I do that?

This is what I have tried and got till now

import lxml.html as lh
from lxml.html import builder as E
sliderRoot=lh.Element("div", E.CLASS("scroll"), style="overflow-x: hidden; overflow-y: hidden;")
scrollContainer=lh.Element("div", E.CLASS("scrollContainer"), style="width: 4340px;")
sliderRoot.append(scrollContainer)
print lh.tostring(sliderRoot, pretty_print = True, method="html")

As you can see I am using the pretty_print=True attribute. I thought that would give indented code, but it doesn't really help. This is the output :

<div style="overflow-x: hidden; overflow-y: hidden;" class="scroll"><div style="width: 4340px;" class="scrollContainer"></div></div>

Swane answered 27/5, 2011 at 9:9 Comment(0)
S
149

I ended up using BeautifulSoup directly. That is something lxml.html.soupparser uses for parsing HTML.

BeautifulSoup has a prettify method that does exactly what it says it does. It prettifies the HTML with proper indents and everything.

BeautifulSoup will NOT fix the HTML, so broken code, remains broken. But in this case, since the code is being generated by lxml, the HTML code should be at least semantically correct.

In the example given in my question, I will have to do this :

from bs4 import BeautifulSoup as bs
root = lh.tostring(sliderRoot) #convert the generated HTML to a string
soup = bs(root)                #make BeautifulSoup
prettyHTML = soup.prettify()   #prettify the html
Swane answered 29/5, 2011 at 11:14 Comment(5)
Thank you, but worth to mention that js embedded to html will not prettified, if it is important to somebody.Jolley
With version 4 change the first line to from bs4 import BeautifulSoup as bsAid
If you just want to prettify html from a string, see AlexG's answer below.Manipulator
Be careful with prettify, as it changes document semantics: "Since it adds whitespace (in the form of newlines), prettify() changes the meaning of an HTML document and should not be used to reformat one. The goal of prettify() is to help you visually understand the structure of the documents you work with."Witty
Another warning: With version 4, BeautifulSoup will DECODE html entities, so if you were decoding strings with user-posted content (eg.: forum posts), it will be happy to reverse escaped HTML back, opening you to potential problems.Becoming
B
51

Though my answer might not be helpful now, I am dropping it here to act as a reference to anybody else in future.

lxml.html.tostring(), indeed, doesn't pretty print the provided HTML in spite of pretty_print=True.

However, the "sibling" of lxml.html - lxml.etree has it working well.

So one might use it as following:

from lxml import etree, html

document_root = html.fromstring("<html><body><h1>hello world</h1></body></html>")
print(etree.tostring(document_root, encoding='unicode', pretty_print=True))

The output is like this:

<html>
  <body>
    <h1>hello world</h1>
  </body>
</html>
Babu answered 12/5, 2013 at 9:1 Comment(4)
The pretty_print flag only works when calling etree.tostring with method='xml', which is the default. So, we are dealing with XHTML here.Broken
This is an excellent answer, because it doesn't use any external dependencies. However, if the string containing HTML has carriage returns, etree.tostring pretties nothing, and returns its input, unchanged, on Python 2.7.10 at least ... once you know, it's a simple matter to replace the carriage returns, but you'll waste a lot of time if you don't know this.Kathe
This is great because it only provides a solution to tabs. This does not alter the HTML in other ways like BeautifulSoup solutions.Aw
NOPE! And here's why. etree.tostring will shorten "<i></i>" to "<i/>" which is not allowed.Scrappy
P
28

If you store the HTML as an unformatted string, in a variable html_string, it can be done using beautifulsoup4 as follows:

from bs4 import BeautifulSoup
print(BeautifulSoup(html_string, 'html.parser').prettify())
Pitchfork answered 13/11, 2017 at 7:26 Comment(1)
I've just tried this method to reformat legacy html, and the result is visually different, particularly regarding vertical spacing. Not saying the original html syntax was correct to start with, but be warned that this does not guarantee the same visual output.Glanville
S
7

If adding one more dependency is not a problem, you can use the html5print package. The advantage over the other solutions, is that it also beautifies both CSS and Javascript code embedded in the HTML document.

To install it, execute:

pip install html5print

Then, you can either use it as a command:

html5-print ugly.html -o pretty.html

or as Python code:

from html5print import HTMLBeautifier
html = '<title>Page Title</title><p>Some text here</p>'
print(HTMLBeautifier.beautify(html, 4))
Sutra answered 19/3, 2018 at 20:49 Comment(1)
this installs several other dependencies including beautifulsoup4Cichlid
U
5

I tried both BeautifulSoup's prettify and html5print's HTMLBeautifier solutions but since I'm using yattag to generate HTML it seems more appropriate to use its indent function, which produces nicely indented output.

from yattag import indent

rawhtml = "String with some HTML code..."

result = indent(
    rawhtml,
    indentation = '    ',
    newline = '\r\n',
    indent_text = True
)

print(result)
Usurious answered 6/5, 2018 at 8:2 Comment(0)
P
4

Under the hood, lxml uses libxml2 to serialize the tree back into a string. Here is the relevant snippet of code that determines whether to append a newline after closing a tag:

    xmlOutputBufferWriteString(buf, ">");
    if ((format) && (!info->isinline) && (cur->next != NULL)) {
        if ((cur->next->type != HTML_TEXT_NODE) &&
            (cur->next->type != HTML_ENTITY_REF_NODE) &&
            (cur->parent != NULL) &&
            (cur->parent->name != NULL) &&
            (cur->parent->name[0] != 'p')) /* p, pre, param */
            xmlOutputBufferWriteString(buf, "\n");
    }
    return;

So if a node is an element, is not an inline tag and is followed by a sibling node (cur->next != NULL) and isn't one of p, pre, param then it will output a newline.

Polliwog answered 27/5, 2011 at 15:40 Comment(0)
E
3

Couldn't you just pipe it into HTML Tidy? Either from the shell or through os.system().

Engrossment answered 27/5, 2011 at 13:14 Comment(3)
I initially thought of using HTML Tidy, but my code is slightly quirky and tidy just ends up playing havoc with it. Decided to use BeautifulSoup instead. Worked like a charm.Swane
HTML Tidy corrects your HTML which can break things. Such errors are pretty hard to find if you're forgetting that HTML Tidy is processing the results (I know what I'm talking about)...Parotid
More recently than the 2011 comments here, see the answer to this 2018 question: #50381299. "That library is broken and/or doesn't work with python 3.5." May save someone a bit of time...Claytonclaytonia
U
2

If you don't care about quirky HTMLness (e.g. you must support absolutely support those hordes of Netscpae 2.0-using clients, so having <br> instead of <br /> is a must), you can always change your method to "xml", which seems to work. This is probably a bug in lxml or in libxml, but I couldn't find the reason for it.

Uncalledfor answered 27/5, 2011 at 12:56 Comment(1)
When you set the method to xml, if a tag does not have any sub-elements, then the closing tag is not generated. For instance, in the example in question, the inner div will not have a closing tag. I don't really know why. I ended up using the BeautifulSoup to get a proper output.Swane
E
2

not really my code, I picked it somewhere

def indent(elem, level=0):
    i = '\n' + level * '  '
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + '  '
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            indent(elem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i

I use it with:

indent(page)
tostring(page)
Exponible answered 27/5, 2011 at 15:22 Comment(1)
If it's not your code, please cite where it's from. What's tostring()?Indigene
C
0

This is crude and not very robust but it will balance out the example html string, without using any non-standard libraries

import re
html = """
<A value="X"><B value="X">
<A value="X"><B value="X">
some random text
</B></A><C />
some random text
</B></A><C />
"""
rx_al = r"(<[\/A-Za-z]+[^>]*>)"
rx_op = r"<([A-Za-z]+)[^\/]+>"
rx_cl = r"</([A-Za-z]+)>"
# self-closing not needed
#rx_sc = r"<([A-Za-z]+).* \/>"
matches = re.findall(rx_al, html, flags=re.M)

def lookup_key(match, indent):
  return f"{match[0]}:{indent}"

def balance(nodes):
  builder = []
  indent = 0
  lookup = {}
  for node in nodes:
    is_open = re.match(rx_op, node)
    is_close = re.match(rx_cl, node)
    padl = " " * indent
    if is_open:
      k = lookup_key(is_open, indent)
      lookup[k] = node
      indent += 2
    elif is_close:
      for i in range(0, indent, -2):
        if lookup.pop(lookup_key(is_close, i), None):
          break
      indent -= 2
      padl = padl[:-2]
    builder.append(padl + node)
  return "\n".join(builder)

print(balance(matches))

will produce:

<A value="X">
  <B value="X">
    <A value="X">
      <B value="X">
      </B>
    </A>
    <C />
  </B>
</A>
<C />
Contraception answered 23/3, 2023 at 7:28 Comment(0)
I
0

Customize indented tags and text (by default, etree.indent cannot indent text).

from lxml import etree
import textwrap

def etree_pretty(root, space="\t"):
    for elem in root.iterdescendants():
        if elem.text:
            depth = int(elem.xpath("count(ancestor::*)")) + 1
            temp_text = textwrap.dedent(elem.text).strip()
            elem.text = (
                    "\n"
                    + textwrap.indent(temp_text, prefix=depth * space)
                    + ("\n" + (depth-1) * space)
            )
    etree.indent(root, space=space)

html_str = """
<html><body><h1>hello world</h1><h2>sub title</h2></body></html>
"""
root = etree.HTML(html_str)
etree_pretty(root)
print(etree.tostring(root, encoding="unicode"))

Print:

<html>
    <body>
        <h1>
            hello world
        </h1>
        <h2>
            sub title
        </h2>
    </body>
</html>
Interlinear answered 22/9, 2023 at 1:35 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.