How do I fix wrongly nested / unclosed HTML tags?
V

6

21

I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven't found anything except some half-baked implementations in PHP, etc.

For example, something like

<p>
  <ul>
    <li>Foo

becomes

<p>
  <ul>
    <li>Foo</li>
  </ul>
</p>

Any help would be appreciated :)

Vtehsta answered 16/11, 2008 at 4:14 Comment(1)
This is, of course, generally a "bad idea"(TM). Fixing the tags for the user may or may not yield what they intended. I'd rather validate the input, reject the update, and tell the user as much as I can about what I think is wrong (suggesting fixes, but not applying them automatically). Besides, your example proves my point: <ul> is NOT ALLOWED inside <p>, so your "fix" actually repairs nothing. – Wreckage
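The validate-and-report approach from the comment above could look roughly like the sketch below. It uses only the standard library's html.parser; the names NestingChecker, check_nesting and VOID_TAGS are made up for illustration, and the void-tag set is a partial list. It only tracks open/close balance and does not check content rules such as <ul> inside <p>.

```python
from html.parser import HTMLParser

# Assumption: a partial list of void elements, which never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NestingChecker(HTMLParser):
    """Hypothetical helper: records nesting problems instead of fixing them."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tag names that are currently open
        self.problems = []   # human-readable messages for the user

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()  # closes the innermost open tag: fine
        else:
            line, _col = self.getpos()
            self.problems.append(f"line {line}: unexpected </{tag}>")


def check_nesting(html):
    """Return a list of problem descriptions; an empty list means balanced."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    # Anything still on the stack was opened but never closed.
    for tag in reversed(checker.open_tags):
        checker.problems.append(f"<{tag}> is never closed")
    return checker.problems


for message in check_nesting("<p><ul><li>Foo"):
    print(message)
```

If the returned list is non-empty, the submission can be rejected and the messages shown to the user.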
C
33

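Using BeautifulSoup:

from bs4 import BeautifulSoup  # BeautifulSoup 4; the old `BeautifulSoup` module is Python 2 only
html = "<p><ul><li>Foo"
soup = BeautifulSoup(html, "html.parser")  # name the parser so the result does not depend on what is installed
print(soup.prettify())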

gets you

<p>
 <ul>
  <li>
   Foo
  </li>
 </ul>
</p>

As far as I know, you can't control putting the <li></li> tags on separate lines from Foo.

Using Tidy:

import tidy  # Python bindings for HTML Tidy (uTidylib)
html = "<p><ul><li>Foo"
print(tidy.parseString(html, show_body_only=True))

gets you

<ul>
<li>Foo</li>
</ul>

Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing

print(tidy.parseString(html, show_body_only=True, drop_empty_paras=False))

comes out as

<p></p>
<ul>
<li>Foo</li>
</ul>

Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.

Finally, Tidy can also do indenting:

print(tidy.parseString(html, show_body_only=True, indent=True))

becomes

<ul>
  <li>Foo
  </li>
</ul>

All of these have their ups and downs, but hopefully one of them is close enough.

Carpentry answered 16/11, 2008 at 6:5 Comment(3)
The reason Tidy sees it as an empty element is that p elements are not allowed to contain ul elements. – Turret
P elements can only contain inline elements such as a, abbr, acronym, b, bdo, big, br, button, cite, code, del, dfn, em, i, img, input, ins, kbd, label, map, object, q, samp, script, select, small, span, strong, sub, sup, textarea, tt and var. – Turret
I would recommend having lxml installed when using BeautifulSoup, as it appears to help greatly with markup repair (pip install lxml). BeautifulSoup will automatically choose lxml first for HTML parsing if it is available. – Bowse
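To illustrate why Tidy produces an empty paragraph, here is a rough stack-based sketch that closes unclosed tags and also ends an open <p> as soon as a block-level tag starts. It is not how Tidy or any of the parsers above is implemented. The names ParagraphAwareCloser, close_tags, BLOCK_TAGS and VOID_TAGS are invented for this example, both tag sets are small assumed subsets, and comments, doctypes and stray closing tags are simply dropped.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: a small subset of block-level tags that end an open <p>.
BLOCK_TAGS = {"blockquote", "div", "ol", "p", "pre", "table", "ul"}
# Assumption: a small subset of void tags, which are never closed.
VOID_TAGS = {"br", "hr", "img", "input"}


class ParagraphAwareCloser(HTMLParser):
    """Hypothetical helper that re-emits HTML with missing end tags added."""

    def __init__(self):
        super().__init__()
        self.stack = []  # names of the tags that are currently open
        self.out = []    # pieces of the repaired markup

    def _close_through(self, tag):
        # Close open tags innermost-first until `tag` itself has been closed.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS and "p" in self.stack:
            self._close_through("p")  # a block element ends the paragraph
        self.out.append(self.get_starttag_text())  # keep attributes as written
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            self._close_through(tag)
        # A closing tag with no matching open tag is dropped.

    def handle_data(self, data):
        # The parser has already decoded entities, so escape the text again.
        self.out.append(escape(data, quote=False))


def close_tags(html):
    closer = ParagraphAwareCloser()
    closer.feed(html)
    closer.close()
    closer._close_through(None)  # nothing matches None, so this closes everything
    return "".join(closer.out)


print(close_tags("<p><ul><li>Foo"))  # <p></p><ul><li>Foo</li></ul>
```

Apart from whitespace, the result for the question's input matches what Tidy prints with drop_empty_paras=False.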
W
10

Run it through Tidy or one of its ported libraries.

Try to code it by hand and you will want to gouge your eyes out.

Weasner answered 16/11, 2008 at 4:17 Comment(0)
A
10

Use html5lib; it works great. Like this:

from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4 html5lib
soup = BeautifulSoup(data, 'html5lib')

Aquamanile answered 23/8, 2017 at 7:8 Comment(1)
You're the man. I've been trying to figure out why I couldn't parse a particular website. Using 'html5lib' instead of 'html.parser' in soup(page_html, 'html.parser') did wonders. :) – Stonebroke
J
2

I tried the method below, but it failed on Python 3, because the old BeautifulSoup module only supports Python 2:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(page, 'html5lib')

The following, using the bs4 package, worked:

import bs4
soup = bs4.BeautifulSoup(html, 'html5lib')
f_html = soup.prettify()
print(f'Formatted html::: {f_html}')
Jersey answered 12/11, 2018 at 15:19 Comment(0)
A
1

I just ran into some HTML that lxml and pyquery could not handle well; it seems to contain errors. Since Tidy is not easy to install on Windows, I chose BeautifulSoup. But I found that:

from bs4 import BeautifulSoup
import lxml.html
soup = BeautifulSoup(page)
h = lxml.html.fromstring(soup.prettify())

behaves the same as h = lxml.html.fromstring(page).

What really solved my problem was soup = BeautifulSoup(page, 'html5lib').
You need to install html5lib first; then you can use it as a parser in BeautifulSoup. The html5lib parser seems to work much better than the others.

Hope this can help someone.

Amerson answered 17/9, 2015 at 9:38 Comment(0)
H
0

Encountered this issue in 2024 and found that none of the given solutions fixed my incorrectly nested tags.

I came up with my own solution (no third party library required). Please see ipyslides.xmd.TagFixer.

Full solution:

import re  # needed by _remove_empty_tags below
from html.parser import HTMLParser

class TagFixer(HTMLParser):
    "Use self.fix_html function."
    def handle_starttag(self, tag, attrs): 
        self._objs.append(f'{tag}')

    def handle_endtag(self, tag):
        if self._objs and self._objs[-1] == tag:
            self._objs.pop() # tag properly closed
        else:
            self._objs.append(f'/{tag}')

    def _fix_tags(self, content):
        tags = self._objs[::-1]  # Reverse order is important
        end_tags = [f"</{tag}>" for tag in tags if not tag.startswith('/')]
        start_tags = [f"<{tag.lstrip('/')}>" for tag in tags if tag.startswith('/')]
        return ''.join(start_tags) + content + ''.join(end_tags)
    
    def _remove_empty_tags(self, content):
        empty_tags = re.compile(r'\<\s*(.*?)\s*\>\s*\<\s*\/\s*(\1)\s*\>') # keeps tags with attributes
        i = 0
        while empty_tags.findall(content) and i <= 5: # As deep as 5 nested empty tags
            content = empty_tags.sub('', content).strip() # empty tags removed after fix
            i += 1
        return content

    def fix_html(self, content):
        self._objs = []
        self.feed(content)
        self.close()

        if not self._objs:
            return self._remove_empty_tags(content) # Already correct
        return self._remove_empty_tags(self._fix_tags(content))

tagfixer = TagFixer()

# Print the repaired markup; fix_html returns it as a string
print(tagfixer.fix_html("""
A messed up html, that can have 
closed tags without opened tags 
and it will clear empty tags 
upto 5 levels!"""))
Hairball answered 24/7 at 1:18 Comment(0)
