Remove HTML tags not on an allowed list from a Python string

I have a string containing text and HTML. I want to remove or otherwise disable some HTML tags, such as <script>, while allowing others, so that I can render it on a web page safely. I have a list of allowed tags; how can I process the string to remove any other tags?

Tallith answered 30/3, 2009 at 23:25 Comment(3)
It should also remove all attributes not whitelisted... consider <img src="heh.png" onload="(function(){/* do bad stuff */}());" /> – Authorized
... and also the useless empty tags, and maybe consecutive <br> tags – Strung
Note that the first two answers are dangerous, because it's very easy to hide XSS from BS/lxml. – Lunalunacy

Here's a simple solution using BeautifulSoup:

from bs4 import BeautifulSoup

VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

def sanitize_html(value):
    soup = BeautifulSoup(value, "html.parser")
    for tag in soup.find_all(True):
        if tag.name not in VALID_TAGS:
            tag.hidden = True  # hide the tag markup but keep its contents
    return soup.decode_contents()

If you want to remove the contents of the invalid tags as well, substitute tag.extract() for tag.hidden.
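
A quick sketch of the difference, with a made-up input (using the bs4 names from the snippet above):

from bs4 import BeautifulSoup

html = '<p>keep <script>alert("evil")</script> this</p>'

# hidden drops only the tag markup and keeps its contents...
soup = BeautifulSoup(html, "html.parser")
soup.script.hidden = True
print(soup.decode_contents())  # <p>keep alert("evil") this</p>

# ...while extract() removes the tag and everything inside it
soup = BeautifulSoup(html, "html.parser")
soup.script.extract()
print(soup.decode_contents())  # <p>keep  this</p>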

You might also look into using lxml and Tidy.

Takin answered 30/3, 2009 at 23:35 Comment(6)
Thanks, I didn't need this ATM, but knew I would need to find something like this in the future. – Ghazi
The import statement should probably be from BeautifulSoup import BeautifulSoup. – Educatory
You may also want to limit the use of attributes. To do so, just add this to the solution above: valid_attrs = 'href src'.split(), and inside the loop: tag.attrs = [(attr, val) for attr, val in tag.attrs if attr in valid_attrs]. HTH – Greek
This is not safe! See the answer by Chris Dost: #699968 – Palladian
This is awesome! One thing though: to install BeautifulSoup 4, run easy_install beautifulsoup4, then import it with from bs4 import BeautifulSoup. See crummy.com/software/BeautifulSoup/bs4/doc for details. – Nonstriated
The pages for the lxml and Tidy links are gone. – Towbin

Use lxml.html.clean! It's VERY easy!

from lxml.html.clean import clean_html
print(clean_html(html))

Suppose you have the following HTML:

html = '''\
<html>
 <head>
   <script type="text/javascript" src="evil-site"></script>
   <link rel="alternate" type="text/rss" src="evil-rss">
   <style>
     body {background-image: url(javascript:do_evil)};
     div {color: expression(evil)};
   </style>
 </head>
 <body onload="evil_function()">
    <!-- I am interpreted for EVIL! -->
   <a href="javascript:evil_function()">a link</a>
   <a href="#" onclick="evil_function()">another link</a>
   <p onclick="evil_function()">a paragraph</p>
   <div style="display: none">secret EVIL!</div>
   <object> of EVIL! </object>
   <iframe src="evil-site"></iframe>
   <form action="evil-site">
     Password: <input type="password" name="password">
   </form>
   <blink>annoying EVIL!</blink>
   <a href="evil-site">spam spam SPAM!</a>
   <image src="evil!">
 </body>
</html>'''

The results...

<html>
  <body>
    <div>
      <style>/* deleted */</style>
      <a href="">a link</a>
      <a href="#">another link</a>
      <p>a paragraph</p>
      <div>secret EVIL!</div>
      of EVIL!
      Password:
      annoying EVIL!
      <a href="evil-site">spam spam SPAM!</a>
      <img src="evil!">
    </div>
  </body>
</html>

You can customize which elements get cleaned, and more, through the options of the Cleaner class.
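
For example, here is a minimal whitelist-style sketch; allow_tags, remove_unknown_tags, remove_tags, and kill_tags are documented Cleaner options, while the tag lists themselves are just illustrations:

from lxml.html.clean import Cleaner

# allow_tags plus remove_unknown_tags=False turns the Cleaner into a
# whitelist: any other tag is stripped, though its text is kept.
# (remove_tags=['span'] would instead strip just the <span> markup,
# and kill_tags=['span'] would drop its contents too.)
cleaner = Cleaner(allow_tags=['p', 'a', 'em', 'strong'],
                  remove_unknown_tags=False)

print(cleaner.clean_html('<p>ok <span>no span</span> <em>kept</em></p>'))
# <p>ok no span <em>kept</em></p> (roughly)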

Misleading answered 23/4, 2010 at 23:43 Comment(4)
See the docstring for the lxml.html.clean.clean() method. It has plenty of options! – Auroraauroral
Note that this uses a blacklist approach to filter out evil bits, rather than a whitelist, but only a whitelisting approach can guarantee safety. – Katelin
@SørenLøvborg: The Cleaner also supports a whitelist, using allow_tags. – Ciapha
So nice! I like the default config, but let's say I want to add the removal of all <span>s; how do I do this? – Acquaintance

The above solutions via Beautiful Soup will not work. You might be able to hack something with Beautiful Soup above and beyond them, because Beautiful Soup provides access to the parse tree. In a while, I think I'll try to solve the problem properly, but it's a week-long project or so, and I don't have a free week soon.

Just to be specific: not only will Beautiful Soup throw exceptions for some parsing errors that the above code doesn't catch, but there are also plenty of very real XSS vulnerabilities that aren't caught, like:

<<script>script> alert("Haha, I hacked your page."); </</script>script>

Probably the best thing you can do instead is to escape the < character as &lt;, prohibiting all HTML, and then use a restricted subset like Markdown to render formatting properly. In particular, you can then go back and re-introduce common bits of HTML with a regex. Here's what the process looks like, roughly:

import re
from markdown import markdown

_lt_  = re.compile('<')
_tc_  = '~(lt)~'         # or whatever, so long as markdown doesn't mangle it
_tcq_ = re.escape(_tc_)  # the token contains regex metacharacters, so escape
                         # it before building patterns out of it
_ok_  = re.compile(_tcq_ + '(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))>', re.I)
_sqrt_ = re.compile(_tcq_ + 'sqrt>', re.I)      # just to give an example of extending
_endsqrt_ = re.compile(_tcq_ + '/sqrt>', re.I)  # html syntax with your own elements
_tcre_ = re.compile(_tcq_)

def sanitize(text):
    text = _lt_.sub(_tc_, text)
    text = markdown(text)
    text = _ok_.sub(r'<\1>', text)
    text = _sqrt_.sub(r'&radic;<span style="text-decoration:overline;">', text)
    text = _endsqrt_.sub(r'</span>', text)
    return _tcre_.sub('&lt;', text)

I haven't tested that code yet, so there may be bugs. But you see the general idea: you have to blacklist all HTML in general before you whitelist the ok stuff.
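
Equally untested, but assuming the markdown package is installed, usage would look something like this:

print(sanitize('<script>alert("evil")</script> but **this** <em>renders</em>'))
# the script tag survives only as inert &lt;script> text, while the
# whitelisted <em> is restored and **this** becomes <strong>this</strong>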

Bloodfin answered 1/5, 2009 at 19:5 Comment(1)
If you're trying this, first do: import re and from markdown import markdown. If you don't have markdown, you can try easy_install. – Fauman

Here is what I use in my own project. The acceptable_elements/attributes come from feedparser, and BeautifulSoup does the work.

from BeautifulSoup import BeautifulSoup

acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b', 'big',
      'blockquote', 'br', 'button', 'caption', 'center', 'cite', 'code', 'col',
      'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt', 'em',
      'font', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 
      'ins', 'kbd', 'label', 'legend', 'li', 'map', 'menu', 'ol', 
      'p', 'pre', 'q', 's', 'samp', 'small', 'span', 'strike',
      'strong', 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th',
      'thead', 'tr', 'tt', 'u', 'ul', 'var']

acceptable_attributes = ['abbr', 'accept', 'accept-charset', 'accesskey',
  'action', 'align', 'alt', 'axis', 'border', 'cellpadding', 'cellspacing',
  'char', 'charoff', 'charset', 'checked', 'cite', 'clear', 'cols',
  'colspan', 'color', 'compact', 'coords', 'datetime', 'dir', 
  'enctype', 'for', 'headers', 'height', 'href', 'hreflang', 'hspace',
  'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'method',
  'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt', 
  'rel', 'rev', 'rows', 'rowspan', 'rules', 'scope', 'shape', 'size',
  'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type',
  'usemap', 'valign', 'value', 'vspace', 'width']

def clean_html(fragment):
    while True:
        soup = BeautifulSoup(fragment)
        removed = False        
        for tag in soup.findAll(True): # find all tags
            if tag.name not in acceptable_elements:
                tag.extract() # remove the bad ones
                removed = True
            else: # it might have bad attributes
                # a better way to get all attributes?
                for attr in tag._getAttrMap().keys():
                    if attr not in acceptable_attributes:
                        del tag[attr]

        # turn it back to html
        fragment = unicode(soup)

        if removed:
            # we removed tags, and tricky markup could exploit that!
            # we need to reparse the html until it stops changing
            continue # next round

        return fragment

Some small tests to make sure this behaves correctly:

tests = [   #text should work
            ('<p>this is text</p>but this too', '<p>this is text</p>but this too'),
            # make sure we can't exploit removal of tags
            ('<<script></script>script> alert("Haha, I hacked your page."); <<script></script>/script>', ''),
            # try the same trick with attributes, gives an Exception
            ('<div on<script></script>load="alert("Haha, I hacked your page.");">1</div>',  Exception),
             # no tags should be skipped
            ('<script>bad</script><script>bad</script><script>bad</script>', ''),
            # leave valid tags but remove bad attributes
            ('<a href="good" onload="bad" onclick="bad" alt="good">1</div>', '<a href="good" alt="good">1</a>'),
]

for text, out in tests:
    try:
        res = clean_html(text)
        assert res == out, "%s => %s != %s" % (text, res, out)
    except out, e:
        assert isinstance(e, out), "Wrong exception %r" % e
Gnathonic answered 1/5, 2009 at 19:26 Comment(5)
This is not safe! See the answer by Chris Dost: #699968 – Palladian
@Thomas: Do you have anything to support that claim? Chris Dost's "unsafe" code actually just raises an Exception, so I guess you didn't actually try it. – Gnathonic
@THC4k: Sorry, I forgot to mention that I had to modify the example. Here's one that works: <<script></script>script> alert("Haha, I hacked your page."); <<script></script>script> – Palladian
Also, tag.extract() modifies a list that we're iterating over. That confuses the loop and causes it to skip the next child. – Palladian
@Thomas: Really nice catches! I think I fixed both issues, thanks a lot! – Gnathonic

Bleach does this better and has more useful options. It's built on html5lib and ready for production. Check the documentation for the bleach.clean function. Its default configuration escapes unsafe tags like <script> while allowing useful tags like <a>.

import bleach
bleach.clean("<script>evil</script> <a href='http://example.com'>example</a>")
# '&lt;script&gt;evil&lt;/script&gt; <a href="http://example.com">example</a>'
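
You can also pass your own whitelist; tags, attributes, and strip are documented parameters of bleach.clean, though the values here are only an example:

bleach.clean(
    '<em>hi</em> <img src="x.png"> <a href="/y" onclick="evil()">y</a>',
    tags=['a', 'em'],            # everything else is removed...
    attributes={'a': ['href']},  # ...and only href survives on <a>
    strip=True,                  # strip disallowed tags instead of escaping them
)
# '<em>hi</em>  <a href="/y">y</a>' (roughly)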
Chavira answered 26/10, 2013 at 9:41 Comment(2)
Does bleach still allow data: URLs via html5lib by default? One can embed a data: URL with a content type of HTML, for example. – Heyduck
2019, and struggling with this: #7539100 - for me, lxml.html.clean was more solid, removing style tags completely, whereas bleach leaves your CSS visible as content. – Repairman

I modified Bryan's solution with BeautifulSoup to address the problem raised by Chris Drost. A little crude, but it does the job:

from bs4 import BeautifulSoup, Comment

VALID_TAGS = {'strong': [],
              'em': [],
              'p': [],
              'ol': [],
              'ul': [],
              'li': [],
              'br': [],
              'a': ['href', 'title']
              }

def sanitize_html(value, valid_tags=VALID_TAGS):
    soup = BeautifulSoup(value, "html.parser")
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    # Some markup can be crafted to slip through BeautifulSoup's parser, so
    # we run this repeatedly until it generates the same output twice.
    newoutput = soup.decode_contents()
    while True:
        oldoutput = newoutput
        soup = BeautifulSoup(newoutput, "html.parser")
        for tag in soup.find_all(True):
            if tag.name not in valid_tags:
                tag.hidden = True
            else:
                tag.attrs = {attr: val for attr, val in tag.attrs.items()
                             if attr in valid_tags[tag.name]}
        newoutput = soup.decode_contents()
        if oldoutput == newoutput:
            break
    return newoutput

Edit: Updated to support valid attributes.
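
Usage looks roughly like this (the input is made up):

print(sanitize_html('<a href="/ok" onclick="evil()">link</a> <script>bad</script>'))
# <a href="/ok">link</a> bad
# the script markup is gone but its text is kept; switch tag.hidden = True
# to tag.extract() if you want the contents dropped as well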

Maturation answered 9/3, 2011 at 12:56 Comment(1)
tag.attrs is a dict in bs4, so the attribute filter should be tag.attrs = {attr: value for attr, value in tag.attrs.items() if attr in valid_tags[tag.name]}. Use bs4. – Zed

I use FilterHTML. It's simple and lets you define a well-controlled whitelist, scrubs URLs, and can even match attribute values against a regex or apply custom filtering functions per attribute. If used carefully, it could be a safe solution. Here's a simplified example from the readme:

import FilterHTML

# only allow:
#   <a> tags with valid href URLs
#   <img> tags with valid src URLs and measurements
whitelist = {
  'a': {
    'href': 'url',
    'target': [
      '_blank',
      '_self'
    ],
    'class': [
      'button'
    ]
  },
  'img': {
    'src': 'url',
    'width': 'measurement',
    'height': 'measurement'
  },
}

filtered_html = FilterHTML.filter_html(unfiltered_html, whitelist)
Deforce answered 18/2, 2013 at 0:37 Comment(0)

You could use html5lib, which uses a whitelist to sanitize.

An example:

import html5lib
from html5lib import sanitizer, treebuilders, treewalkers, serializer

def clean_html(buf):
    """Cleans HTML of dangerous tags and content."""
    buf = buf.strip()
    if not buf:
        return buf

    p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"),
            tokenizer=sanitizer.HTMLSanitizer)
    dom_tree = p.parseFragment(buf)

    walker = treewalkers.getTreeWalker("dom")
    stream = walker(dom_tree)

    s = serializer.htmlserializer.HTMLSerializer(
            omit_optional_tags=False,
            quote_attr_values=True)
    return s.render(stream) 
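
With the older, pre-1.0 html5lib API this answer targets, disallowed markup should come back escaped rather than removed, roughly like this:

print(clean_html('<b>ok</b><script>alert("evil")</script>'))
# <b>ok</b>&lt;script&gt;alert("evil")&lt;/script&gt;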
Raceway answered 20/3, 2013 at 23:21 Comment(3)
Why does sanitizer_factory exist? You should pass HTMLSanitizer directly. – Unreasonable
@ChrisMorgan: Good question. I think I got this example from the html5lib site, and they were doing something to the sanitizer in the factory before returning it. But what they were doing was in the dev version and didn't work in the released version, so I just removed the line. It does look weird here. I'll research it and possibly update the answer. – Raceway
@ChrisMorgan: It looks like the feature I was referring to (stripping tokens instead of escaping them) never made it upstream, so I just removed the factory business. Thanks. – Raceway

I prefer the lxml.html.clean solution, like nosklo points out. Here's a variant that also removes some empty tags:

from lxml import etree
from lxml.html import clean, fromstring, tostring

remove_attrs = ['class']
remove_tags = ['table', 'tr', 'td']
nonempty_tags = ['a', 'p', 'span', 'div']

cleaner = clean.Cleaner(remove_tags=remove_tags)

def squeaky_clean(html):
    clean_html = cleaner.clean_html(html)
    # now remove the useless empty tags
    root = fromstring(clean_html)
    context = etree.iterwalk(root)  # defaults to just the 'end' tag event
    for action, elem in context:
        clean_text = elem.text and elem.text.strip(' \t\r\n')
        if elem.tag in nonempty_tags and \
                not (len(elem) or clean_text):  # no children nor text
            elem.getparent().remove(elem)
            continue
        elem.text = clean_text  # if you want
        # and if you also want to remove some attrs:
        for badattr in remove_attrs:
            if badattr in elem.attrib:
                del elem.attrib[badattr]
    return tostring(root)
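
A rough usage example (made-up input; depending on Python and lxml versions, tostring may return bytes):

print(squeaky_clean('<div><span>  </span><p class="x">hi</p></div>'))
# the empty <span> is dropped and the class attribute removed:
# <div><p>hi</p></div>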
Strung answered 8/1, 2011 at 12:38 Comment(3)
It is better to use "return _transform_result(type(clean_html), root)" instead of "return tostring(root)". It will handle the type check. – Mientao
@luckyjazzbo: Yeah, but then I'd be using a method that starts with an underscore. Those are private implementation details and shouldn't be used, because they might change in a future version of lxml. – Misleading
Apparently correct: _transform_result does not exist (any more) in lxml today. – Mullet
