Remove HTML tags not on an allowed list from a Python string

I have a string containing text and HTML. I want to remove or otherwise disable some HTML tags, such as <script>, while allowing others, so that I can render it on a web page safely. I have a list of allowed tags; how can I process the string to remove any other tags?

Tallith answered 30/3, 2009 at 23:25 Comment(3)
It should also remove all attributes not whitelisted... consider <img src="heh.png" onload="(function(){/* do bad stuff */}());" /> – Authorized
... and also the useless empty tags, and maybe consecutive <br> tags – Strung
Note that the first two answers are dangerous, because it's very easy to hide XSS from BS/lxml. – Lunalunacy

Here's a simple solution using BeautifulSoup:

from bs4 import BeautifulSoup

VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

def sanitize_html(value):
    soup = BeautifulSoup(value, "html.parser")
    for tag in soup.find_all(True):
        if tag.name not in VALID_TAGS:
            tag.hidden = True  # hide the tag markup but keep its contents
    return soup.decode_contents()

If you want to remove the contents of the invalid tags as well, substitute tag.extract() for tag.hidden.
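
A quick sketch of the difference, with a made-up input (using the bs4 names from the snippet above):

from bs4 import BeautifulSoup

html = '<p>keep <script>alert("evil")</script> this</p>'

# hidden drops only the tag markup and keeps its contents...
soup = BeautifulSoup(html, "html.parser")
soup.script.hidden = True
print(soup.decode_contents())  # <p>keep alert("evil") this</p>

# ...while extract() removes the tag and everything inside it
soup = BeautifulSoup(html, "html.parser")
soup.script.extract()
print(soup.decode_contents())  # <p>keep  this</p>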

You might also look into using lxml and Tidy.

Takin answered 30/3, 2009 at 23:35 Comment(6)
Thanks, I didn't need this ATM, but knew I would need to find something like this in the future. – Ghazi
The import statement should probably be from BeautifulSoup import BeautifulSoup. – Educatory
You may also want to limit the use of attributes. To do so, just add this to the solution above: valid_attrs = 'href src'.split(), and inside the loop: tag.attrs = [(attr, val) for attr, val in tag.attrs if attr in valid_attrs]. HTH – Greek
This is not safe! See the answer by Chris Dost: #699968 – Palladian
This is awesome! One thing though: to install BeautifulSoup 4, run easy_install beautifulsoup4, then import it with from bs4 import BeautifulSoup. See crummy.com/software/BeautifulSoup/bs4/doc for details. – Nonstriated
The pages for the lxml and Tidy links are gone. – Towbin

Use lxml.html.clean! It's VERY easy!

from lxml.html.clean import clean_html
print(clean_html(html))

Suppose you have the following HTML:

html = '''\
<html>
 <head>
   <script type="text/javascript" src="evil-site"></script>
   <link rel="alternate" type="text/rss" src="evil-rss">
   <style>
     body {background-image: url(javascript:do_evil)};
     div {color: expression(evil)};
   </style>
 </head>
 <body onload="evil_function()">
    <!-- I am interpreted for EVIL! -->
   <a href="javascript:evil_function()">a link</a>
   <a href="#" onclick="evil_function()">another link</a>
   <p onclick="evil_function()">a paragraph</p>
   <div style="display: none">secret EVIL!</div>
   <object> of EVIL! </object>
   <iframe src="evil-site"></iframe>
   <form action="evil-site">
     Password: <input type="password" name="password">
   </form>
   <blink>annoying EVIL!</blink>
   <a href="evil-site">spam spam SPAM!</a>
   <image src="evil!">
 </body>
</html>'''

The results...

<html>
  <body>
    <div>
      <style>/* deleted */</style>
      <a href="">a link</a>
      <a href="#">another link</a>
      <p>a paragraph</p>
      <div>secret EVIL!</div>
      of EVIL!
      Password:
      annoying EVIL!
      <a href="evil-site">spam spam SPAM!</a>
      <img src="evil!">
    </div>
  </body>
</html>

You can customize which elements get cleaned, and more, through the options of the Cleaner class.
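
For example, here is a minimal whitelist-style sketch; allow_tags, remove_unknown_tags, remove_tags, and kill_tags are documented Cleaner options, while the tag lists themselves are just illustrations:

from lxml.html.clean import Cleaner

# allow_tags plus remove_unknown_tags=False turns the Cleaner into a
# whitelist: any other tag is stripped, though its text is kept.
# (remove_tags=['span'] would instead strip just the <span> markup,
# and kill_tags=['span'] would drop its contents too.)
cleaner = Cleaner(allow_tags=['p', 'a', 'em', 'strong'],
                  remove_unknown_tags=False)

print(cleaner.clean_html('<p>ok <span>no span</span> <em>kept</em></p>'))
# <p>ok no span <em>kept</em></p> (roughly)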

Misleading answered 23/4, 2010 at 23:43 Comment(4)
See the docstring for the lxml.html.clean.clean() method. It has plenty of options! – Auroraauroral
Note that this uses a blacklist approach to filter out evil bits, rather than a whitelist, but only a whitelisting approach can guarantee safety. – Katelin
@SørenLøvborg: The Cleaner also supports a whitelist, using allow_tags. – Ciapha
So nice! I like the default config, but let's say I want to add the removal of all <span>s; how do I do this? – Acquaintance

The above solutions via Beautiful Soup will not work. You might be able to hack something with Beautiful Soup above and beyond them, because Beautiful Soup provides access to the parse tree. In a while, I think I'll try to solve the problem properly, but it's a week-long project or so, and I don't have a free week soon.

Just to be specific: not only will Beautiful Soup throw exceptions for some parsing errors that the above code doesn't catch, but there are also plenty of very real XSS vulnerabilities that aren't caught, like:

<<script>script> alert("Haha, I hacked your page."); </</script>script>

Probably the best thing you can do instead is to escape the < character as &lt;, prohibiting all HTML, and then use a restricted subset like Markdown to render formatting properly. In particular, you can then go back and re-introduce common bits of HTML with a regex. Here's what the process looks like, roughly:

import re
from markdown import markdown

_lt_  = re.compile('<')
_tc_  = '~(lt)~'         # or whatever, so long as markdown doesn't mangle it
_tcq_ = re.escape(_tc_)  # the token contains regex metacharacters, so escape
                         # it before building patterns out of it
_ok_  = re.compile(_tcq_ + '(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))>', re.I)
_sqrt_ = re.compile(_tcq_ + 'sqrt>', re.I)      # just to give an example of extending
_endsqrt_ = re.compile(_tcq_ + '/sqrt>', re.I)  # html syntax with your own elements
_tcre_ = re.compile(_tcq_)

def sanitize(text):
    text = _lt_.sub(_tc_, text)
    text = markdown(text)
    text = _ok_.sub(r'<\1>', text)
    text = _sqrt_.sub(r'&radic;<span style="text-decoration:overline;">', text)
    text = _endsqrt_.sub(r'</span>', text)
    return _tcre_.sub('&lt;', text)

I haven't tested that code yet, so there may be bugs. But you see the general idea: you have to blacklist all HTML in general before you whitelist the ok stuff.
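
Equally untested, but assuming the markdown package is installed, usage would look something like this:

print(sanitize('<script>alert("evil")</script> but **this** <em>renders</em>'))
# the script tag survives only as inert &lt;script> text, while the
# whitelisted <em> is restored and **this** becomes <strong>this</strong>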

Bloodfin answered 1/5, 2009 at 19:5 Comment(1)
If you're trying this, first do: import re and from markdown import markdown. If you don't have markdown, you can try easy_install. – Fauman

Here is what I use in my own project. The acceptable_elements/attributes come from feedparser, and BeautifulSoup does the work.

from BeautifulSoup import BeautifulSoup

acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b', 'big',
      'blockquote', 'br', 'button', 'caption', 'center', 'cite', 'code', 'col',
      'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt', 'em',
      'font', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 
      'ins', 'kbd', 'label', 'legend', 'li', 'map', 'menu', 'ol', 
      'p', 'pre', 'q', 's', 'samp', 'small', 'span', 'strike',
      'strong', 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th',
      'thead', 'tr', 'tt', 'u', 'ul', 'var']

acceptable_attributes = ['abbr', 'accept', 'accept-charset', 'accesskey',
  'action', 'align', 'alt', 'axis', 'border', 'cellpadding', 'cellspacing',
  'char', 'charoff', 'charset', 'checked', 'cite', 'clear', 'cols',
  'colspan', 'color', 'compact', 'coords', 'datetime', 'dir', 
  'enctype', 'for', 'headers', 'height', 'href', 'hreflang', 'hspace',
  'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'method',
  'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt', 
  'rel', 'rev', 'rows', 'rowspan', 'rules', 'scope', 'shape', 'size',
  'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type',
  'usemap', 'valign', 'value', 'vspace', 'width']

def clean_html(fragment):
    while True:
        soup = BeautifulSoup(fragment)
        removed = False        
        for tag in soup.findAll(True): # find all tags
            if tag.name not in acceptable_elements:
                tag.extract() # remove the bad ones
                removed = True
            else: # it might have bad attributes
                # a better way to get all attributes?
                for attr in tag._getAttrMap().keys():
                    if attr not in acceptable_attributes:
                        del tag[attr]

        # turn it back to html
        fragment = unicode(soup)

        if removed:
            # we removed tags, and tricky markup could exploit that!
            # we need to reparse the html until it stops changing
            continue # next round

        return fragment

Some small tests to make sure this behaves correctly:

tests = [   #text should work
            ('<p>this is text</p>but this too', '<p>this is text</p>but this too'),
            # make sure we can't exploit removal of tags
            ('<<script></script>script> alert("Haha, I hacked your page."); <<script></script>/script>', ''),
            # try the same trick with attributes, gives an Exception
            ('<div on<script></script>load="alert("Haha, I hacked your page.");">1</div>',  Exception),
             # no tags should be skipped
            ('<script>bad</script><script>bad</script><script>bad</script>', ''),
            # leave valid tags but remove bad attributes
            ('<a href="good" onload="bad" onclick="bad" alt="good">1</div>', '<a href="good" alt="good">1</a>'),
]

for text, out in tests:
    try:
        res = clean_html(text)
        assert res == out, "%s => %s != %s" % (text, res, out)
    except out, e:
        assert isinstance(e, out), "Wrong exception %r" % e
Gnathonic answered 1/5, 2009 at 19:26 Comment(5)
This is not safe! See the answer by Chris Dost: #699968 – Palladian
@Thomas: Do you have anything to support that claim? Chris Dost's "unsafe" code actually just raises an Exception, so I guess you didn't actually try it. – Gnathonic
@THC4k: Sorry, I forgot to mention that I had to modify the example. Here's one that works: <<script></script>script> alert("Haha, I hacked your page."); <<script></script>script> – Palladian
Also, tag.extract() modifies a list that we're iterating over. That confuses the loop and causes it to skip the next child. – Palladian
@Thomas: Really nice catches! I think I fixed both issues, thanks a lot! – Gnathonic

Bleach does this better and has more useful options. It's built on html5lib and ready for production. Check the documentation for the bleach.clean function. Its default configuration escapes unsafe tags like <script> while allowing useful tags like <a>.

import bleach
bleach.clean("<script>evil</script> <a href='http://example.com'>example</a>")
# '&lt;script&gt;evil&lt;/script&gt; <a href="http://example.com">example</a>'
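
You can also pass your own whitelist; tags, attributes, and strip are documented parameters of bleach.clean, though the values here are only an example:

bleach.clean(
    '<em>hi</em> <img src="x.png"> <a href="/y" onclick="evil()">y</a>',
    tags=['a', 'em'],            # everything else is removed...
    attributes={'a': ['href']},  # ...and only href survives on <a>
    strip=True,                  # strip disallowed tags instead of escaping them
)
# '<em>hi</em>  <a href="/y">y</a>' (roughly)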
Chavira answered 26/10, 2013 at 9:41 Comment(2)
Does bleach still allow data: URLs via html5lib by default? One can embed a data: URL with a content type of HTML, for example. – Heyduck
2019, and struggling with this: #7539100 - for me, lxml.html.clean was more solid, removing style tags completely, whereas bleach leaves your CSS visible as content. – Repairman

I modified Bryan's solution with BeautifulSoup to address the problem raised by Chris Drost. A little crude, but it does the job:

from bs4 import BeautifulSoup, Comment

VALID_TAGS = {'strong': [],
              'em': [],
              'p': [],
              'ol': [],
              'ul': [],
              'li': [],
              'br': [],
              'a': ['href', 'title']
              }

def sanitize_html(value, valid_tags=VALID_TAGS):
    soup = BeautifulSoup(value, "html.parser")
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    # Some markup can be crafted to slip through BeautifulSoup's parser, so
    # we run this repeatedly until it generates the same output twice.
    newoutput = soup.decode_contents()
    while True:
        oldoutput = newoutput
        soup = BeautifulSoup(newoutput, "html.parser")
        for tag in soup.find_all(True):
            if tag.name not in valid_tags:
                tag.hidden = True
            else:
                tag.attrs = {attr: val for attr, val in tag.attrs.items()
                             if attr in valid_tags[tag.name]}
        newoutput = soup.decode_contents()
        if oldoutput == newoutput:
            break
    return newoutput

Edit: Updated to support valid attributes.
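
Usage looks roughly like this (the input is made up):

print(sanitize_html('<a href="/ok" onclick="evil()">link</a> <script>bad</script>'))
# <a href="/ok">link</a> bad
# the script markup is gone but its text is kept; switch tag.hidden = True
# to tag.extract() if you want the contents dropped as well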

Maturation answered 9/3, 2011 at 12:56 Comment(1)
tag.attrs is a dict in bs4, so the attribute filter should be tag.attrs = {attr: value for attr, value in tag.attrs.items() if attr in valid_tags[tag.name]}. Use bs4. – Zed

I use FilterHTML. It's simple and lets you define a well-controlled whitelist, scrubs URLs, and can even match attribute values against a regex or apply custom filtering functions per attribute. If used carefully, it could be a safe solution. Here's a simplified example from the readme:

import FilterHTML

# only allow:
#   <a> tags with valid href URLs
#   <img> tags with valid src URLs and measurements
whitelist = {
  'a': {
    'href': 'url',
    'target': [
      '_blank',
      '_self'
    ],
    'class': [
      'button'
    ]
  },
  'img': {
    'src': 'url',
    'width': 'measurement',
    'height': 'measurement'
  },
}

filtered_html = FilterHTML.filter_html(unfiltered_html, whitelist)
Deforce answered 18/2, 2013 at 0:37 Comment(0)

You could use html5lib, which uses a whitelist to sanitize.

An example:

import html5lib
from html5lib import sanitizer, treebuilders, treewalkers, serializer

def clean_html(buf):
    """Cleans HTML of dangerous tags and content."""
    buf = buf.strip()
    if not buf:
        return buf

    p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"),
            tokenizer=sanitizer.HTMLSanitizer)
    dom_tree = p.parseFragment(buf)

    walker = treewalkers.getTreeWalker("dom")
    stream = walker(dom_tree)

    s = serializer.htmlserializer.HTMLSerializer(
            omit_optional_tags=False,
            quote_attr_values=True)
    return s.render(stream) 
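
With the older, pre-1.0 html5lib API this answer targets, disallowed markup should come back escaped rather than removed, roughly like this:

print(clean_html('<b>ok</b><script>alert("evil")</script>'))
# <b>ok</b>&lt;script&gt;alert("evil")&lt;/script&gt;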
Raceway answered 20/3, 2013 at 23:21 Comment(3)
Why does sanitizer_factory exist? You should pass HTMLSanitizer directly. – Unreasonable
@ChrisMorgan: Good question. I think I got this example from the html5lib site, and they were doing something to the sanitizer in the factory before returning it. But what they were doing was in the dev version and didn't work in the released version, so I just removed the line. It does look weird here. I'll research it and possibly update the answer. – Raceway
@ChrisMorgan: It looks like the feature I was referring to (stripping tokens instead of escaping them) never made it upstream, so I just removed the factory business. Thanks. – Raceway

I prefer the lxml.html.clean solution, like nosklo points out. Here's a variant that also removes some empty tags:

from lxml import etree
from lxml.html import clean, fromstring, tostring

remove_attrs = ['class']
remove_tags = ['table', 'tr', 'td']
nonempty_tags = ['a', 'p', 'span', 'div']

cleaner = clean.Cleaner(remove_tags=remove_tags)

def squeaky_clean(html):
    clean_html = cleaner.clean_html(html)
    # now remove the useless empty tags
    root = fromstring(clean_html)
    context = etree.iterwalk(root)  # defaults to just the 'end' tag event
    for action, elem in context:
        clean_text = elem.text and elem.text.strip(' \t\r\n')
        if elem.tag in nonempty_tags and \
                not (len(elem) or clean_text):  # no children nor text
            elem.getparent().remove(elem)
            continue
        elem.text = clean_text  # if you want
        # and if you also want to remove some attrs:
        for badattr in remove_attrs:
            if badattr in elem.attrib:
                del elem.attrib[badattr]
    return tostring(root)
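
A rough usage example (made-up input; depending on Python and lxml versions, tostring may return bytes):

print(squeaky_clean('<div><span>  </span><p class="x">hi</p></div>'))
# the empty <span> is dropped and the class attribute removed:
# <div><p>hi</p></div>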
Strung answered 8/1, 2011 at 12:38 Comment(3)
It is better to use "return _transform_result(type(clean_html), root)" instead of "return tostring(root)". It will handle the type check. – Mientao
@luckyjazzbo: Yeah, but then I'd be using a method that starts with an underscore. Those are private implementation details and shouldn't be used, because they might change in a future version of lxml. – Misleading
Apparently correct: _transform_result does not exist (any more) in lxml today. – Mullet
