Remove contents of <style>...</style> tags using html5lib or bleach

I've been using the excellent bleach library for removing bad HTML.

I've got a load of HTML documents which have been pasted in from Microsoft Word, and contain things like:

<STYLE> st1:*{behavior:url(#ieooui) } </STYLE>

Using bleach (with the style tag implicitly disallowed) leaves me with:

st1:*{behavior:url(#ieooui) }

Which isn't helpful. Bleach seems only to have options to:

  • Escape tags;
  • Remove the tags (but not their contents).

I'm looking for a third option - remove the tags and their contents.

Is there any way to use bleach or html5lib to completely remove the style tag and its contents? The documentation for html5lib isn't really a great deal of help.

Bluh answered 24/9, 2011 at 11:0 Comment(1)
I wonder if bleach is capable of achieving this in 2019? According to the docs, I don't think so. - Bucolic

It turned out lxml was a better tool for this task:

from lxml.html.clean import Cleaner

def clean_word_text(text):
    # The only thing I need Cleaner for is to clear out the contents of
    # <style>...</style> tags
    cleaner = Cleaner(style=True)
    return cleaner.clean_html(text)
Bluh answered 24/9, 2011 at 21:0 Comment(1)
Beware: using Cleaner(style=True) also means accepting Cleaner's many other default settings, no? - Bucolic

I was able to strip the contents of tags using a filter based on this approach: https://bleach.readthedocs.io/en/latest/clean.html?highlight=strip#html5lib-filters-filters. It does leave an empty <style></style> in the output, but that's harmless.

from bleach.sanitizer import Cleaner
from bleach.html5lib_shim import Filter

class StyleTagFilter(Filter):
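    """
    Blank out the text inside <style> elements, leaving an empty tag pair.

    Based on the html5lib filter mechanism described at:
    https://bleach.readthedocs.io/en/latest/clean.html?highlight=strip#html5lib-filters-filters
    """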

    def __iter__(self):
        in_style_tag = False
        for token in Filter.__iter__(self):
            if token["type"] == "StartTag" and token["name"] == "style":
                in_style_tag = True
            elif token["type"] == "EndTag" and token["name"] == "style":
                in_style_tag = False
            elif in_style_tag and token["type"] in ("Characters", "SpaceCharacters"):
                # Inside a style tag: blank the text, keep the token itself
                token["data"] = ""
            yield token


# You must include "style" in the tags list
cleaner = Cleaner(tags=["div", "style"], strip=True, filters=[StyleTagFilter])
cleaned = cleaner.clean("<div><style>.some_style { font-weight: bold; }</style>Some text</div>")

assert cleaned == "<div><style></style>Some text</div>"
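
If the leftover empty <style></style> pair is unwanted, the same idea can probably be pushed one step further by skipping the style tokens instead of blanking them. The following is a minimal sketch, not part of bleach itself: DropStyleElementFilter and strip_style_elements are hypothetical names introduced here, and it assumes a bleach version that accepts the filters argument shown above.

```python
from bleach.sanitizer import Cleaner
from bleach.html5lib_shim import Filter


class DropStyleElementFilter(Filter):
    """Hypothetical variant of StyleTagFilter: drops <style> elements entirely."""

    def __iter__(self):
        in_style_tag = False
        for token in Filter.__iter__(self):
            # Only tag tokens carry a "name"; text and comment tokens do not.
            is_style = token.get("name") == "style"
            if token["type"] == "StartTag" and is_style:
                in_style_tag = True
                continue  # skip the opening <style>
            if token["type"] == "EndTag" and is_style:
                in_style_tag = False
                continue  # skip the closing </style>
            if in_style_tag:
                continue  # skip everything between the two
            yield token


def strip_style_elements(html):
    # "style" must still be in the allowed tags. Otherwise the sanitizer
    # strips the tags first and the CSS reaches this filter as plain text.
    cleaner = Cleaner(
        tags=["div", "style"],
        strip=True,
        filters=[DropStyleElementFilter],
    )
    return cleaner.clean(html)


cleaned = strip_style_elements(
    "<div><style>.some_style { font-weight: bold; }</style>Some text</div>"
)
```

With the same input as above, cleaned should come out as "<div>Some text</div>". The tags list here only allows div and style, so extend it to whatever your documents actually need.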
Galligaskins answered 27/5, 2021 at 23:15 Comment(0)