Strip HTML from strings in Python
from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
  print line

When printing a line in an HTML file, I'm trying to find a way to only show the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.example">some text</a>', it will only print 'some text', '<b>hello</b>' prints 'hello', etc. How would one go about doing this?

Expressman asked 15/4, 2009 at 18:24 Comment(3)
An important consideration is how to handle HTML entities (e.g. &amp;). You can either 1) remove them along with the tags (often undesirable, and unnecessary as they are equivalent to plain text), 2) leave them unchanged (a suitable solution if the stripped text is going right back into an HTML context) or 3) decode them to plain text (if the stripped text is going into a database or some other non-HTML context, or if your web framework automatically performs HTML escaping of text for you).Bassarisk
for @SørenLøvborg point 2): #753552Sociometry
The top answer here, which was used by the Django project until March 2014, has been found to be insecure against cross-site scripting - see that link for an example that makes it through. I recommend using Bleach.clean(), MarkupSafe's striptags, or a recent version of Django's strip_tags.Hillel

I always used this function to strip HTML tags, as it requires only the Python stdlib:

For Python 3:

from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
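
A quick sanity check with the Python 3 version (expected output shown in the comments):

print(strip_tags('<a href="whatever.example">some text</a>'))  # some text
print(strip_tags('<b>hello</b>, world'))                       # hello, world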

For Python 2:

from HTMLParser import HTMLParser
from StringIO import StringIO

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
Penutian answered 29/5, 2009 at 11:47 Comment(18)
Beautiful Soup is awesome, but thanks so much for this also, this worked perfectly for me.Response
Two years+ later, facing the same issue, and this is a far more elegant solution. Only change I made was to return self.fed as a list, rather than joining it, so I could step through the element contents.Expressman
Note that this strips HTML entities (e.g. &amp;) as well as tags.Bassarisk
@surya I'm sure you've seen thisCondescending
I changed return ''.join(self.fed) into return self.fed. Now it's perfect.Campbellite
Am I able to add exceptions to the rule? Like if I'd like to keep all <body> or <head> tags?Intemperance
Thanks for the great answer. One thing to note for those of you using newer versions of Python (3.2+) is that you'll need to call the parent class's __init__ function. See here: #11061558.Lotuseater
This answer has been found appropriate for other questions in the same area as well.Drynurse
This is insecure and needs to be updated! Basically, run it in a loop until its output doesn't change anymore.Hillel
You may want to be aware that HtmlParser's entityref is like entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]') in Python 2.7.Fafnir
To keep the html entities (converted to unicode), I added two lines: parser = HTMLParser() and html = parser.unescape(html) to the beginning of the strip_tags function.Novgorod
the source code for Python 2.7: hg.python.org/cpython/file/2.7/Lib/HTMLParser.pyGateshead
The code removes &s in my input text, which is not what I want it to do. I only want it to remove html tags.Bonhomie
@Ellof and others, I used the same code and got the following error. UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 1006: character maps to <undefined>Glabrate
One of my coworkers found <<sc<script>script>alert(1)<</sc</script>/script>. If you pass that through this code, the output will be <script>alert(1)</script>. To be sure, I wrapped your solution with html.escape() to make sure no tags are left in the output.Horsa
It happened fairly often to me that two words became a single word. You can prevent that by changing get_data to return ' '.join([x for x in self.fed if not x.isspace()])Rammer
How would one preserve spaces between HTML elements? Preventing <p>...</p><p>...</p> from getting mushed together?Gearard
Warning; This method is not sufficient to stop bad-actors. Eg <<b>b>Bold!<</b>/b> becomes <b>Bold!</b>.Parthinia

If you need to strip HTML tags to do text processing, a simple regular expression will do. Do not use this if you are looking to sanitize user-generated HTML to prevent XSS attacks. It is not a secure way to remove all <script> tags or tracking <img>s. The following regular expression will fairly reliably strip most HTML tags:

import re

re.sub('<[^<]+?>', '', text)

For those who don't understand regex, this searches for a string <...>, where the inner content is made of one or more (+) characters that aren't a <. The ? means that it will match the smallest string it can find. For example, given <p>Hello</p>, it will match <p> and </p> separately with the ?. Without it, it will match the entire string <..Hello..>.

If a non-tag < appears in the HTML (e.g. 2 < 3), it should be written as the entity &lt; anyway, so the [^<] restriction may be unnecessary.
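
For example (illustration only, not sanitization):

import re

text = '<td><a href="whatever.example">link text</a></td>'
print(re.sub('<[^<]+?>', '', text))  # link text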

Guelders answered 2/2, 2011 at 1:9 Comment(11)
This is almost exactly how Django's strip_tags does it.Gules
Note that this leaves HTML entities (e.g. &amp;) unchanged in the output.Bassarisk
One can still trick this method with something like this: <script<script>>alert("Hi!")<</script>/script>Trichocyst
I suppose it would be up to the user to make sure the input is proper html.Guelders
DON'T DO IT THIS WAY! As @Julio Garcia says, it is NOT SAFE!Hillel
@Hillel you could use that regex to strip "non-malicious" tags, then encode < as &lt; to make sure it's actually safe.Kweisui
People, do not confuse HTML stripping and HTML sanitizing. Yes, for broken or malicious input this answer may produce output with HTML tags in it. It's still a perfectly valid approach to strip HTML tags. However, stripping HTML tags is not a valid substitution for proper HTML sanitizing. The rule is not hard: Any time you insert a plain-text string into HTML output, you should always HTML escape it (using cgi.escape(s, True)), even if you "know" that it doesn't contain HTML (e.g. because you stripped HTML content). However, this is not what OP asked about.Bassarisk
@JulioGarcía using [^>] instead of [^<] make it a little betterGrinder
you need to import reTula
It seems that Django now uses the method outlined in the accepted answer of this post: github.com/django/django/blob/…Digraph
re: comment "you should always HTML escape it (using cgi.escape(s, True))" -- cgi.escape: "Deprecated since version 3.2.... Use html.escape() instead." Removed in 3.8 -- so for 3.8+, use html.escape().N

You can use BeautifulSoup's get_text() feature.

from bs4 import BeautifulSoup

html_str = '''
<td><a href="http://www.fakewebsite.example">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.example">I am waiting....</a>
</td>
'''
soup = BeautifulSoup(html_str)

print(soup.get_text())
#or via attribute of Soup Object: print(soup.text)

It is advisable to explicitly specify the parser, for example as BeautifulSoup(html_str, features="html.parser"), for the output to be reproducible.
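
For example, a minimal variant of the snippet above with the parser pinned (same html_str as before):

soup = BeautifulSoup(html_str, features="html.parser")
print(soup.get_text())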

Pule answered 30/12, 2015 at 15:31 Comment(1)
It is now mandatory to set the parserBertilla

Short version!

import re, html
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

# Remove well-formed tags, fixing mistakes by legitimate users
no_tags = tag_re.sub('', user_input)

# Clean up anything else by escaping
ready_for_web = html.escape(no_tags)
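
For example, if user_input were 'I <b>really</b> like 2 < 3', no_tags becomes 'I really like 2 < 3' and ready_for_web becomes 'I really like 2 &lt; 3'.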

Regex source: MarkupSafe. Their version handles HTML entities too, while this quick one doesn't.

Why can't I just strip the tags and leave it?

It's one thing to keep people from <i>italicizing</i> things, without leaving stray <i>s floating around. But it's another to take arbitrary input and make it completely harmless. Most of the techniques on this page will leave things like unclosed comments (<!--) and angle-brackets that aren't part of tags (blah <<<><blah) intact. The HTMLParser version can even leave complete tags in, if they're inside an unclosed comment.

What if your template is {{ firstname }} {{ lastname }}? firstname = '<a' and lastname = 'href="http://evil.example/">' will be let through by every tag stripper on this page (except @Medeiros!), because they're not complete tags on their own. Stripping out normal HTML tags is not enough.

Django's strip_tags, an improved (see next heading) version of the top answer to this question, gives the following warning:

Absolutely NO guarantee is provided about the resulting string being HTML safe. So NEVER mark safe the result of a strip_tags call without escaping it first, for example with escape().

Follow their advice!

To strip tags with HTMLParser, you have to run it multiple times.

It's easy to circumvent the top answer to this question.

Look at this string (source and discussion):

<img<!-- --> src=x onerror=alert(1);//><!-- -->

The first time HTMLParser sees it, it can't tell that the <img...> is a tag. It looks broken, so HTMLParser doesn't get rid of it. It only takes out the <!-- comments -->, leaving you with

<img src=x onerror=alert(1);//>

This problem was disclosed to the Django project in March, 2014. Their old strip_tags was essentially the same as the top answer to this question. Their new version basically runs it in a loop until running it again doesn't change the string:

# _strip_once runs HTMLParser once, pulling out just the text of all the nodes.

def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    # Note: in typical case this loop executes _strip_once once. Loop condition
    # is redundant, but helps to reduce number of executions of _strip_once.
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            # _strip_once was not able to detect more tags
            break
        value = new_value
    return value
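
For reference, here is a minimal sketch of what _strip_once could look like, reusing the HTMLParser approach from the top answer (the class name _TextExtractor is just for this sketch; Django's real implementation differs in details such as entity handling):

from html.parser import HTMLParser
from io import StringIO

class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, data):
        # Collect only the text nodes; tags and comments are dropped.
        self.text.write(data)

def _strip_once(value):
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()
    return parser.text.getvalue()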

Of course, none of this is an issue if you always escape the result of strip_tags().

Update 19 March, 2015: There was a bug in Django versions before 1.4.20, 1.6.11, 1.7.7, and 1.8c1. These versions could enter an infinite loop in the strip_tags() function. The fixed version is reproduced above. More details here.

Good things to copy or use

My example code doesn't handle HTML entities - the Django and MarkupSafe packaged versions do.

My example code is pulled from the excellent MarkupSafe library for cross-site scripting prevention. It's convenient and fast (with C speedups to its native Python version). It's included in Google App Engine, and used by Jinja2 (2.7 and up), Mako, Pylons, and more. It works easily with Django templates from Django 1.7.

Django's strip_tags and other HTML utilities from a recent version are good, but I find them less convenient than MarkupSafe. They're pretty self-contained, you could copy what you need from this file.

If you need to strip almost all tags, the Bleach library is good. You can have it enforce rules like "my users can italicize things, but they can't make iframes."

Understand the properties of your tag stripper! Run fuzz tests on it! Here is the code I used to do the research for this answer.

sheepish note - The question itself is about printing to the console, but this is the top Google result for "python strip HTML from string", so that's why this answer is 99% about the web.

Hillel answered 1/11, 2013 at 15:51 Comment(3)
My "alternate last line" example code doesn't handle html entities - how bad is that?Hillel
I am only parsing a small chunk of html with no special tags, and your short version does the job very well. Thanks for sharing!Aerodynamics
re: ready_for_web = cgi.escape(no_tags) -- cgi.escape is "Deprecated since version 3.2: This function is unsafe because quote is false by default, and therefore deprecated. Use html.escape() instead." Removed in 3.8.N

I needed a way to strip tags and decode HTML entities to plain text. The following solution is based on Eloff's answer (which I couldn't use because it strips entities).

import html.parser

class HTMLTextExtractor(html.parser.HTMLParser):
    def __init__(self):
        super(HTMLTextExtractor, self).__init__()
        self.result = [ ]

    def handle_data(self, d):
        self.result.append(d)

    def get_text(self):
        return ''.join(self.result)

def html_to_text(html):
    """Converts HTML to plain text (stripping tags and converting entities).
    >>> html_to_text('<a href="#">Demo<!--...--> <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>')
    'Demo (\xac \u0394\u03b7\u03bc\u03ce)'

    "Plain text" doesn't mean result can safely be used as-is in HTML.
    >>> html_to_text('&lt;script&gt;alert("Hello");&lt;/script&gt;')
    '<script>alert("Hello");</script>'

    Always use html.escape to sanitize text before using in an HTML context!

    HTMLParser will do its best to make sense of invalid HTML.
    >>> html_to_text('x < y &lt z <!--b')
    'x < y < z '

    Named entities are handled as per HTML 5.
    >>> html_to_text('&nosuchentity; &apos; ')
    "&nosuchentity; ' "
    """
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

A quick test:

html = '<a href="#">Demo <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>'
print(repr(html_to_text(html)))

Result:

'Demo (¬ Δημώ)'

Security note: Do not confuse HTML stripping (converting HTML into plain text) with HTML sanitizing (converting plain text into HTML). This answer will remove HTML and decode entities into plain text – that does not make the result safe to use in a HTML context.

Example: &lt;script&gt;alert("Hello");&lt;/script&gt; will be converted to <script>alert("Hello");</script>, which is 100% correct behavior, but obviously not sufficient if the resulting plain text is inserted as-is into an HTML page.

The rule is not hard: Any time you insert a plain-text string into HTML output, always HTML escape it (using html.escape(s)), even if you "know" that it doesn't contain HTML (e.g. because you stripped HTML content).
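
For instance, html.escape('<script>alert("Hello");</script>') returns '&lt;script&gt;alert(&quot;Hello&quot;);&lt;/script&gt;', which can be embedded safely.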

However, the OP asked about printing the result to the console, in which case no HTML escaping is needed. Instead you may want to strip ASCII control characters, as they can trigger unwanted behavior (especially on Unix systems):

import re
text = html_to_text(untrusted_html_input)
clean_text = re.sub(r'[\0-\x1f\x7f]+', '', text)
# Alternatively, if you want to allow newlines:
# clean_text = re.sub(r'[\0-\x09\x0b-\x1f\x7f]+', '', text)
print(clean_text)
Pecan answered 15/10, 2011 at 14:19 Comment(0)

There's a simple way to do this:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out
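
For example (note how the quote tracking keeps the > inside the attribute value from ending the tag early):

print(remove_html_markup('<a href="x > y">some text</a>'))  # some text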

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (about smart debugging with Python), here's the link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)

Gingham answered 22/1, 2013 at 17:22 Comment(6)
I wonder why this answer just got downvoted. It's a simple way to solve the problem without any lib. Just pure python and it works as shown by the links.Gingham
Probably people prefer libs to give them safety. I tested your code and it passed, and I always prefer small code that I understand over using a lib and assuming that it's OK until a bug pops up. For me that's what I was looking for and again thanks. Regarding the downvotes, don't get in that mindset. People here should care about the quality and not the votes. Lately SO has become a place where everyone wants points and not knowledge.Intermix
The problem with this solution is error handling. For example, if you give <b class="o'>x</b> as input, the function outputs x. But actually this input is invalid. I think that's why people prefer libs.Tressa
It works with that input too. Just tested. Just realize that inside those libraries you'll find similar code. It's not very pythonic, I know. Looks like C or Java code. I think it's efficient and can easily ported to another language.Gingham
Simple, Pythonic and seems to work as well or better than any of the other methods discussed. It is possible it will not work for some ill formed HTML but there is no overcoming that.Fecit
If you care about performance, that out = out + c will be the worst of your nightmares. Instead, you could use a list that you’ll "".join at the end.Kissel

An lxml.html-based solution (lxml is a native library and can be more performant than a pure python solution).

To install the lxml module use pip install lxml

Remove ALL tags


from lxml import html


## from file-like object or URL
tree = html.parse(file_like_object_or_url)

## from string
tree = html.fromstring('safe <script>unsafe</script> safe')

print(tree.text_content().strip())

### OUTPUT: 'safe unsafe safe'

Remove ALL tags with pre-sanitizing HTML (dropping some tags)

from lxml import html
from lxml.html.clean import clean_html

tree = html.fromstring("""<script>dangerous</script><span class="item-summary">
                            Detailed answers to any questions you might have
                        </span>""")

## text only
print(clean_html(tree).text_content().strip())

### OUTPUT: 'Detailed answers to any questions you might have'

Also see http://lxml.de/lxmlhtml.html#cleaning-up-html for what exactly the lxml.cleaner does.

If you need more control over which specific tags should be removed before converting to text then create a custom lxml Cleaner with the desired options, e.g:

from lxml.html.clean import Cleaner

cleaner = Cleaner(page_structure=True,
                  meta=True,
                  embedded=True,
                  links=True,
                  style=True,
                  processing_instructions=True,
                  inline_style=True,
                  scripts=True,
                  javascript=True,
                  comments=True,
                  frames=True,
                  forms=True,
                  annoying_tags=True,
                  remove_unknown_tags=True,
                  safe_attrs_only=True,
                  safe_attrs=frozenset(['src','color', 'href', 'title', 'class', 'name', 'id']),
                  remove_tags=('span', 'font', 'div')
                  )
sanitized_html = cleaner.clean_html(unsafe_html)

To customize how plain text is generated you can use lxml.etree.tostring instead of text_content():

from lxml.etree import tostring

print(tostring(tree, method='text', encoding=str))

Inelegant answered 25/2, 2017 at 21:19 Comment(5)
I got AttributeError: 'HtmlElement' object has no attribute 'strip'Floweret
@aris: that was for an older version of python and lxml, updated.Inelegant
Is there an option to replace the removed tags with an empty string eg " "?Cooperate
If anyone is wondering about the Cleaner import it's "from lxml.html.clean import Cleaner"Anisaanise
@TomasK: thanks! updated to save people's time searching for itInelegant

Here is a simple solution that strips HTML tags and decodes HTML entities based on the amazingly fast lxml library:

from lxml import html

def strip_html(s):
    return str(html.fromstring(s).text_content())

strip_html('Ein <a href="">sch&ouml;ner</a> Text.')  # Output: Ein schöner Text.
Milepost answered 28/11, 2019 at 17:20 Comment(3)
As of 2020, this was the fastest and best way for striping the contents of the HTML. Plus the bonus of handling the decoding. Great for language detection!Vitek
text_content() returns lxml.etree._ElementUnicodeResult so you might have to cast it to string firstBezant
@Bezant Good point. It seems to get auto-casted to str for string operations like + and indexing []. Added a cast for good measure anyhow.Milepost

If you need to preserve HTML entities (e.g. &amp;), I added a "handle_entityref" method to Eloff's answer.

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
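
For example (with the Python 2 code above), the entity survives while the tags are removed:

html_to_text('rock &amp; roll <b>all night</b>')  # -> 'rock &amp; roll all night'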
Sociometry answered 4/12, 2012 at 13:25 Comment(0)

If you want to strip all HTML tags the easiest way I found is using BeautifulSoup:

from bs4 import BeautifulSoup  # Or from BeautifulSoup import BeautifulSoup

def stripHtmlTags(htmlTxt):
    if htmlTxt is None:
        return None
    else:
        return ''.join(BeautifulSoup(htmlTxt).findAll(text=True))

I tried the code of the accepted answer but I was getting "RuntimeError: maximum recursion depth exceeded", which didn't happen with the above block of code.

Slovakia answered 30/1, 2013 at 11:47 Comment(5)
I just tried your method because it seems cleaner, it worked, well kind of... it didn't strip input tags!Harrumph
I find that a simple application of BeautifulSoup has a problem with whitespaces: ''.join(BeautifulSoup('<em>he</em>llo<br>world').find_all(text=True)). Here the output is "helloworld", while you probably want it to be "hello world". ' '.join(BeautifulSoup('<em>he</em>llo<br>world').find_all(text=True)) does not help as it becomes "he llo world".Extenuatory
@Harrumph , sorry my ignorance , what do i put into the self argument? NameError: name 'self' is not definedFondue
@Fondue You can remove it, I assumed it's inside a class but it's not needed. I also edited the answer to remove itSlovakia

The Beautiful Soup package does this immediately for you.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
text = soup.get_text()
print(text)
Wintry answered 28/5, 2017 at 9:33 Comment(1)
From review queue: May I request you to please add some more context around your answer. Code-only answers are difficult to understand. It will help the asker and future readers both if you can add more information in your post.Fatma

Here's a solution similar to the currently accepted answer (https://mcmap.net/q/92394/-strip-html-from-strings-in-python), except that it uses the internal HTMLParser class directly (i.e. no subclassing), thereby making it significantly more terse:

from html.parser import HTMLParser

def strip_html(text):
    parts = []
    parser = HTMLParser()
    parser.handle_data = parts.append  # collect every text node directly
    parser.feed(text)
    return ''.join(parts)
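
For example:

print(strip_html('<p>a &amp; b</p>'))  # a & b  (character references are decoded by HTMLParser itself)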
Anetteaneurin answered 19/10, 2018 at 16:28 Comment(0)

Here's my solution for Python 3.

import html
import re

def html_to_txt(html_text):
    ## unescape html
    txt = html.unescape(html_text)
    tags = re.findall("<[^>]+>",txt)
    print("found tags: ")
    print(tags)
    for tag in tags:
        txt=txt.replace(tag,'')
    return txt

Not sure if it is perfect, but solved my use case and seems simple.

Stilu answered 18/2, 2019 at 13:5 Comment(0)

You can either use a different HTML parser (like lxml or Beautiful Soup) -- one that offers functions to extract just text -- or you can run a regex on your line string that strips out the tags. See the Python docs for more.

Circassia answered 15/4, 2009 at 18:31 Comment(4)
amk link is dead. Got an alternative?Hahn
The Python website has good how-tos now, here is the regex how-to: docs.python.org/howto/regexCircassia
In lxml: lxml.html.fromstring(s).text_content()Gules
Bluu's example with lxml decodes HTML entities (e.g. &amp;) to text.Bassarisk

For one project, I needed to strip not only HTML, but also CSS and JS. Thus, I made a variation of Eloff's answer:

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # needed on Python 3 (see the comments on the top answer)
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []
        self.css = False
    def handle_starttag(self, tag, attrs):
        if tag == "style" or tag=="script":
            self.css = True
    def handle_endtag(self, tag):
        if tag=="style" or tag=="script":
            self.css=False
    def handle_data(self, d):
        if not self.css:
            self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
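
For example, the contents of style and script blocks are dropped along with the tags:

print(strip_tags('<style>p {color: red}</style><p>Hello</p><script>alert(1)</script>'))  # Hello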
Getter answered 15/1, 2018 at 12:52 Comment(0)

2020 Update

Use the Mozilla Bleach library; it really lets you customize which tags and attributes to keep, and it can also filter out attributes based on their values.

Here are two cases to illustrate.

First, take some sample raw text:

raw_text = """
<p><img width="696" height="392" src="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg" class="attachment-medium_large size-medium_large wp-post-image" alt="Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC" style="float:left; margin:0 15px 15px 0;" srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w" sizes="(max-width: 696px) 100vw, 696px" />Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://news.bitcoin.com/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc/">Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC</a> appeared first on <a rel="nofollow" href="https://news.bitcoin.com">Bitcoin News</a>.</p> 
"""

1) Remove all HTML tags and attributes from the raw text

# DO NOT ALLOW any tags or any attributes
from bleach.sanitizer import Cleaner
cleaner = Cleaner(tags=[], attributes={}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

Output

Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News. 

2) Allow only the img tag with the srcset attribute

from bleach.sanitizer import Cleaner
# ALLOW ONLY img tags with the srcset attribute
cleaner = Cleaner(tags=['img'], attributes={'img': ['srcset']}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

Output

<img srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w">Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News. 
Marqueritemarques answered 19/8, 2020 at 8:43 Comment(0)

I have used Eloff's answer successfully for Python 3.1 [many thanks!].

I upgraded to Python 3.2.3, and ran into errors.

The solution, provided here thanks to the responder Thomas K, is to insert super().__init__() into the following code:

def __init__(self):
    self.reset()
    self.fed = []

... in order to make it look like this:

def __init__(self):
    super().__init__()
    self.reset()
    self.fed = []

... and it will work for Python 3.2.3.

Again, thanks to Thomas K for the fix and for Eloff's original code provided above!

Rey answered 18/6, 2012 at 15:29 Comment(0)

The HTMLParser-based solutions are all breakable if they run only once:

html_to_text('<<b>script>alert("hacked")<</b>/script>')

results in:

<script>alert("hacked")</script>

which is what you intend to prevent. If you use an HTML parser, keep stripping until no more tags are found:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
        self.containstags = False

    def handle_starttag(self, tag, attrs):
       self.containstags = True

    def handle_data(self, d):
        self.fed.append(d)

    def has_tags(self):
        return self.containstags

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    must_filtered = True
    while ( must_filtered ):
        s = MLStripper()
        s.feed(html)
        html = s.get_data()
        must_filtered = s.has_tags()
    return html
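
For example, the payload from the top of this answer now comes out with no tags left (the second pass removes the tags that the first pass exposes):

print strip_tags('<<b>script>alert("hacked")<</b>/script>')
# alert("hacked")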
Alanna answered 24/1, 2014 at 12:58 Comment(2)
If you call a function called html_to_text and you embed the text being output from that function inside html without escaping that text, then it is the lack of escaping, which is a security vulnerability, not the html_to_text function. The html_to_text function never promised you the output would be text. And inserting text into html without escaping is a potential security vulnerability regardless of whether you got the text from html_to_text or some other source.Popular
You are right in the case of a lack of escaping, but the question was to strip HTML from a given string, not to escape a given string. If the earlier answers build new HTML from their results after removing some HTML, then using those solutions is dangerous.Alanna

This is a quick fix and can be optimized further, but it will work fine. This code replaces all non-empty tags with "" and strips all HTML tags from a given input text. You can run it using ./file.py input output

#!/usr/bin/python
import sys

def replace(strng,replaceText):
    rpl = 0
    while rpl > -1:
        rpl = strng.find(replaceText)
        if rpl != -1:
            strng = strng[0:rpl] + strng[rpl + len(replaceText):]
    return strng


lessThanPos = -1
count = 0
listOf = []

try:
    #write File
    writeto = open(sys.argv[2],'w')

    #read file and store it in list
    f = open(sys.argv[1],'r')
    for readLine in f.readlines():
        listOf.append(readLine)         
    f.close()

    #remove all tags  
    for line in listOf:
        count = 0;  
        lessThanPos = -1  
        lineTemp =  line

        for char in lineTemp:

            if char == "<":
                lessThanPos = count
            if char == ">":
                if lessThanPos > -1:
                    if line[lessThanPos:count + 1] != '<>':
                        lineTemp = replace(lineTemp,line[lessThanPos:count + 1])
                        lessThanPos = -1
            count = count + 1
        lineTemp = lineTemp.replace("&lt","<")
        lineTemp = lineTemp.replace("&gt",">")                  
        writeto.write(lineTemp)  
    writeto.close() 
    print "Write To --- >" , sys.argv[2]
except:
    print "Help: invalid arguments or exception"
    print "Usage : ",sys.argv[0]," inputfile outputfile"
Iodoform answered 7/8, 2015 at 3:45 Comment(0)

A Python 3 adaptation of søren-løvborg's answer

from html.parser import HTMLParser
from html.entities import html5

class HTMLTextExtractor(HTMLParser):
    """ Adaptation of https://mcmap.net/q/92394/-strip-html-from-strings-in-python """
    def __init__(self):
        # Handle character/entity references ourselves rather than letting
        # HTMLParser decode them (convert_charrefs defaults to True on Python 3.5+).
        super().__init__(convert_charrefs=False)
        self.result = []

    def handle_data(self, d):
        self.result.append(d)

    def handle_charref(self, number):
        codepoint = int(number[1:], 16) if number[0] in (u'x', u'X') else int(number)
        self.result.append(chr(codepoint))

    def handle_entityref(self, name):
        # html.entities.html5 keys include the trailing semicolon, e.g. 'amp;'
        if name + ';' in html5:
            self.result.append(html5[name + ';'])

    def get_text(self):
        return u''.join(self.result)

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()
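
A quick check with the adapted class (expected output in the comment):

print(html_to_text('&copy; 2024 <b>Example</b> &amp; more'))  # © 2024 Example & more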
Ardeth answered 16/5, 2017 at 8:2 Comment(0)

You can write your own function:

def StripTags(text):
     finished = 0
     while not finished:
         finished = 1
         start = text.find("<")
         if start >= 0:
             stop = text[start:].find(">")
             if stop >= 0:
                 text = text[:start] + text[start+stop+1:]
                 finished = 0
     return text
Apraxia answered 4/10, 2010 at 15:26 Comment(3)
Does appending to strings create a new copy of the string?Aguie
@Nerdling - Yes it does, which can lead to some rather impressive inefficiencies in frequently used functions (or, for that matter, infrequently used functions that act on large blobs of text.) See this page for for detail. :DNomanomad
Does it test against quoted strings? No.Intermix

I'm parsing Github readmes and I find that the following really works well:

import re
import lxml.html

def strip_markdown(x):
    links_sub = re.sub(r'\[(.+)\]\([^\)]+\)', r'\1', x)
    bold_sub = re.sub(r'\*\*([^*]+)\*\*', r'\1', links_sub)
    emph_sub = re.sub(r'\*([^*]+)\*', r'\1', bold_sub)
    return emph_sub

def strip_html(x):
    return lxml.html.fromstring(x).text_content() if x else ''

And then

readme = """<img src="https://raw.githubusercontent.com/kootenpv/sky/master/resources/skylogo.png" />

            sky is a web scraping framework, implemented with the latest python versions in mind (3.4+). 
            It uses the asynchronous `asyncio` framework, as well as many popular modules 
            and extensions.

            Most importantly, it aims for **next generation** web crawling where machine intelligence 
            is used to speed up the development/maintainance/reliability of crawling.

            It mainly does this by considering the user to be interested in content 
            from *domains*, not just a collection of *single pages*
            ([templating approach](#templating-approach))."""

strip_markdown(strip_html(readme))

Removes all markdown and html correctly.

Superfluity answered 2/1, 2016 at 10:10 Comment(0)

Using BeautifulSoup, html2text or the code from @Eloff, most of the time some HTML elements or JavaScript code still remain...

So you can use a combination of these libraries and delete markdown formatting (Python 3):

import re
import html2text
from bs4 import BeautifulSoup
def html2Text(html):
    def removeMarkdown(text):
        for current in ["^[ #*]{2,30}", "^[ ]{0,30}\d\\\.", "^[ ]{0,30}\d\."]:
            markdown = re.compile(current, flags=re.MULTILINE)
            text = markdown.sub(" ", text)
        return text
    def removeAngular(text):
        angular = re.compile("[{][|].{2,40}[|][}]|[{][*].{2,40}[*][}]|[{][{].{2,40}[}][}]|\[\[.{2,40}\]\]")
        text = angular.sub(" ", text)
        return text
    h = html2text.HTML2Text()
    h.images_to_alt = True
    h.ignore_links = True
    h.ignore_emphasis = False
    h.skip_internal_links = True
    text = h.handle(html)
    soup = BeautifulSoup(text, "html.parser")
    text = soup.text
    text = removeAngular(text)
    text = removeMarkdown(text)
    return text

It works well for me but it can be enhanced, of course...

Tratner answered 27/12, 2017 at 14:41 Comment(0)

Simple code! This will remove all kinds of tags and everything inside the angle brackets.

def rm(s):
    start=False
    end=False
    s=' '+s
    for i in range(len(s)-1):
        if i<len(s):
            if start!=False:
                if s[i]=='>':
                    end=i
                    s=s[:start]+s[end+1:]
                    start=end=False
            else:
                if s[i]=='<':
                    start=i
    if s.count('<') > 0:
        return rm(s)
    else:
        s=s.replace('&nbsp;', ' ')
        return s

But it won't give the full result if the text itself contains < or > symbols.

Scute answered 8/4, 2019 at 9:45 Comment(0)
# This is a regex solution.
import re
def removeHtml(html):
  if not html: return html
  # Remove comments first
  innerText = re.compile('<!--[\s\S]*?-->').sub('',html)
  while innerText.find('>')>=0: # Loop through nested Tags
    text = re.compile('<[^<>]+?>').sub('',innerText)
    if text == innerText:
      break
    innerText = text

  return innerText.strip()
Tris answered 8/12, 2019 at 10:35 Comment(0)

This is how I do it, but I have no idea what I am doing. I take data from an HTML table by stripping out the HTML tags.

This takes the string "name" and returns the string "name1" without the HTML tags.

x = 0
anglebrackets = 0
name1 = ""
while x < len(name):
    
    if name[x] == "<":
        anglebrackets = anglebrackets + 1
    if name[x] == ">":
        anglebrackets = anglebrackets - 1
    if anglebrackets == 0:
        if name[x] != ">":
            name1 = name1 + name[x]
    x = x + 1
Anglomania answered 16/6, 2021 at 15:5 Comment(0)
import re

def remove(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)
Boaz answered 6/3, 2022 at 7:28 Comment(0)

nh3 should also work:

>>> import nh3
>>> print(nh3.clean("<s><b>text</s></b>", tags={"b"}))
<b>text</b>
>>> print(nh3.clean("<s><b>text</s></b>", tags=set()))
text

Performance should be good, as it's a wrapper for a Rust library.

Ribband answered 27/1 at 8:5 Comment(0)

This method works flawlessly for me and requires no additional installations:

import re
import htmlentitydefs

def convertentity(m):
    if m.group(1)=='#':
        try:
            return unichr(int(m.group(2)))
        except ValueError:
            return '&#%s;' % m.group(2)
    try:
        return htmlentitydefs.entitydefs[m.group(2)]
    except KeyError:
        return '&%s;' % m.group(2)

def converthtml(s):
    return re.sub(r'&(#?)(.+?);',convertentity,s)

html = converthtml(html)
html = html.replace("&nbsp;", " ") ## Get rid of the remnants of certain formatting (subscript, superscript, etc).
Altocumulus answered 2/2, 2011 at 1:23 Comment(1)
This decodes HTML entities to plain text, but obviously doesn't actually strip any tags, which was the original question. (Also, the second try-except block needs to be de-indented for the code to even do as much).Bassarisk
