Use html5lib to convert an HTML fragment to plain text

Asked 31/12, 2011 at 0:19 Answered 19/4, 2017 at 16:34

Is there an easy way to use the Python library html5lib to convert something like this:

<p>Hello World. Greetings from <strong>Mars.</strong></p>

into this?

Hello World. Greetings from Mars.

Alguire answered 31/12, 2011 at 0:19 Comment(1)

If you are not stuck with poorly documented html5lib, #2558556 will help – Onida 31/12, 2011 at 0:31

With lxml as the parser backend:

import html5lib

body = "<p>Hello World. Greetings from <strong>Mars.</strong></p>"
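# With treebuilder="lxml", html5lib.parse() returns an lxml ElementTree, which has
# no text_content() method, so take the string value of the root element instead.
doc = html5lib.parse(body, treebuilder="lxml")
print(doc.getroot().xpath("string()"))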

To be honest, this is close to cheating, since for this input it gives the same result as using lxml's own HTML parser directly (only the relevant parts are changed):

from lxml import html
doc = html.fromstring(body)
print(doc.text_content())

If you really want the html5lib parsing engine:

from lxml.html import html5parser
doc = html5parser.fromstring(body)
print(doc.xpath("string()"))

Silma answered 31/12, 2011 at 0:37 Comment(5)

Looks like you can call doc.text_content() to also accomplish this. – Alguire 31/12, 2011 at 0:59

@Niklas you can write that a shorter way without the join by just doing doc.xpath('string()'). Also, as a side-note, that is essentially what the lxml.html.HtmlMixin class does for the call to text_content() that @JasonChrista mentioned. – Fortuneteller 31/12, 2011 at 1:54

@aculich: Thanks for the information. Could come in handy some time :) I'm updating the question. – Silma 31/12, 2011 at 1:55

@JasonChrista note that text_content() only works with lxml.html, not with lxml.html.html5parser. I'm not sure whether it is a bug, but the latter does not use lxml.html.HtmlMixin, where text_content() is defined. Compare these two: lxml.html.fromstring('foo').text_content() versus lxml.html.html5parser.fromstring('foo').text_content(). – Fortuneteller 31/12, 2011 at 2:0

The html5lib one doesn't actually work, as aculich says. It also doesn't handle adding whitespace, like converting "a bc" to "a\nb\n\nc". – Nice 16/4, 2014 at 15:49
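On the whitespace point raised in the last comment: here is a minimal sketch that does not use html5lib at all. It swaps in the standard library's html.parser and emits a newline after a small, hand-picked set of block-level tags. The names TextExtractor, html_to_text and BLOCK_TAGS are made up for illustration, and the tag list is an assumption, not a complete set.

```python
from html.parser import HTMLParser

# Hypothetical, incomplete list of tags whose end should produce a line break.
BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect text nodes, inserting newlines at <br> and after block tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> has no end tag, so the break is emitted when it opens.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the text of an HTML fragment with simple line breaks."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()


print(html_to_text("<p>Hello World. Greetings from <strong>Mars.</strong></p>"))
```

This is not a spec-compliant HTML5 parser, so badly broken markup may come out differently than it would through html5lib.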

I use html2text, which converts it to plain text (in Markdown format).

from html2text import HTML2Text
handler = HTML2Text()

html = """Lorem <i>ipsum</i> dolor sit amet, <b>consectetur adipiscing</b> elit.<br>
          <br><h1>Nullam eget \r\ngravida elit</h1>Integer iaculis elit at risus feugiat:
          <br><br><ul><li>Egestas non quis \r\nlorem.</li><li>Nam id lobortis felis.
          </li><li>Sed tincidunt nulla.</li></ul>
          At massa tempus, quis \r\nvehicula odio laoreet.<br>"""
text = handler.handle(html)

>>> text
u'Lorem _ipsum_ dolor sit amet, **consectetur adipiscing** elit.\n\n  \n\n# Nullam eget gravida elit\n\nInteger iaculis elit at risus feugiat:\n\n  \n\n  * Egestas non quis lorem.\n  * Nam id lobortis felis.\n  * Sed tincidunt nulla.\nAt massa tempus, quis vehicula odio laoreet.\n\n'

Castroprauxel answered 11/11, 2013 at 10:58 Comment(7)

Just tried it again, still working for me. What's the issue? – Castroprauxel 17/5, 2014 at 12:51

Thanks for the follow-up. This is what I got:

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
Traceback (most recent call last):
  File "D:/test/scraping/test.py", line 1, in <module>
    from html2text import HTML2Text
  File "D:/test/scraping\html2text.py", line 5, in <module>
    print doc.text_content()
AttributeError: 'lxml.etree._ElementTree' object has no attribute 'text_content'
>>>

– Muriah 17/5, 2014 at 18:3

Unfortunately, the other code on this page didn't work either, with the same or a similar error. Sorry! I don't know how to fix the fault, or how to take back my downvote. – Muriah 17/5, 2014 at 18:6

Oh dear, don't worry. I think this issue is probably a bug that's not really related to this question. You could consider reinstalling the latest versions of all the libraries (or try it in a fresh virtualenv), otherwise maybe the issue belongs as a separate question. Good luck! – Castroprauxel 19/5, 2014 at 11:38

It works now. I reinstalled Python 2.7. I noticed that I had accidentally changed some of the library .py files, and the changes were saved automatically (using PyScripter debug sessions). After the reinstall I have a fresh set, and the code works as expected. I really wish I could void the -1. Can you help me? – Muriah 20/5, 2014 at 13:9

Why is this downvoted? Using html2text is a perfectly good suggestion. – Lsd 16/10, 2014 at 17:13

It was a mistake on the part of a commenter above - feel free to vote me up if you like :) – Castroprauxel 17/10, 2014 at 7:59

You can concatenate the result of the itertext() method.

Example:

import html5lib
d = html5lib.parseFragment(
        '<p>Hello World. Greetings from <strong>Mars.</strong></p>')
s = ''.join(d.itertext())
print(s)

Output:

Hello World. Greetings from Mars.

Booze answered 19/4, 2017 at 16:34 Comment(0)
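A note on why the itertext() approach works: with its default tree builder, html5lib hands back xml.etree.ElementTree elements, so itertext() is the ordinary ElementTree method that yields every text node in document order. The sketch below uses only the standard library on a well-formed stand-in fragment to show that, and also collapses runs of whitespace. The helper name element_text is made up for illustration.

```python
import xml.etree.ElementTree as ET


def element_text(element):
    """Join all text nodes under element and collapse runs of whitespace."""
    raw = "".join(element.itertext())
    return " ".join(raw.split())


# Well-formed stand-in for the fragment; real-world HTML would be parsed
# with html5lib instead of ET.fromstring().
fragment = ET.fromstring(
    "<div><p>Hello   World.\n Greetings from <strong>Mars.</strong></p></div>")

print(element_text(fragment))
```

With html5lib installed, the same helper should accept the result of html5lib.parseFragment(...) directly, since it only relies on itertext().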
