Use html5lib to convert an HTML fragment to plain text

Is there an easy way to use the Python library html5lib to convert something like this:

<p>Hello World. Greetings from <strong>Mars.</strong></p>

to

Hello World. Greetings from Mars.
Alguire answered 31/12, 2011 at 0:19 Comment(1)
If you are not stuck with the poorly documented html5lib, #2558556 will help. - Onida
Answer (score 12)

With lxml as the parser backend:

import html5lib

body = "<p>Hello World. Greetings from <strong>Mars.</strong></p>"
doc = html5lib.parse(body, treebuilder="lxml")
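# parse() returns an lxml.etree._ElementTree, which has no text_content(),
# so join the text nodes of the root element instead
print("".join(doc.getroot().itertext()))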

To be honest, this is actually cheating, as it is equivalent to the following (only the relevant parts are changed):

from lxml import html
doc = html.fromstring(body)
print(doc.text_content())

If you really want the html5lib parsing engine:

from lxml.html import html5parser
doc = html5parser.fromstring(body)
print(doc.xpath("string()"))
Silma answered 31/12, 2011 at 0:37 Comment(5)
Looks like you can also call doc.text_content() to accomplish this. - Alguire
@Niklas you can write that more briefly, without the join, as doc.xpath('string()'). As a side note, that is essentially what the lxml.html.HtmlMixin class does for the text_content() call that @JasonChrista mentioned. - Fortuneteller
@aculich: Thanks for the information. Could come in handy some time :) I'm updating the question. - Silma
@JasonChrista note that text_content() only works with lxml.html, not with lxml.html.html5parser. I'm not sure whether it is a bug, but the latter does not use lxml.html.HtmlMixin, where text_content() is defined. Compare lxml.html.fromstring('<p>foo</p>').text_content() with lxml.html.html5parser.fromstring('<p>foo</p>').text_content(). - Fortuneteller
The html5lib one doesn't actually work, as aculich says. It also doesn't add whitespace, for example converting "a<br>b<p>c" to "a\nb\n\nc". - Nice
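On the whitespace point in the last comment: one way to get line breaks is to walk the parsed tree yourself instead of just concatenating the text nodes. Below is a minimal sketch, not a complete HTML renderer. It assumes the default html5lib tree builder, which produces plain xml.etree.ElementTree elements with namespaced tags. The names BREAKS, tree_to_text and html_to_text are made up for this example, and the tag list is an arbitrary small selection.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping for this sketch: tags that force a line break,
# and how large the break should be.
BREAKS = {
    "br": "\n", "li": "\n", "tr": "\n",
    "p": "\n\n", "div": "\n\n", "ul": "\n\n", "ol": "\n\n",
    "blockquote": "\n\n", "pre": "\n\n", "table": "\n\n",
    "h1": "\n\n", "h2": "\n\n", "h3": "\n\n",
    "h4": "\n\n", "h5": "\n\n", "h6": "\n\n",
}


def _local_name(tag):
    # html5lib namespaces its tags, e.g. "{http://www.w3.org/1999/xhtml}p"
    return tag.rsplit("}", 1)[-1].lower()


def _walk(elem, out):
    # Comments and processing instructions have a non-string tag; skip their text
    if not isinstance(elem.tag, str):
        return
    name = _local_name(elem.tag)
    sep = BREAKS.get(name, "")
    out.append(sep)                # break before the element
    if elem.text:
        out.append(elem.text)
    for child in elem:
        _walk(child, out)
        if child.tail:             # text that follows the child element
            out.append(child.tail)
    if name != "br":               # <br> is a single break, not a pair
        out.append(sep)            # break after the element


def tree_to_text(root):
    """Flatten an ElementTree element to text with basic line breaks."""
    out = []
    _walk(root, out)
    # Collapse runs of three or more newlines into one blank line
    return re.sub(r"\n{3,}", "\n\n", "".join(out)).strip()


def html_to_text(html):
    """Parse an HTML fragment with html5lib, then flatten it."""
    import html5lib  # third-party; imported lazily so tree_to_text works without it
    return tree_to_text(html5lib.parseFragment(html))


if __name__ == "__main__":
    demo = ET.fromstring("<div>a<br/>b<p>c</p></div>")
    print(repr(tree_to_text(demo)))  # 'a\nb\n\nc'
```

With html5lib installed, html_to_text('a<br>b<p>c') is intended to give the same result as the demo. Whitespace inside text nodes is left untouched, so source indentation and \r\n sequences will survive unless you normalise them separately.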
Answer (score 4)

I use html2text, which converts it to plain text (in Markdown format).

from html2text import HTML2Text
handler = HTML2Text()

html = """Lorem <i>ipsum</i> dolor sit amet, <b>consectetur adipiscing</b> elit.<br>
          <br><h1>Nullam eget \r\ngravida elit</h1>Integer iaculis elit at risus feugiat:
          <br><br><ul><li>Egestas non quis \r\nlorem.</li><li>Nam id lobortis felis.
          </li><li>Sed tincidunt nulla.</li></ul>
          At massa tempus, quis \r\nvehicula odio laoreet.<br>"""
text = handler.handle(html)

>>> text
u'Lorem _ipsum_ dolor sit amet, **consectetur adipiscing** elit.\n\n  \n\n# Nullam eget gravida elit\n\nInteger iaculis elit at risus feugiat:\n\n  \n\n  * Egestas non quis lorem.\n  * Nam id lobortis felis.\n  * Sed tincidunt nulla.\nAt massa tempus, quis vehicula odio laoreet.\n\n'
Castroprauxel answered 11/11, 2013 at 10:58 Comment(7)
Just tried it again, still working for me. What's the issue? - Castroprauxel
Thanks for the follow-up. This is what I got:
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
Traceback (most recent call last):
  File "D:/test/scraping/test.py", line 1, in <module>
    from html2text import HTML2Text
  File "D:/test/scraping\html2text.py", line 5, in <module>
    print doc.text_content()
AttributeError: 'lxml.etree._ElementTree' object has no attribute 'text_content'
>>>
- Muriah
Unfortunately, the other code on this page didn't work either, with the same or a similar error. Sorry! I don't know how to fix the fault, or how to undo the downvote. - Muriah
Oh dear, don't worry. I think this issue is probably a bug that's not really related to this question. You could consider reinstalling the latest versions of all the libraries (or try it in a fresh virtualenv), otherwise maybe the issue belongs as a separate question. Good luck! - Castroprauxel
It works now. I reinstalled Python 2.7. I noticed that I had accidentally changed some of the library .py files, and they were saved automatically (using PyScripter debug sessions). After the reinstall I have a fresh set, and the code works as expected. I really wish I could void the -1. Can you help me? - Muriah
Why is this downvoted? Using html2text is a perfectly good suggestion. - Lsd
It was a mistake on the part of a commenter above. Feel free to vote me up if you like :) - Castroprauxel
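The AttributeError in the traceback above comes from calling text_content() on the lxml.etree._ElementTree that html5lib.parse(..., treebuilder="lxml") returns. Only lxml.html elements define that method, as noted in the comments on the first answer. A small helper along these lines should cope with either kind of object. It is only a sketch: text_of is a made-up name, and it relies on getroot() and itertext(), which both lxml and the standard library's xml.etree.ElementTree provide. The demo uses the standard library so it runs without lxml.

```python
import xml.etree.ElementTree as ET


def text_of(node):
    """Return the concatenated text of an element or of a whole tree."""
    # Tree objects (such as lxml.etree._ElementTree) wrap a root element
    if hasattr(node, "getroot"):
        node = node.getroot()
    # lxml.html elements provide text_content(); use it when it exists
    if hasattr(node, "text_content"):
        return node.text_content()
    # Fallback that works for any ElementTree-style element
    return "".join(node.itertext())


if __name__ == "__main__":
    root = ET.fromstring(
        "<p>Hello World. Greetings from <strong>Mars.</strong></p>")
    print(text_of(root))                  # an element
    print(text_of(ET.ElementTree(root)))  # a tree wrapping the same element
```

Note also that the traceback shows a local file named html2text.py being imported in place of the library, so renaming that file is worth checking before anything else.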
Answer (score 1)

You can concatenate the result of the itertext() method.

Example:

import html5lib
d = html5lib.parseFragment(
        '<p>Hello World. Greetings from <strong>Mars.</strong></p>')
s = ''.join(d.itertext())
print(s)

Output:

Hello World. Greetings from Mars.
Booze answered 19/4, 2017 at 16:34 Comment(0)
