Is there an easy way to use the Python library html5lib to convert something like this:
<p>Hello World. Greetings from <strong>Mars.</strong></p>
to
Hello World. Greetings from Mars.
With lxml as the parser backend:
import html5lib
body = "<p>Hello World. Greetings from <strong>Mars.</strong></p>"
doc = html5lib.parse(body, treebuilder="lxml")
# html5lib.parse() returns an lxml ElementTree here, which has no text_content()
print(doc.xpath("string()"))
To be honest, this is actually cheating, as it is equivalent to the following (only the relevant parts are changed):
from lxml import html
doc = html.fromstring(body)
print(doc.text_content())
If you really want the html5lib parsing engine:
from lxml.html import html5parser
doc = html5parser.fromstring(body)
print(doc.xpath("string()"))
doc.xpath('string()'). Also, as a side note, that is essentially what the lxml.html.HtmlMixin class does for the call to text_content() that @JasonChrista mentioned.
– Fortuneteller
text_content() will only work in the case of lxml.html, but not for lxml.html.html5parser. I'm not sure if it is a bug or not, but the latter does not use lxml.html.HtmlMixin, where text_content() is defined. Compare these two: lxml.html.fromstring('<p>foo</p>').text_content() versus lxml.html.html5parser.fromstring('<p>foo</p>').text_content()
– Fortuneteller
I use html2text, which converts it to plain text (in Markdown format).
from html2text import HTML2Text
handler = HTML2Text()
html = """Lorem <i>ipsum</i> dolor sit amet, <b>consectetur adipiscing</b> elit.<br>
<br><h1>Nullam eget \r\ngravida elit</h1>Integer iaculis elit at risus feugiat:
<br><br><ul><li>Egestas non quis \r\nlorem.</li><li>Nam id lobortis felis.
</li><li>Sed tincidunt nulla.</li></ul>
At massa tempus, quis \r\nvehicula odio laoreet.<br>"""
text = handler.handle(html)
>>> text
u'Lorem _ipsum_ dolor sit amet, **consectetur adipiscing** elit.\n\n \n\n# Nullam eget gravida elit\n\nInteger iaculis elit at risus feugiat:\n\n \n\n * Egestas non quis lorem.\n * Nam id lobortis felis.\n * Sed tincidunt nulla.\nAt massa tempus, quis vehicula odio laoreet.\n\n'
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
Traceback (most recent call last):
  File "D:/test/scraping/test.py", line 1, in <module>
    from html2text import HTML2Text
  File "D:/test/scraping\html2text.py", line 5, in <module>
    print doc.text_content()
AttributeError: 'lxml.etree._ElementTree' object has no attribute 'text_content'
>>>
– Muriah
You can concatenate the result of the itertext() method.
Example:
import html5lib
d = html5lib.parseFragment(
'<p>Hello World. Greetings from <strong>Mars.</strong></p>')
s = ''.join(d.itertext())
print(s)
Output:
Hello World. Greetings from Mars.
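To see why joining itertext() works: html5lib's default tree builder produces xml.etree.ElementTree elements, and itertext() yields every text chunk in document order, including text nested inside child elements. Below is a minimal stdlib-only sketch of that behaviour. It assumes ElementTree's XML parser as a stand-in for html5lib.parseFragment(), which only works because this particular snippet happens to be well-formed XML; real-world HTML should still go through html5lib.

```python
import xml.etree.ElementTree as ET

markup = "<p>Hello World. Greetings from <strong>Mars.</strong></p>"

# Stand-in for html5lib.parseFragment(); valid only for well-formed markup.
root = ET.fromstring(markup)

# itertext() walks the tree in document order and yields each text chunk,
# including the text inside nested elements such as <strong>.
chunks = list(root.itertext())
print(chunks)

# Joining the chunks gives the plain text of the whole fragment.
text = "".join(chunks)
print(text)
```

The intermediate list shows that the text arrives in separate pieces, one per text node, so they need to be joined to get a single string.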