How can I parse HTML with html5lib, and query the parsed HTML with XPath?

7

20

I am trying to use html5lib to parse an HTML page into something I can query with XPath. html5lib has close to zero documentation, and I've spent too much time trying to figure this problem out. The ultimate goal is to pull out the second row of a table:

<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>

So let's try it:

>>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml')
>>> doc
<lxml.etree._ElementTree object at 0x1a1c290>

That looks good. Let's see what else we have:

>>> root = doc.getroot()
>>> print(lxml.etree.tostring(root))
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html>

LOL WUT?

Seriously. I was planning on using some XPath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.

Lilla answered 1/4, 2010 at 4:4 Comment(1)
Looks like it qualified the elements with a namespace and inserted some "implied" elements you didn't specify.Nan
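The comment above can be checked without html5lib: the serialized output in the question is well-formed XML, so the standard library's xml.etree.ElementTree can load it. This is a minimal sketch, assuming an arbitrary prefix `h` (a name chosen here, not something html5lib defines) mapped to the XHTML namespace. An unqualified query finds nothing, while the namespace-qualified one reaches the second row.

```python
import xml.etree.ElementTree as ET

# The serialized tree printed in the question, copied verbatim.
serialized = (
    '<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/>'
    '<html:body><html:table><html:tbody>'
    '<html:tr><html:td>Header</html:td></html:tr>'
    '<html:tr><html:td>Want This</html:td></html:tr>'
    '</html:tbody></html:table></html:body></html:html>'
)
root = ET.fromstring(serialized)

# "h" is an arbitrary prefix chosen for this sketch; only the URI matters.
ns = {"h": "http://www.w3.org/1999/xhtml"}

unqualified = root.findall(".//tr")      # no namespace given: matches nothing
qualified = root.findall(".//h:tr", ns)  # namespace-qualified: matches both rows

# Join all text inside the second row.
second_row_text = "".join(qualified[1].itertext())
print(len(unqualified), len(qualified), second_row_text)
```

The implied `tbody` the comment mentions does not get in the way here, because `.//` matches descendants at any depth.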
26

Lack of documentation is a good reason to avoid a library IMO, no matter how cool it is. Are you wedded to using html5lib? Have you looked at lxml.html?

Here is a way to do this with lxml:

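from lxml import html

# The markup from the question
text = '<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>'
tree = html.fromstring(text)
[td.text for td in tree.xpath("//td")]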

Result:

['Header', 'Want This']
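
The list above contains every cell. To get only the second row, as the question asks, a positional predicate can do the selection inside the XPath expression itself. This is a minimal sketch, assuming lxml is installed and reusing the question's markup; the variable names are illustrative.

```python
from lxml import html

text = "<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>"
tree = html.fromstring(text)

# XPath positions are 1-based: tr[2] is the second <tr> child of its parent.
second_row = tree.xpath("//tr[2]")[0]

# text_content() collects the text of the row and all of its descendants.
second_row_text = second_row.text_content()
print(second_row_text)
```

Using `//tr` rather than a full path such as `/html/body/table/tr` sidesteps the question of whether a parser inserted an implied `tbody`, as html5lib did in the question's output.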
Baumann answered 1/4, 2010 at 5:13 Comment(0)
19

What you want to use is the namespaceHTMLElements argument, which for some reason defaults to True.

import html5lib
import lxml.html

doc = html5lib.parse('''<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>
''', treebuilder='lxml', namespaceHTMLElements=False)

# tostring() returns bytes on Python 3
print(lxml.html.tostring(doc))

It's probably still easier to use lxml.html however.

Astrometry answered 22/2, 2011 at 2:3 Comment(1)
It defaults to True because the HTML specification defines those elements to be in the HTML namespace — that existing Python tooling requires them not to be is the reason the option exists.Lederhosen
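
To connect this back to the original XPath question, here is a hedged sketch of both routes: keeping the default namespaces and passing a prefix map to lxml's `xpath()`, or turning them off with `namespaceHTMLElements=False` and querying with plain names. It assumes both html5lib and lxml are installed; the prefix `h` is an arbitrary choice for this example.

```python
import html5lib

markup = "<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>"

# Default behaviour: elements are placed in the XHTML namespace,
# so the XPath expression needs a prefix bound to that namespace.
namespaced = html5lib.parse(markup, treebuilder="lxml")
ns = {"h": "http://www.w3.org/1999/xhtml"}  # "h" is an arbitrary prefix
with_ns = namespaced.xpath("//h:tr[2]/h:td/text()", namespaces=ns)

# With the option from this answer: plain element names work.
plain = html5lib.parse(markup, treebuilder="lxml", namespaceHTMLElements=False)
without_ns = plain.xpath("//tr[2]/td/text()")

print(with_ns, without_ns)
```

Both queries select the text of the cell in the second row. Which route is preferable depends on whether the rest of the code expects namespaced elements.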
4

I always recommend trying the lxml library. It's blazingly fast and has many features.

It also supports the html5lib parser if you need that: html5parser

>>> from lxml.html import fromstring, tostring

>>> html = """
... <html>
...     <table>
...         <tr><td>Header</td></tr>
...         <tr><td>Want This</td></tr>
...     </table>
... </html>
... """
>>> doc = fromstring(html)
>>> tr = doc.cssselect('table tr')[1]
>>> print(tostring(tr, encoding='unicode', with_tail=False))
<tr><td>Want This</td></tr>
Tinworks answered 1/4, 2010 at 5:17 Comment(1)
This is how I'd do it, except I'd use "print doc.cssselect('tr')[1].text_content()" to get at the contents of the second row, rather than have lxml show the HTML.Otherworld
1

I believe you can run CSS selector queries on lxml objects, like so:

elements = root.cssselect('div.content')
data = elements[0].text
Gabel answered 1/4, 2010 at 4:33 Comment(0)
1

With BeautifulSoup, you can do that with

>>> soup = BeautifulSoup.BeautifulSoup('<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>')
>>> soup.findAll('td')[1].string
u'Want This'
>>> soup.findAll('tr')[1].td.string
u'Want This'

(Obviously that's a really crude example, but ya.)
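
The session above uses the BeautifulSoup 3 API (the `BeautifulSoup` module, `findAll`, and Python 2 `u''` strings). A rough equivalent with the current `bs4` package might look like the sketch below, assuming `bs4` is installed. It uses the standard library's `html.parser` backend; passing `"html5lib"` instead is an option when that library is available.

```python
from bs4 import BeautifulSoup

markup = "<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>"

# "html.parser" is the standard-library backend, so no extra parser is needed.
soup = BeautifulSoup(markup, "html.parser")

# find_all() returns matches in document order; index 1 is the second row.
second_row_text = soup.find_all("tr")[1].get_text(strip=True)
print(second_row_text)
```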

Gotthard answered 1/4, 2010 at 4:36 Comment(0)
1

Since html5lib (by default) creates trees that contain (correct) namespace information, you have to specify (the right) namespaces in your queries as well.

Example with an XPath query (ElementTree supports a subset of XPath):

import html5lib
inp='''<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>'''
xns = '{http://www.w3.org/1999/xhtml}'
d = html5lib.parse(inp)
s = d.findall('.//{}td'.format(xns))[-1].text
print(s)

Output:

Want This

The same result without XPath:

s = d.find(xns+'body').find(xns+'table').find(xns+'tbody') \
     .findall(xns+'tr')[-1].find(xns+'td').text

Alternatively, you can also tell html5lib to avoid adding any namespace information during parsing:

d = html5lib.parse(inp, namespaceHTMLElements=False)
s = d.findall('.//td')[-1].text
print(s)

Output:

Want This
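
A third option, when the namespace should stay in the tree but not clutter the query, is ElementTree's `{*}` wildcard, available from Python 3.8. The sketch below uses only the standard library. The hand-written XHTML string is a stand-in for html5lib's default output (an assumption made so the example runs without html5lib installed); the same `findall` call should apply to the `d` returned by `html5lib.parse(inp)`.

```python
import xml.etree.ElementTree as ET

# Stand-in for html5lib's default output: a small tree in the XHTML namespace.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head/><body><table><tbody>'
    '<tr><td>Header</td></tr><tr><td>Want This</td></tr>'
    '</tbody></table></body></html>'
)
d = ET.fromstring(xhtml)

# {*} matches the tag in any namespace, or in none (Python 3.8+).
cells = d.findall(".//{*}td")
last_cell_text = cells[-1].text
print(last_cell_text)
```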
Scrobiculate answered 19/4, 2017 at 17:8 Comment(0)
-5

Try using jQuery, and you can retrieve all elements. Alternatively, you can put an id on your row and pull it out.

1) ... ...

$("td")[1].innerHTML will be what you want

2) ... ...

$("#blah").text() will be what you want

Ambulator answered 1/4, 2010 at 4:30 Comment(1)
I think the request was for a Python solution.Otherworld
