Convert lxml _Element to HtmlElement

For various reasons I'm trying to switch from lxml.html.fromstring() to lxml.html.html5parser.document_fromstring(). The big difference between the two is that the first returns an lxml.html.HtmlElement, and the second returns an lxml.etree._Element.

Mostly this is OK, but when I try to run my code with the _Element object, it crashes, saying:

AttributeError: 'lxml.etree._Element' object has no attribute 'rewrite_links'

Which makes sense. My question is: what's the best way to deal with this problem? I have a lot of code that expects HtmlElements, so I think the best solution would be to convert the _Element objects to HtmlElements. I'm not sure that's possible, though.

Update

One terrible solution looks like this:

from lxml.html import fromstring, tostring
from lxml.html import html5parser

e = html5parser.fromstring(text)
html_element = fromstring(tostring(e))

Obviously, that's pretty brute force, but it does work. I'm able to get an HtmlElement that's been parsed by the html5parser, which is what I'm after.
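The round trip above can be wrapped in a small helper. This is only a minimal sketch, not an lxml API: to_html_element is a hypothetical name, and the sample element comes from etree.fromstring (a stand-in for parser output) so that the snippet runs without html5lib. With real html5parser output the tags are in the XHTML namespace by default, so the serialized markup may carry namespace prefixes.

```python
from lxml import etree
from lxml.html import HtmlElement, fromstring, tostring


def to_html_element(element):
    """Serialize any lxml element and re-parse it with lxml.html.

    Hypothetical helper: it costs one serialize/parse cycle and returns
    a new tree, so references into the old tree are not carried over.
    """
    return fromstring(tostring(element))


# Stand-in for a parsed document: a plain etree element, not an HtmlElement.
plain = etree.fromstring('<div><a href="http://localhost">hello</a></div>')
converted = to_html_element(plain)

print(type(plain).__name__)
print(isinstance(converted, HtmlElement))

# HtmlElement-only methods are now available on the converted element.
converted.rewrite_links(lambda link: "http://newlink.com")
print(tostring(converted))
```

The cost is one extra serialize and parse per document, which is the "brute force" part; the benefit is that every downstream call sees an ordinary HtmlElement.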

The other option would be to work out how to do the rewrite_links calls and xpath queries that I rely on directly on _Elements, but they don't seem to offer those methods (which, again, makes sense!).
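On that second option: _Element does provide xpath(); what changes with html5parser output is that, by default, the tags live in the XHTML namespace, so an unprefixed query such as //a matches nothing. The sketch below illustrates this on a namespaced tree built with etree.fromstring (an assumed stand-in for html5parser output, so it runs without html5lib) and rewrites href attributes by hand. rewrite_hrefs is a hypothetical helper that covers only href, not everything rewrite_links handles.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Stand-in for html5parser output: every tag is in the XHTML namespace.
doc = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<a href="http://localhost">hello</a></body></html>'
)

# An unprefixed query looks for tags in no namespace and finds nothing.
print(doc.xpath("//a"))

# Binding a prefix to the XHTML namespace makes the same query match.
links = doc.xpath("//h:a", namespaces={"h": XHTML_NS})
print(len(links))


def rewrite_hrefs(root, repl):
    """Hypothetical, reduced stand-in for rewrite_links: href only."""
    for el in root.xpath("descendant-or-self::*[@href]"):
        el.set("href", repl(el.get("href")))


rewrite_hrefs(doc, lambda link: "http://newlink.com")
print(links[0].get("href"))
```

If un-namespaced tags are preferred, creating the parser with html5parser.HTMLParser(namespaceHTMLElements=False) and passing it to the parse call is a commonly used option; treat that as something to verify against the installed lxml and html5lib versions.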

Hege answered 14/10, 2015 at 20:4 Comment(1)
Have you tried parsing using etree.HTML()? - Acadian

One solution that is less CPU-intensive than brute force is to create an almost empty HtmlElement from the root tree and then append the _Element's children to it.

from lxml.html import fromstring, tostring
from lxml.html import html5parser


text = "<html lang='en'><body><a href='http://localhost'>hello</body></html>"
e = html5parser.fromstring(text)

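# Re-parse the serialized tree to get an HtmlElement root, then move the
# children produced by html5parser under it.
html_element = fromstring(tostring(e.getroottree()))
for child in list(e):  # list(e) replaces the deprecated getchildren()
    html_element.append(child)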

print(tostring(html_element))


def rewriter(link):
    # Replace every link with the same fixed URL.
    return "http://newlink.com"

html_element.rewrite_links(rewriter)
print(tostring(html_element.body))

This will output:

b'<html><body><html xmlns:html="http://www.w3.org/1999/xhtml" lang="en"><head></head><body><a href="http://localhost">hello</a></body></html></body><html:head xmlns:html="http://www.w3.org/1999/xhtml"></html:head><html:body xmlns:html="http://www.w3.org/1999/xhtml"><html:a href="http://localhost">hello</html:a></html:body></html>'
b'<body><html xmlns:html="http://www.w3.org/1999/xhtml" lang="en"><head></head><body><a href="http://newlink.com">hello</a></body></html></body>'

So both attributes like 'body' and methods like 'rewrite_links' work in this situation.

Ns answered 23/2, 2020 at 17:0 Comment(0)
