lxml convert element to elementtree
3

15

The following test code reads a file and uses lxml.html to print the leaf nodes of the page's DOM tree.

However, I'm also trying to work out how to take the input from a string instead of a file. Using:

lxml.html.fromstring(s)

doesn't work, as this returns an Element rather than an ElementTree.

So, I'm trying to figure out how to convert an Element to an ElementTree.

[my test code]

import subprocess

import lxml.html
from lxml import etree    # trying this to see if it is needed
                          # to convert from Element to ElementTree

#cmd = 'cat osu_test.txt'
cmd = 'cat o2.txt'
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
s = proc.communicate()[0].strip()

# s contains HTML, not XML
#doc = lxml.html.parse(s)
doc = lxml.html.parse('osu_test.txt')
doc1 = lxml.html.fromstring(s)

# print every leaf node of the tree parsed from the file
for node in doc.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc.getpath(node)

nt = etree.ElementTree(doc1)    # <<<<< doesn't work.. so what will??
for node in nt.iter():
    if len(node) == 0:
        # paths are looked up on nt, the tree these nodes belong to
        print "aaa ", node.tag, nt.getpath(node)

UPDATE 1:

(I'm parsing HTML, not XML.) I added the changes suggested by Abbas and got the following error:

    doc1 = etree.fromstring(s)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48621)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72232)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71093)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67862)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64244)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65165)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64508)
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 48, column 220
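
For anyone hitting the same traceback, here is a minimal sketch of the difference between the two parsers. It is only an illustration: `snippet` is a made-up stand-in for the real page, not the contents of the asker's file.

```python
from lxml import etree
import lxml.html

# Hypothetical sample input; the real file is not shown in the question.
snippet = "<p>a&nbsp;b</p>"

# The XML parser only predefines a handful of entities, so &nbsp; is rejected.
try:
    etree.fromstring(snippet)
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = exc

# The HTML parser knows the HTML entity set and decodes &nbsp; itself.
html_root = lxml.html.fromstring(snippet)

print(type(xml_error).__name__, xml_error)
print(html_root.tag, repr(html_root.text))
```

So the error comes from feeding HTML to the XML parser, not from the string input as such.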

UPDATE 2:

I managed to get the test working, but I'm not exactly sure why. If someone with Python chops wants to provide an explanation, that would help future people who stumble on this.

from cStringIO import StringIO
from lxml.html import parse

doc1 = parse(StringIO(s))

for node in doc1.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc1.getpath(node)

It appears that StringIO wraps the string in a file-like object, which is what lxml.html.parse needs in order to process the test HTML from a string. Perhaps it is similar to what casting provides in other languages.
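
A rough Python 3 sketch of the same idea, under some assumptions: cStringIO no longer exists there, so `io.BytesIO` is used as the file-like wrapper, and `html_text` is a made-up sample rather than the real page. The parse function takes a filename, URL, or file-like object, which is why a raw HTML string cannot be passed to it directly.

```python
import io

import lxml.html

# Hypothetical sample input standing in for the string read from the file.
html_text = "<html><body><div><p>one</p><p>two</p></div></body></html>"

# Wrap the in-memory data in a file-like object so parse() can read it.
# Passing html_text directly would be treated as a filename or URL.
tree = lxml.html.parse(io.BytesIO(html_text.encode("utf-8")))

# parse() returns a tree object, so iter() and getpath() are both available.
leaf_paths = [tree.getpath(node) for node in tree.iter() if len(node) == 0]
print(leaf_paths)
```

This is less a cast than an adapter: the data stays the same, and only the interface the parser sees changes.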

Bluejacket answered 12/1, 2012 at 2:52 Comment(3)
The XML parser is objecting to the '&nbsp;' in your HTML. Your HTML has to be well-formed and must either avoid characters that the parser cannot digest or escape them correctly. - Psi
Hey Abbas, I don't agree with what you're saying. The HTML in the test file now works when I use the solution I posted above, passing a StringIO to parse. - Bluejacket
That's because you are now using an HTML parser (lxml.html) together with StringIO. etree.fromstring tries to parse the HTML as XML and fails because of entities that are only defined for HTML (&nbsp;). I don't know why you would disagree with me when I proposed a solution based on your requirement of getting an ElementTree from etree by passing it a string. You later changed your solution; mine is still valid for your original requirement. - Psi
11

To get the root tree from an _Element (such as the one returned by lxml.html.fromstring), you can use the getroottree method:

doc = lxml.html.fromstring(s)
tree = doc.getroottree()
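
A slightly fuller sketch of this, using a made-up `html_text` sample in place of the asker's string, shows that the resulting tree supports the leaf-node loop from the question:

```python
import lxml.html

# Hypothetical sample input; replace with the string read from the file.
html_text = "<html><body><div><p>one</p><p>two</p></div></body></html>"

root = lxml.html.fromstring(html_text)   # an element (here the <html> root)
tree = root.getroottree()                # the tree object that owns it

# Same leaf-node loop as in the question, run against the new tree.
for node in tree.iter():
    if len(node) == 0:
        print(node.tag, tree.getpath(node))
```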
Papaveraceous answered 12/1, 2012 at 8:16 Comment(0)
6

The etree.fromstring function parses an XML string and returns the root element. The etree.ElementTree class is a tree wrapper around an element and therefore requires an element for instantiation.

Therefore, passing the root element to the etree.ElementTree() constructor should give you what you want:

root = etree.fromstring(s)
nt = etree.ElementTree(root)
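
If the input is HTML rather than XML, one option is to keep the ElementTree constructor but take the root element from the HTML parser instead. This is a sketch under that assumption, not part of the answer as originally posted, and `html_text` is a made-up sample.

```python
from lxml import etree
import lxml.html

# Hypothetical sample input standing in for the asker's HTML string.
html_text = "<html><body><div><p>one</p><p>two</p></div></body></html>"

root = lxml.html.fromstring(html_text)   # HTML parser, returns an element
nt = etree.ElementTree(root)             # wrap that element in a tree

# Ask the tree that owns the nodes for their paths.
paths = [nt.getpath(node) for node in nt.iter() if len(node) == 0]
print(paths)
```

The wrapping step is the same as in the XML case; only the parser that produces the root changes.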
Psi answered 12/1, 2012 at 3:32 Comment(3)
Hey Abbas, thanks for the reply. I tried it and got the error listed above. (I'm parsing HTML, not XML.) - Bluejacket
Please add your HTML to the question as well. - Psi
This works for an XML string parsed from an Atom-like response. - Berryberryhill
1

An _Element, such as the one returned by a call like:

tree = etree.HTML(result.read(), etree.HTMLParser())

can be made into an _ElementTree like so:

tree = tree.getroottree()  # convert _Element to _ElementTree

Hope that's what you expect.

Flaviaflavian answered 12/1, 2012 at 8:15 Comment(0)
