Iteratively parsing HTML (with lxml?)
I'm currently trying to iteratively parse a very large HTML document (I know.. yuck) using lxml.etree.iterparse:

Incremental parser. Parses XML into a tree and generates tuples (event, element) in a SAX-like fashion

I am using an incremental/iterative/SAX-style approach to keep memory usage low (I don't want to load the HTML into a DOM/tree, because the file is large).

The problem I'm having is that I'm getting XML syntax errors such as:

lxml.etree.XMLSyntaxError: Attribute name redefined, line 134, column 59

This then causes everything to stop.

Is there a way to iteratively parse HTML without choking on syntax errors?

At the moment I'm extracting the line number from the XML syntax error exception, removing that line from the document, and then restarting the process. Seems like a pretty disgusting solution. Is there a better way?

Edit:

This is what I'm currently doing:

import re
from lxml import etree

context = etree.iterparse(tfile, events=('start', 'end'), html=True)
in_table = False
header_row = True
while True:
    try:
        event, el = context.next()

        # do something

        # free elements we've already processed to keep memory usage low
        while el.getprevious() is not None:
            del el.getparent()[0]

    except StopIteration:
        break
    except etree.XMLSyntaxError, e:
        print e.msg
        # pull the offending line number out of the error message, delete
        # that line from the file on disk, then restart the whole parse
        lineno = int(re.search(r'line (\d+),', e.msg).group(1))
        remove_line(tfilename, lineno)
        tfile = open(tfilename)
        context = etree.iterparse(tfile, events=('start', 'end'), html=True)
    except KeyError:
        print 'oops keyerror'
Demonstrative answered 12/12, 2011 at 16:41 Comment(0)

The perfect solution ended up being Python's very own HTMLParser [docs].

This is the (pretty bad) code I ended up using:

from HTMLParser import HTMLParser  # html.parser in Python 3

class MyParser(HTMLParser):
    def __init__(self):
        self.finished = False
        self.in_table = False
        self.in_row = False
        self.in_cell = False
        self.current_row = []
        self.current_cell = ''
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if not self.in_table:
            # only start collecting once we hit the target table
            if tag == 'table' and attrs.get('id') == 'dgResult':
                self.in_table = True
        else:
            if tag == 'tr':
                self.in_row = True
            elif tag == 'td':
                self.in_cell = True
            elif tag == 'a' and len(self.current_row) == 7:
                # the eighth column holds a link; keep its URL, not its text
                self.current_cell = attrs.get('href', '')

    def handle_endtag(self, tag):
        if tag == 'tr':
            if self.in_table and self.in_row:
                self.in_row = False
                print self.current_row
                self.current_row = []
        elif tag == 'td':
            if self.in_table and self.in_cell:
                self.in_cell = False
                self.current_row.append(self.current_cell.strip())
                self.current_cell = ''
        elif tag == 'table' and self.in_table:
            self.finished = True

    def handle_data(self, data):
        # accumulate text for ordinary cells (the link column is set above)
        if self.in_cell and len(self.current_row) != 7:
            self.current_cell += data

With that code I could then do this:

parser = MyParser()
for line in myfile:
    parser.feed(line)
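Feeding line by line works, but HTMLParser doesn't care where the chunks are split: feed() buffers incomplete markup between calls, so fixed-size chunks work just as well. A minimal self-contained sketch (Python 3 `html.parser`; the class and document here are hypothetical, not the OP's):

```python
from html.parser import HTMLParser  # on Python 2: from HTMLParser import HTMLParser

class CellCounter(HTMLParser):
    """Toy parser: collects the text of every <td> cell."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data

# feed() accepts arbitrary chunks, so a huge file never has to fit in memory
parser = CellCounter()
document = '<table><tr><td>a</td><td>b</td></tr></table>'
for i in range(0, len(document), 8):  # pretend 8-byte chunks from a large file
    parser.feed(document[i:i + 8])
parser.close()
print(parser.cells)  # ['a', 'b']
```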
Demonstrative answered 13/12, 2011 at 4:18 Comment(0)

lxml's etree.iterparse now supports the keyword argument recover=True, so instead of writing a custom HTMLParser subclass to cope with broken HTML you can just pass this argument to iterparse.

To parse a huge, broken HTML document you only need the following:

etree.iterparse(tfile, events=('start', 'end'), html=True, recover=True)
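For illustration, a small sketch of the recover behaviour (lxml 3.3+) on deliberately broken markup, the kind of duplicate-attribute error the OP hit; the snippet is hypothetical, not from the question:

```python
from io import BytesIO
from lxml import etree

# markup with a duplicated attribute; a strict XML parse would reject this
broken = b'<html><body><p id="x" id="x">hello</p></body></html>'

for event, el in etree.iterparse(BytesIO(broken), events=('start', 'end'),
                                 html=True, recover=True):
    if event == 'end' and el.tag == 'p':
        print(el.text)  # hello
```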
Jed answered 17/8, 2015 at 11:52 Comment(3)
This is the best answer for me. – Hiroshima
This is a good suggestion. But note that if html is True and recover is not specified, recover already defaults to True, as the iterparse documentation says: "recover: try hard to parse through broken input (default: True for HTML)" (lxml.de/api/lxml.etree.iterparse-class.html). – Whyalla
Ah, so lxml version 2.3, available around the time of the OP's question (December 2011), doesn't have the recover argument. It looks like recover was introduced in lxml version 3.3 (released in January 2014), so it's good that @Pawel pointed it out! (But the OP doesn't need to change their code, since they already use html=True; just update lxml!) – Whyalla

Use True for iterparse's arguments html and huge_tree.

Abbe answered 12/12, 2011 at 17:9 Comment(2)
I am currently using html=True, and it still raises XML syntax errors. I'll take a look at the huge_tree parameter. – Demonstrative
huge_tree doesn't seem relevant: "huge_tree: disable security restrictions and support very deep trees". My tree isn't deep, just long. – Demonstrative

Sorry for rehashing an old question, but for latecomers searching for a solution: lxml version 3.3 has HTMLPullParser and XMLPullParser, which parse incrementally. See also lxml's introduction to parsing for more examples.

If you want to parse a very large document while saving memory, you can write a custom target class as the event handler, avoiding the element tree entirely. Something like:

class MyParserTarget:
    def start(self, tag, attrib) -> None:
        ...  # called for each opening tag
    def end(self, tag) -> None:
        ...  # called for each closing tag
    def data(self, data) -> None:
        ...  # called for text between tags
    def close(self):
        ...  # return your result

mytarget = MyParserTarget()
parser = lxml.etree.HTMLPullParser(target=mytarget)
parser.feed(next(content))  # content: any iterator yielding chunks of markup
# Do other stuff
result = parser.close()

If you continue to use etree.iterparse(..., html=True) as in the OP's question, it will use HTMLPullParser under the hood, but iterparse does not accept a custom target instance like the one shown here, not even in the latest version of lxml. So if you prefer the custom-target approach (versus the events argument shown in the OP), use HTMLPullParser directly.
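If you only want events rather than a target class, the pull parser can also be read incrementally via read_events(); a short sketch on a hypothetical two-chunk document:

```python
from lxml import etree

parser = etree.HTMLPullParser(events=('start', 'end'))
# feed the document piecewise; read_events() yields whatever is parsed so far
for chunk in ('<html><body><p>he', 'llo</p></body></html>'):
    parser.feed(chunk)
    for event, el in parser.read_events():
        if event == 'end' and el.tag == 'p':
            print(el.text)  # hello
parser.close()
```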

Roanna answered 18/8, 2021 at 14:39 Comment(0)

Try parsing your HTML document with lxml.html:

Since version 2.0, lxml comes with a dedicated Python package for dealing with HTML: lxml.html. It is based on lxml's HTML parser, but provides a special Element API for HTML elements, as well as a number of utilities for common HTML processing tasks.
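As a quick illustration of the API (note this builds the whole tree in memory, which is exactly what the question is trying to avoid; the fragment here is hypothetical):

```python
from lxml import html

# lxml.html copes with sloppy markup and adds HTML-specific helpers
doc = html.fromstring('<table id="dgResult"><tr><td>cell</td></tr></table>')
for td in doc.findall('.//td'):
    print(td.text_content())  # cell
```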

Stringer answered 12/12, 2011 at 16:51 Comment(2)
I'm trying to iteratively parse the document due to its large size. lxml.html does not have an iterparse function as far as I can tell.Demonstrative
I suggested lxml.html because in the OP there was no mention of trying lxml.html. I think that down-voting my answer is rather misguided.Stringer

© 2022 - 2024 — McMap. All rights reserved.