Parsing very large HTML file with Python (ElementTree?)

I asked about using BeautifulSoup to parse a very large (270MB) HTML file and getting a memory error, and was pointed toward ElementTree as a solution.

I was trying to use their event-driven parsing, documented here. Testing it with the smaller settings file worked fine:

>>> import xml.etree.ElementTree as ET
>>> settings = open('S:\\Documents\\FacebookData\\html\\settings.htm')
>>> for event, element in ET.iterparse(settings, events=("start", "end")):
...     print("%5s, %4s, %s" % (event, element.tag, element.text))

This prints out the elements successfully. However, running the same code on 'messages.htm' instead of 'settings.htm', just to check that it works before writing any real code, gives this result:

Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    for event, element in ET.iterparse(source, events=("start", "end")):
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1294, in __next__
    for event in self._parser.read_events():
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1277, in read_events
    raise event
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1235, in feed
    self._parser.feed(data)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 6

I'm wondering if this is because ET is just better suited to parsing XML documents? If this is the case, and there's no workaround, then I'm back to square one. Any suggestions on how to parse this file, along with how to debug along the way would be greatly appreciated!

Polson answered 4/7, 2015 at 21:19 Comment(2)

try the HTML-Parser from lxml. – Selection 4/7, 2015 at 21:34

Iteratively parsing HTML (with lxml?) – Alanalana 5/7, 2015 at 2:9

A good solution for parsing HTML or XML is lxml with XPath.

To use XPath:

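from lxml import etree

# Read the whole file and build a tree with lxml's lenient HTML parser.
with open('result.html', 'r') as f:
    data = f.read()
doc = etree.HTML(data)

# Print the cell text of every matching table row.
for tr in doc.xpath('//table/tr[@class="trmenu1"]'):
    print(tr.xpath('./td/text()'))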

Seamount answered 4/7, 2015 at 21:51 Comment(0)

HTML is not, in general, well-formed XML. That is why, in some cases, you have to use HTMLParser instead of ElementTree to parse an HTML file.
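As a rough illustration, here is a minimal sketch using only the standard library. It first shows ElementTree rejecting a small HTML snippet, then feeds the same text to html.parser.HTMLParser in chunks so the whole file never has to sit in memory at once. The names SAMPLE, MessageCollector and collect_messages are made up for this example, and the <div class=message> layout is only an assumption, not the real structure of messages.htm.

```python
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: fine as HTML, but not well-formed XML
# (unquoted attribute value, unclosed <br>).
SAMPLE = (
    "<html><body>"
    "<div class=message>Hello<br>world</div>"
    "<div class=message>Bye</div>"
    "</body></html>"
)


def is_well_formed_xml(text):
    """Return True if ElementTree can parse the text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class MessageCollector(HTMLParser):
    """Collect the text of every <div class="message"> (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # > 0 while inside a matching <div>
        self.parts = []      # text pieces of the message being read
        self.messages = []   # finished messages

    def handle_starttag(self, tag, attrs):
        if tag == "br" and self.depth:
            self.parts.append(" ")   # treat a line break as a space
        elif tag == "div":
            if self.depth:
                self.depth += 1      # nested <div> inside a message
            elif dict(attrs).get("class") == "message":
                self.depth = 1
                self.parts = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.messages.append("".join(self.parts))

    def handle_data(self, data):
        # Text may arrive in several pieces when a chunk boundary
        # falls inside it, so pieces are joined without a separator.
        if self.depth:
            self.parts.append(data)


def collect_messages(stream, chunk_size=64 * 1024):
    """Feed a text stream to the parser one chunk at a time."""
    parser = MessageCollector()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()
    return parser.messages


if __name__ == "__main__":
    print(is_well_formed_xml(SAMPLE))              # False
    print(collect_messages(io.StringIO(SAMPLE)))   # ['Hello world', 'Bye']
```

For a real file you would pass an open file object instead of the StringIO, for example open(path, encoding="utf-8"), and probably handle each message as soon as it is complete rather than keeping them all in a list.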

Best regards, Emmanuel

Edette answered 4/7, 2015 at 21:53 Comment(0)
