I was trying to use ElementTree's event-driven parsing (iterparse), documented here. Testing it with the smaller settings file worked fine:
>>> import xml.etree.ElementTree as ET
>>> settings = open('S:\\Documents\\FacebookData\\html\\settings.htm')
>>> for event, element in ET.iterparse(settings, events=("start", "end")):
...     print("%5s, %4s, %s" % (event, element.tag, element.text))
This successfully prints out the elements. However, when I run the same code on 'messages.htm' instead of 'settings.htm', just to check that it works before starting on the actual code, this is the result:
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    for event, element in ET.iterparse(source, events=("start", "end")):
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1294, in __next__
    for event in self._parser.read_events():
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1277, in read_events
    raise event
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1235, in feed
    self._parser.feed(data)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 6
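To narrow down what the parser objects to at the reported position, a sketch like the following could pull out the text around it. It relies on the `position` attribute of `ParseError`, which holds a (line, column) tuple. The helper name `locate_parse_error` and its `context` parameter are made up for illustration, and the snippet is only approximate if the text contains non-ASCII characters.

```python
import io
import xml.etree.ElementTree as ET


def locate_parse_error(text, context=20):
    """Return (line, column, snippet) for the first ParseError, or None if text parses.

    Hypothetical helper for illustration; not part of ElementTree.
    """
    try:
        # Drain the iterator; we only care whether parsing fails.
        for _event, _element in ET.iterparse(io.StringIO(text), events=("start", "end")):
            pass
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based
        lines = text.split("\n")
        # Guard against errors reported past the last line (e.g. at end of input).
        bad_line = lines[line - 1] if 0 < line <= len(lines) else ""
        start = max(column - context, 0)
        return line, column, bad_line[start:column + context]
    return None
```

Calling it on the contents of 'messages.htm' (read as a string) would return the line, the column, and a short slice of the offending line, which should make the invalid token easier to spot.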
I'm wondering if this is because ET is just better suited to parsing XML documents? If that is the case, and there's no workaround, then I'm back to square one. Any suggestions on how to parse this file, and on how to debug as I go, would be greatly appreciated!
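One alternative that might be worth trying is the standard library's `html.parser.HTMLParser`, which is also event-driven but does not require well-formed XML. This is only a minimal sketch: the class name `EventCollector` and the function `parse_html_events` are made up for illustration, and the tuples it records only loosely mirror the start/end events from `iterparse`.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records (event, tag, text) tuples as the parser runs."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, None))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, None))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.events.append(("data", None, data.strip()))


def parse_html_events(text):
    """Feed a string of HTML to the collector and return the recorded events."""
    parser = EventCollector()
    parser.feed(text)
    parser.close()
    return parser.events
```

For a large file, `feed()` can be called repeatedly with chunks read from the file instead of passing the whole document at once. The file would need to be opened in text mode with the right encoding, which I have not verified for 'messages.htm'.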