I've been trying to parse some huge XML files that lxml won't grok, so I'm forced to parse them with xml.sax.

    from xml import sax

    class SpamExtractor(sax.ContentHandler):
        def startElement(self, name, attrs):
            if name == "spam":
                print("We found a spam!")
                # now what?

The problem is that I don't understand how to actually return, or better, yield, the things that this handler finds to the caller, without waiting for the entire file to be parsed. So far, I've been messing around with threading.Thread and Queue.Queue, but that leads to all kinds of issues with threads that are really distracting me from the actual problem I'm trying to solve.
I know I could run the SAX parser in a separate process, but I feel there must be a simpler way to get the data out. Is there?
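
For illustration, here is one way this could work without threads or a second process. It is a minimal sketch, assuming the parser returned by xml.sax.make_parser() supports the incremental feed()/close() interface (the default expat-based reader does). The names SpamCollector, iter_spam and chunk_size are hypothetical and not part of the original code. The handler buffers what it finds, and a generator feeds the file chunk by chunk and yields the buffered items after each chunk, so results arrive at roughly chunk granularity rather than only at the end of the file.

```python
import io
import xml.sax


class SpamCollector(xml.sax.ContentHandler):
    """Hypothetical handler that buffers matches instead of printing them."""

    def __init__(self):
        super().__init__()
        self.found = []

    def startElement(self, name, attrs):
        if name == "spam":
            # Copy the attributes into a plain dict so they outlive the callback.
            self.found.append(dict(attrs.items()))


def iter_spam(stream, chunk_size=64 * 1024):
    """Yield the attribute dict of each <spam> element as chunks are parsed."""
    handler = SpamCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        # Hand over whatever the handler collected so far, oldest first.
        while handler.found:
            yield handler.found.pop(0)
    parser.close()
    # close() may flush remaining input and trigger a few more callbacks.
    yield from handler.found
```

A caller would then write something like `for attrs in iter_spam(open("huge.xml", "rb")): ...`, where the file name is a placeholder. Memory use stays bounded by the chunk size plus whatever the handler buffers between yields.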
cElementTree, not ElementTree (2) lxml also has an iterparse, which provides the same or better functionality (3) you need to mention deleting nodes after you have extracted the required info (4) AFAICT (never tried it) a generator should work OK – Fatma
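
The points in this comment (iterparse, deleting nodes once they have been used, and wrapping the loop in a generator) could be combined roughly as follows. This is a sketch, not a tested recipe for any particular file: iter_spam_elements and its tag parameter are hypothetical names, and it uses xml.etree.ElementTree, which picks up the C accelerator automatically on Python 3.3 and later (the separate cElementTree module was removed in Python 3.9). It assumes only the attributes and text of each matching element are needed.

```python
import io
import xml.etree.ElementTree as ET


def iter_spam_elements(source, tag="spam"):
    """Yield (attributes, text) for each matching element as it is completed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Copy what we need before the element is emptied below.
            yield dict(elem.attrib), elem.text
        # Drop the element's children, text and attributes to limit memory use.
        elem.clear()
```

Note that clear() empties each element but the emptied nodes remain attached to their parent, so for extremely large files some additional pruning of the root may still be needed.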