As iterparse
iterates over the entire file a tree is built and no elements are freed. The advantage of doing this is that the elements remember who their parent is, and you can form XPaths that refer to ancestor elements. The disadvantage is that it can consume a lot of memory.
In order to free some memory as you parse, use Liza Daly's fast_iter
:
def fast_iter(context, func, *args, **kwargs):
"""
http://lxml.de/parsing.html#modifying-the-tree
Based on Liza Daly's fast_iter
http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
See also http://effbot.org/zone/element-iterparse.htm
"""
for event, elem in context:
func(elem, *args, **kwargs)
# It's safe to call clear() here because no descendants will be
# accessed
elem.clear()
# Also eliminate now-empty references from the root node to elem
for ancestor in elem.xpath('ancestor-or-self::*'):
while ancestor.getprevious() is not None:
del ancestor.getparent()[0]
del context
which you could then use like this:
def process_element(elem):
print "why does this consume all my memory?"
context = lxml.etree.iterparse('really-big-file.xml', tag='schedule', events=('end',))
fast_iter(context, process_element)
I highly recommend the article on which the above fast_iter
is based; it should be especially interesting to you if you are dealing with large XML files.
The fast_iter
presented above is a slightly modified version of the one shown in the article. This one is more aggressive about deleting previous ancestors, thus saves more memory. Here you'll find a script which demonstrates the
difference.