Why is lxml.etree.iterparse() eating up all my memory?

The code below eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags, but that didn't make a difference.

What am I doing wrong / how can I process this large file with iterparse()?

import lxml.etree

for schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
    print("why does this consume all my memory?")

I can easily cut it up and process it in smaller chunks but that's uglier than I'd like.

Garrow answered 28/8, 2012 at 13:34 Comment(0)

As iterparse iterates over the entire file, a tree is built and no elements are freed. The advantage of this is that elements remember who their parent is, and you can form XPaths that refer to ancestor elements. The disadvantage is that it can consume a lot of memory.
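A small sketch illustrates this: even with a tag filter, every yielded element still sits inside the growing tree, so its ancestors remain reachable (and in memory). The in-memory document here is a hypothetical stand-in for the real file.

```python
import io
import lxml.etree

# Tiny hypothetical document standing in for 'really-big-file.xml'
xml = b"<root><schedule id='1'/><schedule id='2'/></root>"

parents = []
for event, elem in lxml.etree.iterparse(io.BytesIO(xml), tag='schedule'):
    # The parent is still reachable: the whole tree is being kept in memory
    parents.append(elem.getparent().tag)

print(parents)  # ['root', 'root']
```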

In order to free some memory as you parse, use Liza Daly's fast_iter:

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

which you could then use like this:

def process_element(elem):
    print("why does this consume all my memory?")

context = lxml.etree.iterparse('really-big-file.xml', tag='schedule', events=('end',))
fast_iter(context, process_element)

I highly recommend the article on which the above fast_iter is based; it should be especially interesting to you if you are dealing with large XML files.

The fast_iter presented above is a slightly modified version of the one shown in the article. This one is more aggressive about deleting previous ancestors, thus saves more memory. Here you'll find a script which demonstrates the difference.
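To see the effect end to end, here is a self-contained sketch (repeating fast_iter from above so it runs on its own, with a small in-memory document as a hypothetical stand-in for a large file). After each element is processed, it is cleared and its already-parsed preceding siblings are deleted, so the tree never grows beyond a handful of nodes.

```python
import io
import lxml.etree

def fast_iter(context, func, *args, **kwargs):
    # Same helper as above: process each element, then free it and
    # delete preceding siblings of it and of its ancestors.
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

# Hypothetical in-memory stand-in for 'really-big-file.xml'
xml = b"<root>" + b"<schedule/>" * 1000 + b"</root>"
seen = []

context = lxml.etree.iterparse(io.BytesIO(xml), tag='schedule', events=('end',))
fast_iter(context, lambda elem: seen.append(elem.tag))

print(len(seen))  # 1000 elements processed, without keeping them all in memory
```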

Lattie answered 28/8, 2012 at 14:6 Comment(6)
Thanks! Both your solution and the one I just added seem to do the trick, I'm curious which one you and other people feel is a better solution. Do you have any thoughts? – Garrow
Turns out your solution works and the effbot.org/zone/element-iterparse.htm solution did not (it still ate all my memory). – Garrow
Thank you! This is the version that really works. Versions from Liza Daly, effbot, and lxml official docs did NOT save much memory for me. – Cassiopeia
The IBM article is not available anymore, fortunately, it was archived: web.archive.org/web/20210309115224/http://www.ibm.com/… – Dichasium
I'm getting AttributeErrors all over the place from this code - xml.etree.ElementTree.Element does not have xpath(), does not have getprevious()... are there multiple versions of this library or something? – Workman
With Python 3.11 and the latest version of lxml, the solution in this answer seems to be no longer needed. I'm now using iter() for large XML files and it is faster than the solution in this answer with no memory issues. – Molten

Directly copied from http://effbot.org/zone/element-iterparse.htm

Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you’ve processed them:

for event, elem in iterparse(source):
    if elem.tag == "record":
        ... process record elements ...
        elem.clear()

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()
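Under Python 3, the same pattern can be written as below (a minimal sketch using the stdlib xml.etree with hypothetical in-memory data; note that next(context) replaces the old context.next()):

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical in-memory stand-in for a huge source file
xml = b"<root>" + b"<record>x</record>" * 3 + b"</root>"

# get an iterable and turn it into an iterator
context = iter(iterparse(io.BytesIO(xml), events=("start", "end")))

# get the root element
event, root = next(context)

count = 0
for event, elem in context:
    if event == "end" and elem.tag == "record":
        count += 1      # ... process record elements ...
        root.clear()    # drop already-processed children from the root

print(count)  # 3
```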
Garrow answered 28/8, 2012 at 14:12 Comment(1)
Note that context.next() becomes next(context) in Python 3. – Hendiadys

This worked really well for me:

def destroy_tree(tree):
    root = tree.getroot()

    node_tracker = {root: [0, None]}

    for node in root.iterdescendants():
        parent = node.getparent()
        node_tracker[node] = [node_tracker[parent][0] + 1, parent]

    node_tracker = sorted([(depth, parent, child) for child, (depth, parent)
                           in node_tracker.items()], key=lambda x: x[0], reverse=True)

    for _, parent, child in node_tracker:
        if parent is None:
            break
        parent.remove(child)

    del tree  # only drops this local reference; the caller must release theirs too
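A quick usage sketch with a tiny hypothetical document (repeating the helper so it runs on its own): after destroy_tree, every descendant has been detached from the root bottom-up, so nothing keeps the subtrees alive.

```python
import io
import lxml.etree

def destroy_tree(tree):
    # Same helper as above: record each node's depth and parent,
    # then detach children from their parents, deepest first.
    root = tree.getroot()
    node_tracker = {root: [0, None]}
    for node in root.iterdescendants():
        parent = node.getparent()
        node_tracker[node] = [node_tracker[parent][0] + 1, parent]
    node_tracker = sorted([(depth, parent, child) for child, (depth, parent)
                           in node_tracker.items()], key=lambda x: x[0], reverse=True)
    for _, parent, child in node_tracker:
        if parent is None:
            break
        parent.remove(child)

tree = lxml.etree.parse(io.BytesIO(b"<root><a><b/></a><c/></root>"))
destroy_tree(tree)
print(len(tree.getroot()))  # 0: all descendants have been detached
```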
Bogart answered 6/3, 2018 at 21:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.