Should memory usage increase when using ElementTree.iterparse() when clear()ing trees?

import sys
import xml.etree.ElementTree as et

# Stream the document from standard input, clearing each element as it is reported.
for ev, el in et.iterparse(sys.stdin):
    el.clear()

Running the above on the ODP structure RDF dump results in memory usage that keeps growing. Why is that? I understand ElementTree still builds a parse tree, albeit with the child nodes clear()ed. If that is the cause of this memory usage pattern, is there a way around it?

Femi answered 9/4, 2012 at 13:49 Comment(3)
Please clarify "always increasing". If you do the above in a loop, does the memory usage explode? Or do you merely see usage go up after doing this once, even after all objects are freed? - Carlyn
I mean that I expect memory usage for the program above to remain constant. Instead, it shows a monotonic increase. - Femi
Running the above in a loop has no effect, since it will just consume stdin. - Femi

You are clearing each element but references to them remain in the root document. So the individual elements still cannot be garbage collected.
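
To make this concrete, here is a minimal sketch (the sample document and the `record` tag are made up for illustration, they are not from the ODP dump). It clears every element as in the question and then counts how many children the root still holds:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: 1000 small <record> elements under one root.
xml_bytes = b"<root>" + b"<record>x</record>" * 1000 + b"</root>"

root = None
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
    if root is None:
        root = elem  # the first "start" event delivers the root element
    if event == "end" and elem is not root:
        elem.clear()  # drops the element's text, attributes and children only

# Every cleared <record> is still attached to the root.
print(len(root))
```

Each cleared element is an empty shell, but the root's child list still references all of them, so none can be collected.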

The solution is to clear references in the root, like so:

import xml.etree.ElementTree as ET

# get iterator
context = ET.iterparse(source, events=("start", "end"))

# get the root element
event, root = next(context)

for event, elem in context:
    if event == "end" and elem.tag == "record":
        # process record elements here...
        root.clear()

Another thing to remember about memory usage, which may not be affecting your situation, is that once the interpreter has obtained memory for its heap from the operating system, it generally does not give that memory back. Most Java VMs work this way too. So you should not expect the size of the interpreter process in top or ps to ever decrease, even if that heap memory is unused.
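
Because of that, process size in top or ps is a poor way to check whether the fix works. One alternative, sketched here under the assumption that the standard-library `tracemalloc` module is acceptable for the measurement, is to compare the peak of Python-level allocations for both strategies. The helper name `peak_bytes`, the `record` tag, and the sample data are hypothetical:

```python
import io
import tracemalloc
import xml.etree.ElementTree as ET

def peak_bytes(xml_bytes, clear_root):
    """Return the peak traced allocation (bytes) while streaming xml_bytes."""
    tracemalloc.start()
    context = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    _, root = next(context)  # first event is the "start" of the root
    for event, elem in context:
        if event == "end" and elem.tag == "record":
            if clear_root:
                root.clear()  # drop the root's references to finished records
            else:
                elem.clear()  # leaves the emptied record attached to the root
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

# Hypothetical sample data, built before tracing starts.
data = b"<root>" + b"<record>x</record>" * 50000 + b"</root>"
print("elem.clear():", peak_bytes(data, clear_root=False))
print("root.clear():", peak_bytes(data, clear_root=True))
```

The exact numbers depend on the Python version and build, so none are quoted here; the expectation is only that the `root.clear()` run peaks noticeably lower.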

Update: the code was changed in order to work in Python 3+.

Carlyn answered 9/4, 2012 at 19:26 Comment(2)
Ah, that is what I wanted to hear. I understood ET was still building a tree, but being new to it, I didn't know how to get at the root of it. Thanks! - Femi
Great answer. Is there any way to change events from ("start", "end") to ("end",) after getting the root? I ask because of the possible performance improvement. - Peaked

As mentioned in the answer by Kevin Guerra, the "root.clear()" strategy in the ElementTree documentation only removes fully parsed children of the root. If those children are anchoring huge branches, it's not very helpful.

He touched on the ideal solution, but didn't post any code, so here is an example:

import xml.etree.ElementTree as ET

element_stack = []  # currently open elements, innermost last
context = ET.iterparse(stream, events=('start', 'end'))
for event, elem in context:
    if event == 'start':
        element_stack.append(elem)
    elif event == 'end':
        element_stack.pop()
        # see if elem is one of interest and do something with it here
        if element_stack:
            # detach the finished element from its parent so it can be freed
            element_stack[-1].remove(elem)
del context

The element of interest will not have subelements; they'll have been removed as soon as their end tags were seen. This might be OK if all you need is the element's text or attributes.
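
A small runnable sketch of that behaviour, using made-up sample data (the `item` and `name` tags are only for illustration): at the time an `item` ends, its attributes are still readable, but its `name` child has already been detached.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data.
xml_bytes = (
    b'<root>'
    b'<item id="1"><name>a</name></item>'
    b'<item id="2"><name>b</name></item>'
    b'</root>'
)

seen = []
element_stack = []
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
    if event == "start":
        element_stack.append(elem)
    else:
        element_stack.pop()
        if elem.tag == "item":
            # Record the id attribute and the number of remaining children.
            seen.append((elem.get("id"), len(elem)))
        if element_stack:
            element_stack[-1].remove(elem)

print(seen)  # [('1', 0), ('2', 0)]
```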

If you want to query into the element's descendants, you need to let a full branch be built for it. For this, maintain a flag, implemented as a depth counter for those elements. Only call .remove() when the depth is zero:

import xml.etree.ElementTree as ET

element_stack = []  # currently open elements, innermost last
interesting_element_depth = 0  # > 0 while inside at least one 'foo' element
context = ET.iterparse(stream, events=('start', 'end'))
for event, elem in context:
    if event == 'start':
        element_stack.append(elem)
        if elem.tag == 'foo':
            interesting_element_depth += 1
    elif event == 'end':
        element_stack.pop()
        if elem.tag == 'foo':
            interesting_element_depth -= 1
            # do something with elem and its descendants here
        # prune only when no 'foo' element is still open
        if element_stack and not interesting_element_depth:
            element_stack[-1].remove(elem)
del context
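
For reuse, the same depth-counter idea can be wrapped in a generator. This is only a sketch: the function name `iter_branches` and the sample data are hypothetical, and nested elements with the same tag are yielded individually, just as in the loop above.

```python
import io
import xml.etree.ElementTree as ET

def iter_branches(stream, tag):
    """Yield each complete `tag` element, pruning the tree everywhere else."""
    element_stack = []
    depth = 0  # > 0 while inside at least one `tag` element
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            element_stack.append(elem)
            if elem.tag == tag:
                depth += 1
        else:
            element_stack.pop()
            if elem.tag == tag:
                depth -= 1
                yield elem  # descendants are intact while the caller uses it
            if element_stack and not depth:
                element_stack[-1].remove(elem)

# Hypothetical sample data.
xml_bytes = (
    b"<root>"
    b"<foo><bar>1</bar><bar>2</bar></foo>"
    b"<skip><bar>9</bar></skip>"
    b"<foo><bar>3</bar></foo>"
    b"</root>"
)
result = [
    [bar.text for bar in foo.findall("bar")]
    for foo in iter_branches(io.BytesIO(xml_bytes), "foo")
]
print(result)  # [['1', '2'], ['3']]
```

The caller has to finish with each yielded element before asking for the next one, because the element is detached from its parent as soon as the generator resumes.
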
Vibrissa answered 12/6, 2017 at 22:26 Comment(0)

I ran into the same issue. The documentation doesn't make things very clear. The issue in my case was:

1) Calling clear() does release the memory held by a node's children. The documentation suggests that it releases all memory, but clear() does not release the node it is called on, because that node is still referenced by its parent.
2) Whether calling root.clear() helps depends on what root is. If root is the parent of the nodes you are done with, it works. Otherwise, it will not free the memory.

The fix was to keep a reference to the parent, and when we no longer need the node, we call parent.remove(child_node). This worked and it kept the memory profile at a few KBs.

Macaroon answered 21/3, 2016 at 22:52 Comment(0)
