using lxml and iterparse() to parse a big (+- 1Gb) XML file
I have to parse a 1Gb XML file with a structure such as below and extract the text within the tags "Author" and "Content":

<Database>
    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    [...]

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>
</Database>

So far I've tried two things: (i) reading the whole file and searching it with .find(xmltag), and (ii) parsing the XML file with lxml and iterparse(). I got the first option to work, but it is very slow. The second I haven't managed to get off the ground.

Here's part of what I have:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    if element.tag == "BlogPost":
        print element.text
    else:
        print 'Finished'

The result of that is only blank spaces, with no text in them.

I must be doing something wrong, but I can't grasp it. Also, in case it wasn't obvious, I'm quite new to Python and this is the first time I'm using lxml. Please help!

Spectre answered 24/3, 2012 at 22:25 Comment(3)
Well, the BlogPost tags don't seem to contain any text in them. – Pelage
True. What would be the way to get everything that's between the opening and closing BlogPost tags? – Spectre
If you simply need all the info from inside the BlogPost tags, follow andrew's advice. If you want it HTML-formatted, apply lxml.etree.tostring() to them. – Pelage
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()

The final clear() will stop you from using too much memory. (Note that it belongs after the inner loop: calling clear() while still iterating over the element's children would remove them mid-iteration.)
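Putting the pieces together, here's a minimal sketch of the whole pattern. The findtext() calls assume the exact Author/Content structure from the question, and the sibling-deletion loop is an extra step (from the effbot incremental-parsing recipe) so the root doesn't slowly accumulate empty elements:

```python
from lxml import etree

def extract_posts(path_to_file):
    # Yield (author, content) pairs without loading the whole tree.
    for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
        author = element.findtext("Author")
        content = element.findtext("Content")
        yield author, content
        element.clear()
        # Also delete already-processed siblings, otherwise the root
        # keeps references to an ever-growing list of empty elements.
        while element.getprevious() is not None:
            del element.getparent()[0]
```

getprevious() and getparent() are lxml-specific; the stdlib xml.etree.ElementTree does not provide them.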

[update:] To get "everything between ... as a string", I guess you want one of:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  print(etree.tostring(element))
  element.clear()

or

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  print(''.join([etree.tostring(child) for child in element]))
  element.clear()

or perhaps even:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  print(''.join([child.text for child in element]))
  element.clear()
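One caveat if you run the tostring() variants above on Python 3: etree.tostring() returns bytes by default, so the ''.join(...) version would fail. A sketch of a Python 3-friendly variant, passing encoding="unicode" to get str back directly:

```python
from lxml import etree

def blogpost_strings(path_to_file):
    # On Python 3, etree.tostring() returns bytes by default;
    # encoding="unicode" makes it return str instead.
    for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
        yield etree.tostring(element, encoding="unicode")
        element.clear()
```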
Haroun answered 24/3, 2012 at 22:53 Comment(5)
This works pretty much like I wanted. I'll have to customize it a bit, but it's great. Thanks! – Spectre
Is there a way to get everything between the starting and ending "BlogPost" tags as a string? – Spectre
@mvime, as what kind of string? In HTML format? Then see my comment above; the lxml.etree.tostring() method does that. You can cut the opening and closing tags off using slice notation (see this table). – Pelage
Should the element.close() be element.clear() in the later fragments? It's so long since I wrote this that I no longer remember, but it looks wrong to me. – Haroun
I also had to parse a 1.8 GB XML file, using the same clear() approach. But clear() does not actually remove the element from memory, and at the end you end up with a root full of empty elements, which takes memory too. So I deleted each element after parsing using a "del" statement, which helped free memory. Read effbot.org/zone/element-iterparse.htm#incremental-parsing to see exactly what happens. – Priesthood

For future searchers: The top answer here suggests clearing the element on each iteration, but that still leaves you with an ever-increasing set of empty elements that will slowly build up in memory:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()

^ This is not a scalable solution, especially as your source file gets larger and larger. The better solution is to get the root element, and clear that every time you load a complete record. This will keep memory usage pretty stable (sub-20MB I would say).

Here's a solution that doesn't require looking for a specific tag. This function will return a generator that yields all 1st child nodes (e.g. <BlogPost> elements) underneath the root node (e.g. <Database>). It does this by recording the start of the first tag after the root node, then waiting for the corresponding end tag, yielding the entire element, and then clearing the root node.

from lxml import etree

xmlfile = '/path/to/xml/file.xml'

def iterate_xml(xmlfile):
    doc = etree.iterparse(xmlfile, events=('start', 'end'))
    _, root = next(doc)
    start_tag = None
    for event, element in doc:
        if event == 'start' and start_tag is None:
            start_tag = element.tag
        if event == 'end' and element.tag == start_tag:
            yield element
            start_tag = None
            root.clear()
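A usage sketch, with the generator repeated so the snippet stands on its own; the Author/Content extraction assumes the structure from the question:

```python
from lxml import etree

def iterate_xml(xmlfile):
    # Yield each first-level child of the root, clearing the root afterwards.
    doc = etree.iterparse(xmlfile, events=('start', 'end'))
    _, root = next(doc)
    start_tag = None
    for event, element in doc:
        if event == 'start' and start_tag is None:
            start_tag = element.tag
        if event == 'end' and element.tag == start_tag:
            yield element
            start_tag = None
            root.clear()

# usage (extract each post's fields before requesting the next one,
# since the root is cleared when the generator resumes):
# for post in iterate_xml('/path/to/xml/file.xml'):
#     print(post.findtext('Author'), post.findtext('Content'))
```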
Forwent answered 12/2, 2017 at 22:22 Comment(1)
Well, I quite liked the idea. But what if I need to support multiple file structures; how could I do it without finding a specific tag? For example, say I have two types of XML file: in one the structure is source->jobs->job->..., in the other it's jobs->job. I want to fetch only the job elements. How do I do that with this solution? – Julesjuley
I prefer XPath for such things:

In [1]: from lxml.etree import parse

In [2]: tree = parse('/tmp/database.xml')

In [3]: for post in tree.xpath('/Database/BlogPost'):
   ...:     print('Author:', post.xpath('Author')[0].text)
   ...:     print('Content:', post.xpath('Content')[0].text)
   ...: 
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.

I'm not sure if it's different in terms of processing big files, though. Comments about this would be appreciated.

Doing it your way,

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for info in element.iter():
        if info.tag in ('Author', 'Content'):
            print(info.tag, ':', info.text)
Pelage answered 24/3, 2012 at 22:36 Comment(6)
Hmm, I've simplified the tree a little bit, and when I try it, it doesn't seem to work. The BlogPost tag, for example, is not simply '<BlogPost>' but '<BlogPost Owner="Author" Status="Draft">', and the values for Owner and Status change from one entry to the other. – Spectre
Additional attributes won't affect this; only the tree structure matters. To catch all the BlogPost elements, you can also use for post in tree.xpath('//BlogPost'): ... – Pelage
Thanks! I can't vote up yet, but you helped me understand how it works. The answer that I understand better, and have gotten to work, is Andrew's though. – Spectre
Thanks @andrew. You have mine, too, mostly for the clear() method that I didn't know of. – Pelage
I made a comparison recently, and iterparse with clear() consumes much less memory than plain XPath. – Pelage
XPath is very nice, but note that you had to read the entire tree in first with the call to parse(), which doesn't scale well for large files. I have a 3.5 GB XML file I'm working with, and parse() fails; the iterparse() approach still works. – Troll
