using lxml and iterparse() to parse a big (+- 1Gb) XML file
I have to parse a 1Gb XML file with a structure such as below and extract the text within the tags "Author" and "Content":

<Database>
    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    [...]

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>
</Database>

So far I've tried two things: (i) reading the whole file and searching it with .find(xmltag), and (ii) parsing the XML file with lxml and iterparse(). I got the first option to work, but it is very slow. The second I haven't managed to get off the ground.

Here's part of what I have:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    if element.tag == "BlogPost":
        print element.text
    else:
        print 'Finished'

The result of that is only blank spaces, with no text in them.

I must be doing something wrong, but I can't grasp it. Also, in case it wasn't obvious, I'm quite new to Python and this is the first time I'm using lxml. Please help!

Spectre answered 24/3, 2012 at 22:25 Comment(3)
Well, the BlogPost tags don't seem to contain any text in them. – Pelage
True. What would be the way to get everything that's between the opening and closing BlogPost tags? – Spectre
If you simply need all the info from inside the BlogPost tags, follow andrew's advice. If you want it HTML-formatted, apply lxml.etree.tostring() to them. – Pelage
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()

The final clear() will stop you from using too much memory. (Note that it belongs after the inner loop: calling clear() while still iterating over the element's children would remove them mid-iteration.)
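Putting the pieces together, here's a minimal sketch of the whole pattern. The findtext() calls assume the exact Author/Content structure from the question, and the sibling-deletion loop is an extra step (from the effbot incremental-parsing recipe) so the root doesn't slowly accumulate empty elements:

```python
from lxml import etree

def extract_posts(path_to_file):
    # Yield (author, content) pairs without loading the whole tree.
    for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
        author = element.findtext("Author")
        content = element.findtext("Content")
        yield author, content
        element.clear()
        # Also delete already-processed siblings, otherwise the root
        # keeps references to an ever-growing list of empty elements.
        while element.getprevious() is not None:
            del element.getparent()[0]
```

getprevious() and getparent() are lxml-specific; the stdlib xml.etree.ElementTree does not provide them.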

[update:] To get "everything between ... as a string", I guess you want one of:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  print(etree.tostring(element))
  element.clear()

or

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  print(''.join([etree.tostring(child) for child in element]))
  element.clear()

or perhaps even:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  print(''.join([child.text for child in element]))
  element.clear()
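One caveat if you run the tostring() variants above on Python 3: etree.tostring() returns bytes by default, so the ''.join(...) version would fail. A sketch of a Python 3-friendly variant, passing encoding="unicode" to get str back directly:

```python
from lxml import etree

def blogpost_strings(path_to_file):
    # On Python 3, etree.tostring() returns bytes by default;
    # encoding="unicode" makes it return str instead.
    for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
        yield etree.tostring(element, encoding="unicode")
        element.clear()
```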
Haroun answered 24/3, 2012 at 22:53 Comment(5)
This works pretty much like I wanted. I'll have to customize it a bit, but it's great. Thanks! – Spectre
Is there a way to get everything between the starting and ending "BlogPost" tags as a string? – Spectre
@mvime, as what kind of string? In HTML format? Then see my comment above; the lxml.etree.tostring() method does that. You can cut the opening and closing tags off using slice notation (see this table). – Pelage
Should the element.close() be element.clear() in the later fragments? It's so long since I wrote this that I no longer remember, but it looks wrong to me. – Haroun
I also had to parse a 1.8 GB XML file, using the same clear() approach. But clear() does not actually remove the element from memory, and at the end you end up with a root full of empty elements, which takes memory too. So I deleted each element after parsing using a "del" statement, which helped free memory. Read effbot.org/zone/element-iterparse.htm#incremental-parsing to see exactly what happens. – Priesthood

For future searchers: The top answer here suggests clearing the element on each iteration, but that still leaves you with an ever-increasing set of empty elements that will slowly build up in memory:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()

^ This is not a scalable solution, especially as your source file gets larger and larger. The better solution is to get the root element, and clear that every time you load a complete record. This will keep memory usage pretty stable (sub-20MB I would say).

Here's a solution that doesn't require looking for a specific tag. This function will return a generator that yields all 1st child nodes (e.g. <BlogPost> elements) underneath the root node (e.g. <Database>). It does this by recording the start of the first tag after the root node, then waiting for the corresponding end tag, yielding the entire element, and then clearing the root node.

from lxml import etree

xmlfile = '/path/to/xml/file.xml'

def iterate_xml(xmlfile):
    doc = etree.iterparse(xmlfile, events=('start', 'end'))
    _, root = next(doc)
    start_tag = None
    for event, element in doc:
        if event == 'start' and start_tag is None:
            start_tag = element.tag
        if event == 'end' and element.tag == start_tag:
            yield element
            start_tag = None
            root.clear()
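A usage sketch, with the generator repeated so the snippet stands on its own; the Author/Content extraction assumes the structure from the question:

```python
from lxml import etree

def iterate_xml(xmlfile):
    # Yield each first-level child of the root, clearing the root afterwards.
    doc = etree.iterparse(xmlfile, events=('start', 'end'))
    _, root = next(doc)
    start_tag = None
    for event, element in doc:
        if event == 'start' and start_tag is None:
            start_tag = element.tag
        if event == 'end' and element.tag == start_tag:
            yield element
            start_tag = None
            root.clear()

# usage (extract each post's fields before requesting the next one,
# since the root is cleared when the generator resumes):
# for post in iterate_xml('/path/to/xml/file.xml'):
#     print(post.findtext('Author'), post.findtext('Content'))
```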
Forwent answered 12/2, 2017 at 22:22 Comment(1)
Well, I quite liked the idea. But what if I need to support multiple file structures; how could I do it without finding a specific tag? For example, say I have two types of XML file: in one the structure is source->jobs->job->..., in the other it's jobs->job. I want to fetch only the job elements. How do I do that with this solution? – Julesjuley
I prefer XPath for such things:

In [1]: from lxml.etree import parse

In [2]: tree = parse('/tmp/database.xml')

In [3]: for post in tree.xpath('/Database/BlogPost'):
   ...:     print('Author:', post.xpath('Author')[0].text)
   ...:     print('Content:', post.xpath('Content')[0].text)
   ...: 
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.

I'm not sure if it's different in terms of processing big files, though. Comments about this would be appreciated.

Doing it your way,

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for info in element.iter():
        if info.tag in ('Author', 'Content'):
            print(info.tag, ':', info.text)
Pelage answered 24/3, 2012 at 22:36 Comment(6)
Hmm, I've simplified the tree a little bit, and when I try it, it doesn't seem to work. The BlogPost tag, for example, is not simply '<BlogPost>' but '<BlogPost Owner="Author" Status="Draft">', and the values for Owner and Status change from one entry to the other. – Spectre
Additional attributes won't affect this; only the tree structure matters. To catch all the BlogPost elements, you can also use for post in tree.xpath('//BlogPost'): ... – Pelage
Thanks! I can't vote up yet, but you helped me understand how it works. The answer that I understand better, and have gotten to work, is Andrew's though. – Spectre
Thanks @andrew. You have mine, too, mostly for the clear() method that I didn't know of. – Pelage
I made a comparison recently, and iterparse with clear() consumes much less memory than plain XPath. – Pelage
XPath is very nice, but note that you had to read the entire tree in first with the call to parse(), which doesn't scale well for large files. I have a 3.5 GB XML file I'm working with, and parse() fails; the iterparse() approach still works. – Troll
