How do I parse this huge XML file with nested elements efficiently using lxml?

I tried parsing this huge XML document using xml.dom.minidom. While it worked fine on a sample file, it choked the system when trying to process the real file (about 400 MB).

I tried adapting code from Code Review (it processes the data in a streaming fashion rather than loading it all into memory at once) for my XML file, but I am having trouble isolating the datasets due to the nested nature of the elements. I have worked on simple XML files before, but not on a memory-intensive task like this.

Is this the right approach? How do I associate the Inventory and Publisher IDs with each book? That is how I am planning to relate the two tables eventually.

Any feedback is much appreciated.

book.xml

<BookDatabase>
    <BookHeader>
        <Name>BookData</Name>
        <BookUniverse>All</BookUniverse>
        <AsOfDate>2010-05-02</AsOfDate>
        <Version>1.1</Version>
    </BookHeader>

    <InventoryBody>
        <Inventory ID="12">
            <PublisherClass ID="34">
                <Publisher>
                    <PublisherDetails>
                        <Name>Microsoft Press</Name>
                        <Type>Tech</Type>
                        <ID>7462</ID>
                    </PublisherDetails>
                </Publisher>
            </PublisherClass>
            <BookList>
                <Listing>
                    <BookListSummary>
                        <Date>2009-01-30</Date>
                    </BookListSummary>
                    <Book>
                        <BookDetail ID="67">
                            <BookName>Code Complete 2</BookName>
                            <Author>Steve McConnell</Author>
                            <Pages>960</Pages>
                            <ISBN>0735619670</ISBN>
                        </BookDetail>
                        <BookDetail ID="78">
                            <BookName>Application Architecture Guide 2</BookName>
                            <Author>Microsoft Team</Author>
                            <Pages>496</Pages>
                            <ISBN>073562710X</ISBN>
                        </BookDetail>
                    </Book>
                </Listing>
            </BookList>
        </Inventory>
        <Inventory ID="64">
            <PublisherClass ID="154">
                <Publisher>
                    <PublisherDetails>
                        <Name>O'Reilly Media</Name>
                        <Type>Tech</Type>
                        <ID>7484</ID>
                    </PublisherDetails>
                </Publisher>
            </PublisherClass>
            <BookList>
                <Listing>
                    <BookListSummary>
                        <Date>2009-03-30</Date>
                    </BookListSummary>
                    <Book>
                        <BookDetail ID="98">
                            <BookName>Head First Design Patterns</BookName>
                            <Author>Kathy Sierra</Author>
                            <Pages>688</Pages>
                            <ISBN>0596007124</ISBN>
                        </BookDetail>
                    </Book>
                </Listing>
            </BookList>
        </Inventory>
    </InventoryBody>
</BookDatabase>

Python Code:

import sys
import os
#import MySQLdb
from lxml import etree

CATEGORIES = set(['BookHeader', 'Inventory', 'PublisherClass', 'PublisherDetails', 'BookDetail'])
SKIP_CATEGORIES = set(['BookHeader'])
DATA_ITEMS = ["Name", "Type", "ID", "BookName", "Author", "Pages", "ISBN"]

def clear_element(element):
    # Clear the element, then delete any already-processed siblings
    # that precede it, so memory use stays flat during iteration.
    element.clear()
    while element.getprevious() is not None:
        del element.getparent()[0]

def extract_book_elements(context):
    for event, element in context:
        if element.tag in CATEGORIES:
            yield element
            clear_element(element)

def fast_iter2(context):
    for bookCounter, element in enumerate(extract_book_elements(context)):
        books = [book.text for book in element.findall("BookDetail")]
        bookdetail = {
            'element': element.tag,
            'ID': element.get('ID')
        }
        for data_item in DATA_ITEMS:
            data = element.find(data_item)
            if data is not None:
                bookdetail[data_item] = data  # note: stores the Element itself, not its .text

        if bookdetail['element'] not in SKIP_CATEGORIES:
            #populate_database(bookdetail, books, cursor)
            print bookdetail, books

        print "========>", bookCounter, "<======="

def main():
    #cursor = connectToDatabase()
    #cursor.execute("""SET NAMES utf8""")

    context = etree.iterparse("book.xml", events=("start", "end"))
    #fast_iter(context, cursor)
    fast_iter2(context)

    #cursor.close()


if __name__ == '__main__':
    main()

Python Output:

$ python lxmletree_book.py 
========> 0 <=======
========> 1 <=======
{'ID': '12', 'element': 'Inventory'} []
========> 2 <=======
{'ID': '34', 'element': 'PublisherClass'} []
========> 3 <=======
{'Name': <Element Name at 0x105140af0>, 'Type': <Element Type at 0x105140b40>, 'ID': <Element ID at 0x105140b90>, 'element': 'PublisherDetails'} []
========> 4 <=======
{'ID': None, 'element': 'PublisherDetails'} []
========> 5 <=======
{'ID': None, 'element': 'PublisherClass'} []
========> 6 <=======
{'ISBN': <Element ISBN at 0x105140eb0>, 'Name': <Element Name at 0x105140dc0>, 'Author': <Element Author at 0x105140e10>, 'ID': '67', 'element': 'BookDetail', 'Pages': <Element Pages at 0x105140e60>} []
========> 7 <=======
{'ID': None, 'element': 'BookDetail'} []
========> 8 <=======
{'ISBN': <Element ISBN at 0x1051460a0>, 'Name': <Element Name at 0x105140f50>, 'Author': <Element Author at 0x105140fa0>, 'ID': '78', 'element': 'BookDetail', 'Pages': <Element Pages at 0x105146050>} []
========> 9 <=======
{'ID': None, 'element': 'BookDetail'} []
========> 10 <=======
{'ID': None, 'element': 'Inventory'} []
========> 11 <=======
{'ID': '64', 'element': 'Inventory'} []
========> 12 <=======
{'ID': '154', 'element': 'PublisherClass'} []
========> 13 <=======
{'Name': <Element Name at 0x105146230>, 'Type': <Element Type at 0x105146280>, 'ID': <Element ID at 0x1051462d0>, 'element': 'PublisherDetails'} []
========> 14 <=======
{'ID': None, 'element': 'PublisherDetails'} []
========> 15 <=======
{'ID': None, 'element': 'PublisherClass'} []
========> 16 <=======
{'ISBN': <Element ISBN at 0x1051465f0>, 'Name': <Element Name at 0x105146500>, 'Author': <Element Author at 0x105146550>, 'ID': '98', 'element': 'BookDetail', 'Pages': <Element Pages at 0x1051465a0>} []
========> 17 <=======
{'ID': None, 'element': 'BookDetail'} []
========> 18 <=======
{'ID': None, 'element': 'Inventory'} []
========> 19 <=======

Desired Output (eventually stored in MySQL; for now a list in Python):

Publishers
InventoryID PublisherClassID Name            Type ID
12          34               Microsoft Press Tech 7462
64          154              O'Reilly Media  Tech 7484

Books
PublisherID BookDetailID Name                              Author           Pages ISBN
7462        67           Code Complete 2                   Steve McConnell  960   0735619670
7462        78           Application Architecture Guide 2  Microsoft Team   496   073562710X
7484        98           Head First Design Patterns        Kathy Sierra     688   0596007124
Lynwoodlynx asked 11/9/2011 at 18:00. Comments (7):
You might want to try a stream parser, processing the xml document one element at a time, rather than loading the entire thing into memory and working with the DOM. docs.python.org/library/pyexpat.html – Mauriac
Yep, I dumped my DOM code and tried using iterparse instead. This does the task (memory-wise); I am trying to associate each dataset with the IDs for storing in MySQL. – Lynwoodlynx
Agree with goldsz... also, since you are limited in memory, you may want to store the relationships to disk. A temporary database should be sufficient. SQLite would probably be the quick solution (slower to run, though); see the sketch after these comments. – Surplice
Your logic is going to have to maintain more state. You probably want to break up the logic for each of your categories into a different function. Maintain an object that represents the last PublisherDetails encountered; then each BookDetail encountered is related to that publisher. – Mauriac
Thanks Yzmir, I will be storing the data in MySQL eventually. goldsz, if I create a blank list and populate it with the book and publisher data for each context, do you think that is a good idea? Working on a Sunday, trying to get this loaded into MySQL to generate a human-readable CSV tomorrow. Looking for a quick fix; I will eventually explore an efficient solution in the future. – Lynwoodlynx
@Lynwoodlynx you don't need to keep a list of everything. You just need to store the ID of the most recent PublisherDetails element. You should change extract_book_elements to do something different for the different categories. I believe you can also set up different events for the different tags. – Mauriac
I'm also using xpath expressions. Have a look at #11850845. – Infracostal
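
A minimal sketch of the temporary-SQLite idea from the comment above, with columns taken from the desired output tables above (the file name and column types are guesses):

import sqlite3

# Scratch database on disk, so the publisher/book relationships
# survive a long run without holding every row in Python memory.
conn = sqlite3.connect("book_scratch.db")
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS Publishers
               (InventoryID TEXT, PublisherClassID TEXT,
                Name TEXT, Type TEXT, ID TEXT)""")
cur.execute("""CREATE TABLE IF NOT EXISTS Books
               (PublisherID TEXT, BookDetailID TEXT, Name TEXT,
                Author TEXT, Pages INTEGER, ISBN TEXT)""")
# one row per parsed element, e.g. the first publisher in book.xml:
cur.execute("INSERT INTO Publishers VALUES (?, ?, ?, ?, ?)",
            ("12", "34", "Microsoft Press", "Tech", "7462"))
conn.commit()
conn.close()

The ID and ISBN columns are TEXT on purpose: values such as 073562710X are not purely numeric.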

You might try something like this:

import MySQLdb
from lxml import etree
import config

def fast_iter(context, func, args=[], kwargs={}):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly    
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def extract_paper_elements(element,cursor):
    pub={}
    pub['InventoryID']=element.attrib['ID']
    try:
        pub['PublisherClassID']=element.xpath('PublisherClass/@ID')[0]
    except IndexError:
        pub['PublisherClassID']=None
    for key in ('Name','Type','ID'):
        try:
            pub[key]=element.xpath(
                'PublisherClass/Publisher/PublisherDetails/{k}/text()'.format(k=key))[0]
        except IndexError:
            pub[key]=None
    sql='''INSERT INTO Publishers (InventoryID, PublisherClassID, Name, Type, ID)
           VALUES (%s, %s, %s, %s, %s)
        '''
    args=[pub.get(key) for key in
          ('InventoryID', 'PublisherClassID', 'Name', 'Type', 'ID')]
    print(args)
    # cursor.execute(sql,args)
    for bookdetail in element.xpath('descendant::BookList/Listing/Book/BookDetail'):
        pub['BookDetailID']=bookdetail.attrib['ID']
        for key in ('BookName', 'Author', 'Pages', 'ISBN'):
            try:
                pub[key]=bookdetail.xpath('{k}/text()'.format(k=key))[0]
            except IndexError:
                pub[key]=None
        sql='''INSERT INTO Books
               (PublisherID, BookDetailID, Name, Author, Pages, ISBN)
               VALUES (%s, %s, %s, %s, %s, %s)
            '''           
        args=[pub.get(key) for key in
              ('ID', 'BookDetailID', 'BookName', 'Author', 'Pages', 'ISBN')]
        # cursor.execute(sql,args)
        print(args)


def main():
    context = etree.iterparse("book.xml", events=("end",), tag='Inventory')
    connection=MySQLdb.connect(
        host=config.HOST,user=config.USER,
        passwd=config.PASS,db=config.MYDB)
    cursor=connection.cursor()

    fast_iter(context,extract_paper_elements,args=(cursor,))

    cursor.close()
    connection.commit()
    connection.close()

if __name__ == '__main__':
    main()
  1. Don't use fast_iter2. The original fast_iter separates the reusable utility from the specific processing function (extract_paper_elements); fast_iter2 mixes the two together, leaving you with no reusable code.
  2. If you set the tag parameter in etree.iterparse("book.xml", events=("end",), tag='Inventory'), then your processing function extract_paper_elements will only ever see Inventory elements (a short demonstration follows this list).
  3. Given an Inventory element, you can use the xpath method to burrow down and scrape the desired data.
  4. The args and kwargs parameters were added to fast_iter so cursor can be passed to extract_paper_elements.
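
A short, self-contained illustration of point 2, run against the sample book.xml above; each pass through the loop receives one complete Inventory subtree:

from lxml import etree

context = etree.iterparse("book.xml", events=("end",), tag="Inventory")
for event, inventory in context:
    # one complete Inventory subtree per iteration
    print([inventory.get("ID"),
           inventory.xpath("PublisherClass/Publisher/PublisherDetails/ID/text()")])
    # release the subtree and any already-processed siblings
    inventory.clear()
    while inventory.getprevious() is not None:
        del inventory.getparent()[0]

Against the sample file this prints ['12', ['7462']] and then ['64', ['7484']].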
Bacchius answered 11/9/2011 at 19:32. Comments (22):
Thanks for getting me back on fast_iter. Limiting to the Inventory tag makes so much sense too! Will let you know how it goes shortly... – Lynwoodlynx
Awesome, it works great! Thank you so much for the explanation as well; I am sure newbies like me will benefit. – Lynwoodlynx
Oops, looks like my happiness is short-lived :( I got a segmentation fault. Stack trace @ pastebin.com/UyKuRdg4 It happened on both my Mac and CentOS machines. Python 2.7, help! – Lynwoodlynx
@Lynwoodlynx: Does the problem happen near the beginning of the run, or sometime in the middle? I couldn't identify the problem from the stack trace. Do you think it is a memory issue? – Bacchius
It happens after it processes 2 inventories (each with about 100 books). Tried it on a different machine, same problem. – Lynwoodlynx
@Lynwoodlynx: Can you post enough of the XML at pastebin.com so I might try to reproduce the segfault? – Bacchius
Is it possible that a segmentation fault may be caused if some of the elements have some tags missing? – Lynwoodlynx
@Lynwoodlynx: Okay, I downloaded the XML. – Bacchius
@Lynwoodlynx: You are right -- it's caused by elements having tags missing. It is fixable by changing pub[key] to pub.get(key); the latter returns None when key is not in pub. Also, remove .strip() from elt.text.strip(). It was not necessary, and it interferes when elt.text is None. I've edited my post as well. (Btw, I never got a segmentation error, only KeyErrors.) – Bacchius
I made the changes and I still get the same error on my Mac. Were you able to print the whole 2.7 MB file after these changes? Thank you! – Lynwoodlynx
@Lynwoodlynx: Yes, I was able to process the entire 13F.xml file. Did you change pub[key] --> pub.get(key) in both places? – Bacchius
Yep, I changed it in both places. Both my Mac (Python 2.7) and CentOS (Python 2.6.6) machines were giving the same error message at the same place (after the second element). – Lynwoodlynx
let us continue this discussion in chat – Bacchius
@Lynwoodlynx: Perhaps we can isolate the source of the problem by "bisecting" the code. By that I mean, try commenting out the second loop: for bookdetail in element.xpath.... If you still get a segfault, continue commenting out more until the segfault disappears. The last thing commented out might be the problem. If, on the other hand, commenting out the second loop results in no segfault, then try uncommenting lines of the second loop until you get a segfault again. This might give us a clue as to where the problem lies. – Bacchius
Interestingly enough, I get a segmentation fault when I comment out either of the for loops. I really wonder why it's only my machines?! I will try it out on Ubuntu and let you know. – Lynwoodlynx
I tried my Ubuntu VM and I get the segmentation fault after the 3rd element (set) instead of the second. It sure is mysterious! – Lynwoodlynx
I tried commenting out various lines and I traced it back to these lines: for elt in element.xpath('PublisherClass/Publisher/PublisherDetails/child::*'): pub[elt.tag] = elt.text – Lynwoodlynx
@Lynwoodlynx: I've rewritten the loops to use more explicit XPaths (without child::*). Perhaps try that. – Bacchius
I suspect the problem was with the 2.7 MB sample file? I tried running the updated code on the full 389 MB file and I had issues with PublisherClassID (list index out of range); I commented it out and it ran fine. This is on Ubuntu, btw, and Mac is still failing. I suspect lxml version 2.3 on my Mac (Python 2.7)? I will continue exploring! – Lynwoodlynx
Some bookdetail elements stop fetching data and print this data: [7462, 67, None, None, None, None, None]. More debugging, I guess! – Lynwoodlynx
let us continue this discussion in chat – Bacchius
I added some more observations in the chat. – Lynwoodlynx

So if you want something quick and dirty, you simply need to remember the last publisher parsed and assign it to each BookDetail until the next PublisherDetails is found. Add something like this to your fast_iter2 function after you extract the bookdetail:

        # Remember the most recent publisher; every BookDetail that
        # follows belongs to it until the next PublisherDetails arrives.
        if bookdetail['element'] == "PublisherDetails" and bookdetail['ID'] is not None:
            CurrentPublisher = bookdetail
        if bookdetail['element'] == "BookDetail":
            # fast_iter2 stores the <ID> child Element here, hence .text
            bookdetail['PublisherID'] = CurrentPublisher['ID'].text
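
One caveat: CurrentPublisher should be initialized (to None, say) before the loop, or a BookDetail that arrives before any PublisherDetails will raise a NameError.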
Mauriac answered 11/9/2011 at 19:31. Comments (1):
Thanks goldsz, was working on this solution when unutbu led me to fast_iter rather than fast_iter2. Thanks a lot! – Lynwoodlynx
