Is there a fast XML parser in Python that allows me to get start of tag as byte offset in stream?
I am working with potentially huge XML files containing complex trace information from one of my projects.

I would like to build indexes for those XML files so that one can quickly find sub sections of the XML document without having to load it all into memory.

If I have created a "shelve" index that contains information like "books for author Joe are at offsets [22322, 35446, 54545]", then I can just open the XML file like a regular text file, seek to those offsets, and hand that data to one of the DOM parsers that take a file or string.

The part that I have not figured out yet is how to quickly parse the XML and create such an index.

So what I need is a fast SAX parser that reports the start offset of each tag along with the start events. Then I can parse a subsection of the XML, knowing its starting point in the document, extract the key information, and store the key and offset in the shelve index.
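For the lookup side, a minimal sketch of what "seek and parse a subsection" could look like (the `read_fragment` helper, the `<book>` element, and the index layout are all made up for illustration; it assumes each stored offset points at the start of a well-formed, non-nested element):

```python
import shelve
import xml.etree.ElementTree as ET

def read_fragment(path, offset, end_tag=b"</book>"):
    """Read bytes from `offset` up to and including the closing tag."""
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = b""
        while end_tag not in chunk:
            data = f.read(4096)
            if not data:  # hit EOF without finding the closing tag
                break
            chunk += data
        return chunk[:chunk.index(end_tag) + len(end_tag)]

# Hypothetical usage against a previously built shelve index:
# with shelve.open("books.idx") as index:
#     for off in index["author:Joe"]:
#         elem = ET.fromstring(read_fragment("books.xml", off))
```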

Crim answered 6/7, 2010 at 15:58 Comment(0)
Since SAX locators return line and column numbers in lieu of byte offsets, you need a little wrapping to track line ends -- a simplified example (could have some offbyones;-)...:

import io
import re
from xml import sax
from xml.sax import handler

relinend = re.compile(r'\n')

txt = '''<foo>
            <tit>Bar</tit>
        <baz>whatever</baz>
     </foo>'''
stm = io.StringIO(txt)

class LocatingWrapper(object):
    def __init__(self, f):
        self.f = f
        self.linelocs = []  # offset of the newline ending each line
        self.curoffs = 0

    def read(self, *a):
        data = self.f.read(*a)
        linends = (m.start() for m in relinend.finditer(data))
        self.linelocs.extend(x + self.curoffs for x in linends)
        self.curoffs += len(data)
        return data

    def where(self, loc):
        line = loc.getLineNumber()   # 1-based
        col = loc.getColumnNumber()  # 0-based
        if line == 1:  # line 1 starts at offset 0
            return col
        # line n (n > 1) starts one past the newline that ended line n - 1
        return self.linelocs[line - 2] + 1 + col

locstm = LocatingWrapper(stm)

class Handler(handler.ContentHandler):
    def setDocumentLocator(self, loc):
        self.loc = loc

    def startElement(self, name, attrs):
        print('%s@%s:%s (%s)' % (name,
                                 self.loc.getLineNumber(),
                                 self.loc.getColumnNumber(),
                                 locstm.where(self.loc)))

sax.parse(locstm, Handler())

Of course you don't need to keep all of the linelocs around -- to save memory, you can drop "old" ones (below the latest one queried) but then you need to make linelocs a dict, etc.
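The dict variant could look something like this (a hypothetical sketch, not part of the answer above; `PruningLocator` and its method names are invented, and it assumes queries arrive in non-decreasing line order):

```python
class PruningLocator:
    """Track newline offsets by line number, discarding stale entries."""
    def __init__(self):
        self.linelocs = {}  # line number -> offset of the '\n' ending it

    def note_lineend(self, lineno, offset):
        self.linelocs[lineno] = offset

    def where(self, lineno, col):
        # line 1 starts at offset 0; line n starts one past the
        # newline that ended line n - 1
        off = col if lineno == 1 else self.linelocs[lineno - 1] + 1 + col
        # if queries only move forward, earlier entries are never
        # needed again, so drop them to bound memory
        for old in [k for k in self.linelocs if k < lineno - 1]:
            del self.linelocs[old]
        return off
```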

Darwin answered 6/7, 2010 at 16:30 Comment(5)
Thanks. I got my XML indexer working with this code. I am holding off on accepting to see if there are any answers that use a faster parser. Let me know if you would like to see it as an addition to the cookbook.Crim
@James, your Q explicitly said you wanted to use SAX, so I'm confused that you're now looking for other parsers within the same question (?). As for the Cookbook, thanks for offering, but I'm not currently maintaining the future edition (actually I don't know who is... if anybody... besides the online stuff at activestate of course, which isn't really gatewayed by anybody in particular and never has been).Darwin
When I said SAX, I just meant a parser that does not load the whole document into memory. By fast I meant a parser with a speed similar to cElementTree.iterparse: effbot.org/zone/celementtree.htm. I am working with large documents, so if I can index them 4 times faster, that would be worth it.Crim
@James, if you want "any fast incremental parser", I suggest not saying "SAX" (which more or less respects a standard == overhead;-). Anyway, you can sure use docs.python.org/library/… ... but there's no locator concept in etree (AFAIK). docs.python.org/library/pyexpat.html does offer a current byte index, docs.python.org/library/… , so you could try that (if you hadn't said "SAX", I'd have suggested that first!-).Darwin
'<c><b><a/><d/><e\r\n/></b></c>'.encode('utf-16')Disunity
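The pyexpat route suggested in the comments avoids the line/column bookkeeping entirely: the parser's `CurrentByteIndex` attribute gives the byte offset of the token that triggered the current event, so at a start-element event it points at the `<`. A minimal sketch (the `offsets` dict is just an example of an index to build; note the index is in bytes, so feed the parser bytes, which also sidesteps the multi-byte-encoding gotcha in the comment above):

```python
import xml.parsers.expat

offsets = {}  # tag name -> byte offsets of its start tags

def start(name, attrs):
    offsets.setdefault(name, []).append(parser.CurrentByteIndex)

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start
parser.Parse(b'<foo><tit>Bar</tit><baz>whatever</baz></foo>', True)
```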

© 2022 - 2024 — McMap. All rights reserved.