xml.sax parser and line numbers etc

The task is to parse a simple XML document, and analyze the contents by line number.

The right Python package seems to be xml.sax. But how do I use it?

After some digging in the documentation, I found:

  • The xmlreader.Locator interface has the information: getLineNumber().
  • The handler.ContentHandler interface has setDocumentLocator().

My first thought was to create a Locator, pass it to the ContentHandler, and read the information off the Locator during calls to the handler's characters() method, etc.

BUT xmlreader.Locator is only a skeleton interface: its getLineNumber() and getColumnNumber() methods can only return -1. So as a poor user, WHAT am I to do, short of writing a whole Parser and Locator of my own?
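
To make the problem concrete, here is a minimal sketch of what happens when the base Locator is instantiated directly. The variable names are only for illustration:

```python
from xml.sax.xmlreader import Locator

# The base class is only an interface: it carries no position information.
bare = Locator()
line = bare.getLineNumber()
column = bare.getColumnNumber()
print(line, column)
```

Both values come back as -1, so a hand-made Locator of this kind is of no use for finding line numbers.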

I'll answer my own question presently.


(Well I would have, except for the arbitrary, annoying rule that says I can't.)


I was unable to figure this out using the existing documentation (or by web searches), and was forced to read the source code for xml.sax (under /usr/lib/python2.7/xml/sax/ on my system).

The xml.sax function make_parser() by default creates a real Parser, but what kind of thing is that?
In the source code one finds that it is an ExpatParser, defined in expatreader.py. And...it has its own Locator, an ExpatLocator. But, there is no access to this thing. Much head-scratching came between this and a solution.
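
A quick way to check this without reading the source is to ask the parser for its type. This is only a sketch, and it assumes a standard CPython build where the expat driver is available; other builds may return a different parser class:

```python
import xml.sax
import xml.sax.expatreader

# make_parser() picks the first available driver; on a standard CPython
# build that is assumed to be the expat-based one.
parser = xml.sax.make_parser()
parser_class = type(parser).__name__
print(parser_class)
```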

  1. write your own ContentHandler, which knows about a Locator, and uses it to determine line numbers
  2. create an ExpatParser with xml.sax.make_parser()
  3. create an ExpatLocator, passing it the ExpatParser instance.
  4. make the ContentHandler, giving it this ExpatLocator
  5. pass the ContentHandler to the parser's setContentHandler()
  6. call parse() on the Parser.

For example:

from __future__ import print_function  # same print syntax on Python 2.7 and 3

import sys
import xml.sax
import xml.sax.expatreader  # imported explicitly rather than relying on make_parser()
import xml.sax.handler

class EltHandler( xml.sax.handler.ContentHandler ):
    def __init__( self, locator ):
        xml.sax.handler.ContentHandler.__init__( self )
        # Keep our own reference to the locator we were given.
        self.loc = locator
        self.setDocumentLocator( self.loc )

    def startElement( self, name, attrs ): pass

    def endElement( self, name ): pass

    def characters( self, data ):
        # The locator reports the parser's current position in the input.
        lineNo = self.loc.getLineNumber()
        print( "LINE", lineNo, data )

def spit_lines( filepath ):
    try:
        parser = xml.sax.make_parser()
        # The ExpatLocator reads the position from the parser it wraps.
        locator = xml.sax.expatreader.ExpatLocator( parser )
        handler = EltHandler( locator )
        parser.setContentHandler( handler )
        parser.parse( filepath )
    except IOError as e:
        print( e, file=sys.stderr )

if len( sys.argv ) > 1:
    filepath = sys.argv[1]
    spit_lines( filepath )
else:
    print( "Try providing a path to an XML file.", file=sys.stderr )

Martijn Pieters points out below another approach with some advantages. If the superclass initializer of the ContentHandler is properly called, then it turns out a private-looking, undocumented member ._locator is set, which ought to contain a proper Locator.

Advantage: you don't have to create your own Locator (or find out how to create it). Disadvantage: it is not documented anywhere, and using an undocumented private variable is sloppy.

Thanks Martijn!

Greenery answered 18/3, 2013 at 12:55 Comment(0)

The SAX parser itself is supposed to provide your content handler with a locator. The locator has to implement certain methods, but it can be any object as long as it has the right methods. The xml.sax.xmlreader.Locator class is the interface a locator is expected to implement; if the parser provided a locator object to your handler, then you can count on its four methods (getLineNumber(), getColumnNumber(), getPublicId() and getSystemId()) being present on the locator.

The parser is only encouraged to set a locator, it is not required to do so. The expat XML parser does provide it.
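
As a rough check of that behaviour, the sketch below records the order in which the parser calls the handler. OrderProbe and its attributes are made-up names for this illustration, and the result assumes the default expat-based parser:

```python
import xml.sax
from xml.sax.handler import ContentHandler

class OrderProbe(ContentHandler):  # hypothetical name, for illustration only
    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []
        self.locator_type = None

    def setDocumentLocator(self, locator):
        # Called by the parser, if it supplies a locator at all.
        self.events.append("setDocumentLocator")
        self.locator_type = type(locator).__name__

    def startDocument(self):
        self.events.append("startDocument")

probe = OrderProbe()
xml.sax.parseString(b"<root/>", probe)
print(probe.events, probe.locator_type)
```

With the expat parser, setDocumentLocator() is recorded before startDocument(), and the object passed in is an ExpatLocator.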

If you subclass xml.sax.handler.ContentHandler then it'll provide a standard setDocumentLocator() method for you, and by the time .startDocument() on the handler is called your content handler instance will have self._locator set:

from xml.sax.handler import ContentHandler

class MyContentHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        # initialize your handler

    def startElement(self, name, attrs):
        loc = self._locator
        if loc is not None:
            line, col = loc.getLineNumber(), loc.getColumnNumber()
        else:
            line, col = 'unknown', 'unknown'
        print('start of {} element at line {}, column {}'.format(name, line, col))
Herwick answered 18/3, 2013 at 13:16 Comment(14)
Hi Martijn, where does this self._locator come from? That is precisely the problem. - Greenery
@SteveWhite: The xml.sax.handler.ContentHandler base class sets self._locator when setDocumentLocator() is called by the parser. You can also implement your own handler.setDocumentLocator(locator) method of course, but why have a dog and bark yourself? - Herwick
Martijn, this _locator dog is documented where? And how do I get access to this member from the "MyContentHandler" subclass? I try it, and of course it says AttributeError: MyContentHandler instance has no attribute '_locator' - Greenery
@SteveWhite: The source code is the only documentation. What Python version are you running this on? If you subclass xml.sax.handler.ContentHandler like I did in my code and make sure that the parent __init__ method is still invoked if you provide your own, then self._locator should be set. - Herwick
@SteveWhite: A quick trawl through the mercurial repository shows that the ContentHandler base class has had a self._locator attribute forever, so I guess you either do not subclass it or you overrode the __init__ without calling the parent. :-) - Herwick
I am running Python 2.7. I see your point now (since your edit). The superclass constructor defines the _locator, which, as you said, is set by the parser. It is plain to me that we are experiencing a point-of-view problem here. For the record: the user should not be expected to read the library source code to determine how to use the library. And members marked with a leading underscore are intended to be private. What is worse? Using an undocumented, private variable in the source code, or using a barely-documented public user interface? I vote for the latter. - Greenery
@SteveWhite: We are exploring some of the more 'crufty' corners of the standard library here; most people nowadays use the ElementTree API when it comes to parsing XML, but that doesn't give access to line numbers while parsing. The public interface is documented, and as I stated you can implement your own .setDocumentLocator() method. The self._locator attribute is a convenience, and should have been documented properly. _locator is private only to the class; a subclass is certainly free to use it. - Herwick
What would speak against a getDocumentLocator()? - Greenery
@SteveWhite: Nothing, you'd be perfectly within your rights to implement your own version. The methods of the ContentHandler base class are generally expected to be overridden; they are only there to make sure nothing breaks if you do not implement them yourself. - Herwick
Martijn, you misunderstood me. I view the documentation and/or interface as broken. I meant, what would speak against a getDocumentLocator() in the standard interface (effectively returning the private ._locator)? This (and the advice to call the superclass constructor) would have solved the problem for me. - Greenery
Sorry, misunderstood indeed. The interface documents what the parser expects there to be. The parser has no need for a getDocumentLocator() method. - Herwick
Right. Again, it's a point-of-view issue. I'm not a parser writer, I am a guy wanting to analyze an XML document. The interface and documentation could have and should have led quickly to a solution. It failed in this. So the question is, how best to fix it? - Greenery
@SteveWhite: Stop worrying about it and use my pointers? The documentation about the setDocumentLocator() method has been there all along; all I pointed out was that the default base class already implements it and sets self._locator if called. - Herwick
@SteveWhite: The SAX interfaces are otherwise seen as outdated and archaic these days, and effort goes to the ElementTree API instead, so I don't see anyone expending much effort on improving the documentation situation around the SAX module. The module otherwise follows a standard API also implemented in other languages, so perhaps it assumes that you are already familiar with the API. - Herwick

This is an old question, but I think that there is a better answer to it than the one given, so I'm going to add another answer anyway.

While there may indeed be an undocumented private data member named _locator in the ContentHandler superclass, as described in the above answer by Martijn, accessing location information using this data member does not appear to me to be the intended use of the location facilities.

In my opinion, Steve White raises good questions about why this member is not documented. I think the answer to those questions is that it was probably not intended to be for public use. It appears to be a private implementation detail of the ContentHandler superclass. Since it is an undocumented private implementation detail, it could disappear without warning with any future release of the SAX library, so relying on it could be dangerous.

It appears to me, from reading the documentation for the ContentHandler class, and specifically the documentation for ContentHandler.setDocumentLocator, that the designers intended for users to instead override the ContentHandler.setDocumentLocator function so that when the parser calls it, the user's content handler subclass can save a reference to the passed-in locator object (which was created by the SAX parser), and can later use that saved object to get location information. For example:

from xml.sax.handler import ContentHandler

class MyContentHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self._mylocator = None
        # initialize your handler

    def setDocumentLocator(self, locator):
        # Called by the parser before startDocument(); keep our own reference.
        self._mylocator = locator

    def startElement(self, name, attrs):
        loc = self._mylocator
        if loc is not None:
            line, col = loc.getLineNumber(), loc.getColumnNumber()
        else:
            line, col = 'unknown', 'unknown'
        print('start of {} element at line {}, column {}'.format(name, line, col))

With this approach, there is no need to rely on undocumented fields.
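
For completeness, here is a small self-contained sketch of the same idea run against an in-memory document (Python 3). LineRecorder, SAMPLE and the positions list are names invented for this example, and it assumes the default expat-based parser, which does supply a locator:

```python
import xml.sax
from xml.sax.handler import ContentHandler

class LineRecorder(ContentHandler):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self._mylocator = None
        self.positions = []

    def setDocumentLocator(self, locator):
        # Save the locator that the parser hands over.
        self._mylocator = locator

    def startElement(self, name, attrs):
        # Record the line on which each start tag begins (None if no locator).
        loc = self._mylocator
        line = loc.getLineNumber() if loc is not None else None
        self.positions.append((name, line))

# A tiny made-up document: <root> opens on line 1, <child> on line 2.
SAMPLE = b"<root>\n  <child>text</child>\n</root>\n"

recorder = LineRecorder()
xml.sax.parseString(SAMPLE, recorder)
print(recorder.positions)
```

The recorded line numbers are 1-based, so the root element is reported on line 1 and the child element on line 2.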

Stench answered 1/10, 2018 at 12:23 Comment(0)
