python reporting line/column of origin of XML node

Asked 25/1, 2011 at 1:40 Answered 8/12, 2014 at 11:25

I'm currently using xml.dom.minidom to parse some XML in python. After parsing, I'm doing some reporting on the content, and would like to report the line (and column) where the tag started in the source XML document, but I don't see how that's possible.

I'd like to stick with xml.dom / xml.dom.minidom if possible, but if I need to use a SAX parser to get the origin info, I can do that -- ideal in that case would be using SAX to track node location, but still end up with a DOM for my post-processing.

Any suggestions on how to do this? Hopefully I'm just overlooking something in the docs and this is extremely easy.

Vowel answered 25/1, 2011 at 1:40 Comment(2)

xmlparser from xml.parsers.expat supports line/column numbers. docs.python.org/library/pyexpat.html – Expressly 25/1, 2011 at 3:35

lxml.etree supports line numbers. codespeak.net/lxml – Expressly 25/1, 2011 at 3:46
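For reference, the expat suggestion from the comments can be sketched roughly as follows. This is only a minimal illustration, not a DOM builder: `element_positions` is a made-up helper name, and it simply collects the line and column that expat reports while handling each start tag (lines count from 1, columns from 0).

```python
import xml.parsers.expat


def element_positions(xml_text):
    """Return (tag, line, column) for every start tag, in document order.

    Hypothetical helper for illustration; it does not build a DOM.
    """
    positions = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        # During a callback, these attributes describe where the
        # current event (here: the start tag) began.
        positions.append(
            (name, parser.CurrentLineNumber, parser.CurrentColumnNumber))

    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)  # True: this is the final chunk of input
    return positions


sample = "<File>\n  <name>Name</name>\n  <pos>./</pos>\n</File>\n"
for tag, line, col in element_positions(sample):
    print("{0} at {1}:{2}".format(tag, line, col))
```

This only gets you the positions; you would still have to match them up with DOM nodes yourself.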

By monkeypatching the minidom content handler I was able to record line and column number for each node (as the 'parse_position' attribute). It's a little dirty, but I couldn't see any "officially sanctioned" way of doing it :) Here's my test script:

from xml.dom import minidom
import xml.sax

doc = """\
<File>
  <name>Name</name>
  <pos>./</pos>
</File>
"""


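def set_content_handler(dom_handler):
    # Wrap the DOM builder's startElementNS so each new element records
    # where the parser currently is.
    def startElementNS(name, tagName, attrs):
        orig_start_cb(name, tagName, attrs)
        # The element that was just created is on top of the builder's stack.
        cur_elem = dom_handler.elementStack[-1]
        # parser._parser is the underlying expat parser (a private attribute).
        cur_elem.parse_position = (
            parser._parser.CurrentLineNumber,
            parser._parser.CurrentColumnNumber
        )

    orig_start_cb = dom_handler.startElementNS
    dom_handler.startElementNS = startElementNS
    orig_set_content_handler(dom_handler)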

parser = xml.sax.make_parser()
orig_set_content_handler = parser.setContentHandler
parser.setContentHandler = set_content_handler

dom = minidom.parseString(doc, parser)
pos = dom.firstChild.parse_position
print("Parent: '{0}' at {1}:{2}".format(
    dom.firstChild.localName, pos[0], pos[1]))
for child in dom.firstChild.childNodes:
    if child.localName is None:
        continue
    pos = child.parse_position
    print("Child: '{0}' at {1}:{2}".format(child.localName, pos[0], pos[1]))

It outputs the following:

Parent: 'File' at 1:0
Child: 'name' at 2:2
Child: 'pos' at 3:2
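If reaching into parser._parser feels too fragile, one alternative is to build the tree yourself from SAX events and ask the SAX locator for the position. The following is only a sketch under some assumptions: `PositionedDOMBuilder` and `parse_with_positions` are hypothetical names, the tree is simplified (no namespaces, comments or processing instructions), and the positions are whatever the default expat-based reader reports through its locator.

```python
import xml.sax
from xml.dom import minidom


class PositionedDOMBuilder(xml.sax.ContentHandler):
    """Hypothetical helper: builds a simplified minidom tree and stores
    the (line, column) of each start tag as 'parse_position'."""

    def __init__(self):
        super().__init__()
        self.document = minidom.Document()
        self.stack = [self.document]  # open nodes, innermost last
        self.locator = None

    def setDocumentLocator(self, locator):
        # Called by the parser before parsing starts.
        self.locator = locator

    def startElement(self, name, attrs):
        elem = self.document.createElement(name)
        for key, value in attrs.items():
            elem.setAttribute(key, value)
        elem.parse_position = (
            self.locator.getLineNumber(),
            self.locator.getColumnNumber(),
        )
        self.stack[-1].appendChild(elem)
        self.stack.append(elem)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        # Only keep text that sits inside an element.
        if len(self.stack) > 1:
            self.stack[-1].appendChild(self.document.createTextNode(content))


def parse_with_positions(xml_text):
    builder = PositionedDOMBuilder()
    # Assumes the text has no conflicting encoding declaration.
    xml.sax.parseString(xml_text.encode("utf-8"), builder)
    return builder.document


sample = "<File>\n  <name>Name</name>\n  <pos>./</pos>\n</File>\n"
root = parse_with_positions(sample).documentElement
print(root.tagName, root.parse_position)
```

The trade-off is that you give up minidom's own builder, so anything it handles that this sketch skips would need to be added by hand.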

Scherzo answered 27/2, 2011 at 12:22 Comment(0)

A different way to hack around the problem is by patching line number information into the document before parsing it. Here's the idea:

import re
from xml.dom import minidom

LINE_DUMMY_ATTR = '_DUMMY_LINE'  # Make sure this string is unique!

def parseXml(filename):
  content = []
  with open(filename, 'r') as f:
    # Add an attribute holding the current line number to every opening tag.
    for l, line in enumerate(f, start=1):
      content.append(re.sub(r'<(\w+)', r'<\1 ' + LINE_DUMMY_ATTR + '="' + str(l) + '"', line))

  return minidom.parseString("".join(content))

Then you can retrieve the line number of an element with

int(element.getAttribute(LINE_DUMMY_ATTR))

Quite clearly, this approach has its own drawbacks, and if you really need column numbers too, patching those in will be somewhat more involved. Also, if you want to extract text nodes or comments, or use Node.toxml(), you'll have to make sure to strip out LINE_DUMMY_ATTR from any accidental matches there.

The one advantage of this solution over aknuds1's answer is that it does not require messing with minidom internals.
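To make the idea a bit more concrete, here is a rough end-to-end sketch that works on a string instead of a file and removes the helper attribute again before serializing. The names `LINE_ATTR`, `parse_with_line_numbers` and `strip_line_numbers` are made up for this example, and it only cleans up element attributes; accidental matches inside comments or CDATA sections are not handled.

```python
import re
from xml.dom import minidom

LINE_ATTR = "_DUMMY_LINE"  # hypothetical marker; must not clash with real attributes


def parse_with_line_numbers(xml_text):
    """Inject the line number into every opening tag, then parse."""
    patched = []
    for number, line in enumerate(xml_text.splitlines(keepends=True), start=1):
        patched.append(re.sub(
            r"<\w+",
            lambda m: '{0} {1}="{2}"'.format(m.group(0), LINE_ATTR, number),
            line))
    return minidom.parseString("".join(patched))


def strip_line_numbers(node):
    """Recursively remove the helper attribute from all elements."""
    if node.nodeType == node.ELEMENT_NODE and node.hasAttribute(LINE_ATTR):
        node.removeAttribute(LINE_ATTR)
    for child in node.childNodes:
        strip_line_numbers(child)


sample = "<File>\n  <name>Name</name>\n  <pos>./</pos>\n</File>\n"
dom = parse_with_line_numbers(sample)
for elem in dom.getElementsByTagName("*"):
    print(elem.tagName, int(elem.getAttribute(LINE_ATTR)))

strip_line_numbers(dom)
print(dom.documentElement.toxml())
```

Calling strip_line_numbers before toxml() keeps the marker out of any output you write back to disk.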

Conklin answered 8/12, 2014 at 11:25 Comment(0)
