Is there a way to get a line number from an ElementTree Element
O

4

27

I'm parsing some XML files with cElementTree on Python 3.2.1, and during parsing I noticed that some of the tags were missing attribute information. Is there an easy way to get the line numbers of those Elements in the XML file?

Ockeghem answered 4/8, 2011 at 22:29 Comment(0)
I
20

Looking at the docs, I see no way to do this with cElementTree.

However, I've had luck with lxml's implementation of the XML API. It's supposed to be almost a drop-in replacement, built on libxml2, and its elements have a sourceline attribute (along with a lot of other XML features).

The only caveat is that I've only used it in Python 2.x. I'm not sure how/if it works under 3.x, but it might be worth a look.

Addendum: their front page says:

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.3 to 3.2. See the introduction for more information about background and goals of the lxml project. Some common questions are answered in the FAQ.

So it looks like Python 3.x is OK.
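For illustration, here is a minimal sketch of the sourceline approach, assuming lxml is installed. The sample document and the check for a missing id attribute are made up for this example.

```python
from io import BytesIO

from lxml import etree

# Hypothetical sample document; the second <item> lacks the "id" attribute.
SAMPLE = b"""<root>
  <item id="1"/>
  <item/>
</root>"""

tree = etree.parse(BytesIO(SAMPLE))

# Collect (tag, line) for every <item> element without an "id" attribute.
missing = [
    (element.tag, element.sourceline)
    for element in tree.iter("item")
    if "id" not in element.attrib
]

for tag, line in missing:
    print(f"<{tag}> on line {line} has no id attribute")
```

sourceline is filled in by the parser, so it is only meaningful for elements that came from parsed input, not for ones created programmatically.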

Imeldaimelida answered 5/8, 2011 at 3:1 Comment(1)
Works great, almost a 1:1 drop-in. The only difference I've found so far is the exceptions. Ockeghem
T
23

Took a while for me to work out how to do this using Python 3.x (using 3.3.2 here) so thought I would summarize:

import sys

# Force the pure-Python XML parser rather than the faster C accelerator,
# because we can't hook the C implementation. This must run before
# xml.etree.ElementTree is imported anywhere in the program.
sys.modules['_elementtree'] = None
import xml.etree.ElementTree as ET

class LineNumberingParser(ET.XMLParser):
    # This private hook is named _start on current Python 3; some older
    # releases used _start_list instead (see the comments below).
    def _start(self, *args, **kwargs):
        # Here we assume the default XML parser, which is expat,
        # and copy its element position attributes into output Elements
        element = super()._start(*args, **kwargs)
        element._start_line_number = self.parser.CurrentLineNumber
        element._start_column_number = self.parser.CurrentColumnNumber
        element._start_byte_index = self.parser.CurrentByteIndex
        return element

    def _end(self, *args, **kwargs):
        element = super()._end(*args, **kwargs)
        element._end_line_number = self.parser.CurrentLineNumber
        element._end_column_number = self.parser.CurrentColumnNumber
        element._end_byte_index = self.parser.CurrentByteIndex
        return element

tree = ET.parse(filename, parser=LineNumberingParser())
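
To show how the recorded positions can be read back, here is a self-contained sketch along the same lines. It assumes a recent Python 3 in which the pure-Python XMLParser calls a private _start hook, reloads the module in case it was already imported with the C accelerator, and parses a made-up sample string instead of a file. Because it relies on private internals, it may break between Python versions.

```python
import importlib
import sys

# Block the C accelerator, then reload in case ElementTree was imported earlier.
sys.modules["_elementtree"] = None
import xml.etree.ElementTree as ET

ET = importlib.reload(ET)


class LineNumberingParser(ET.XMLParser):
    def _start(self, *args, **kwargs):
        # Private hook (assumption: it exists under this name in your Python).
        element = super()._start(*args, **kwargs)
        # expat reports the position of the start tag being handled.
        element._start_line_number = self.parser.CurrentLineNumber
        return element


# Hypothetical sample document, used only for this illustration.
SAMPLE = """<root>
  <item id="1"/>
  <item/>
</root>"""

root = ET.fromstring(SAMPLE, parser=LineNumberingParser())

# Every element produced by the parser now carries its starting line.
for element in root.iter():
    print(element.tag, element._start_line_number)
```

Note that the reload swaps the module-level classes for the pure-Python ones for the whole process, which is slower than the C implementation.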
Trentontrepan answered 5/4, 2016 at 15:10 Comment(7)
Thanks. This works on Python 2.7.11. There is an unnecessary ) after filename. Colenecoleopteran
Thanks, fixed the spurious bracket. Trentontrepan
Can someone add a line showing usage of the _start_line_number attribute? I'm trying tree.getroot()._start_line_number and getting AttributeError. Probationer
In Python 3, the function _start_list should be _start, both in the definition (def _start(self, *args, **kwargs):) and in the invocation (element = super(self.__class__, self)._start(*args, **kwargs)). Inspector
@Probationer I managed to make it work on Python 3.6. The key is to add this line: sys.modules['_elementtree'] = None before you import xml.etree.ElementTree for the first time anywhere in your program. For example, you can add sys.modules['_elementtree'] = None at the beginning of your script. Then, after calling tree = ET.parse(filename, parser=LineNumberingParser()), tree.getroot()._start_line_number will work. Svensen
TypeError: descriptor 'feed' for 'xml.etree.ElementTree.XMLParser' objects doesn't apply to a 'bytes' object Higley
I would like to add that calling reload with ET as the argument is also important if your lib/function is called in a different context such as pytest. Otherwise, you will not get those numbers. Ned
C
2

I've done this in elementtree by subclassing ElementTree.XMLTreeBuilder. Wherever I have access to self._parser (the Expat parser), it exposes the properties _parser.CurrentLineNumber and _parser.CurrentColumnNumber.

http://docs.python.org/py3k/library/pyexpat.html?highlight=xml.parser#xmlparser-objects has details about these attributes.

During parsing you could print out info, or put these values into the output XML element attributes.
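
As a rough sketch of the "print out info during parsing" idea, the following uses the documented xml.parsers.expat module directly rather than subclassing anything. The function name find_missing_attribute and the sample document are hypothetical, chosen only for this illustration.

```python
import xml.parsers.expat


def find_missing_attribute(xml_text, required):
    """Return (tag, line, column) for every start tag lacking `required`."""
    hits = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(tag, attrs):
        # attrs is a dict of attribute names to values for this start tag.
        if required not in attrs:
            hits.append(
                (tag, parser.CurrentLineNumber, parser.CurrentColumnNumber)
            )

    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return hits


# Hypothetical sample document.
SAMPLE = """<root>
  <item id="1"/>
  <item/>
</root>"""

hits = find_missing_attribute(SAMPLE, "id")
for tag, line, column in hits:
    print(f"<{tag}> at line {line}, column {column} has no id attribute")
```

This reports positions without building a tree at all, which may be enough if the goal is only to locate the problem elements.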

If your XML file includes additional XML files, you have to do some extra work to keep track of the current XML file. I don't remember the details, and it was not well documented.

Clea answered 5/8, 2011 at 3:53 Comment(0)
F
0

One (hackish) way of doing this is to insert a dummy attribute holding the line number into each element before parsing. Here's how I did this with minidom:

python reporting line/column of origin of XML node

This can be trivially adjusted to cElementTree (or in fact any other Python XML parser).
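
A minimal sketch of the same idea with ElementTree might look like the following. This is not the linked minidom code: the regex, the helper name parse_with_line_numbers, and the attribute name _line are assumptions made for this example, and the regex would also rewrite anything that merely looks like a start tag inside a CDATA section.

```python
import re
import xml.etree.ElementTree as ET

# Matches the opening "<name" of a start tag; it does not match end tags,
# comments, or processing instructions.
_START_TAG = re.compile(r"<([A-Za-z_][\w.:-]*)")


def parse_with_line_numbers(xml_text, attr="_line"):
    """Parse xml_text after tagging each start tag with its line number."""
    numbered = []
    for number, line in enumerate(xml_text.splitlines(), start=1):
        replacement = rf'<\g<1> {attr}="{number}"'
        numbered.append(_START_TAG.sub(replacement, line))
    return ET.fromstring("\n".join(numbered))


# Hypothetical sample document.
SAMPLE = """<root>
  <item id="1"/>
  <item/>
</root>"""

root = parse_with_line_numbers(SAMPLE)
for element in root.iter():
    print(element.tag, element.get("_line"))
```

Since the line number ends up as a regular attribute, it works with the C accelerator too, but it has to be stripped again before the tree is written back out.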

Fructificative answered 22/12, 2014 at 13:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.