Using Python's xml.etree to find element start and end character offsets

I have XML data that looks like:

<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>

I would like to be able to extract:

  1. The XML elements as they're currently provided in etree.
  2. The full plain text of the document, between the start and end tags.
  3. The location within the plain text of each start element, as a character offset.

(3) is the most important requirement right now; etree provides (1) fine.

I cannot see any way to do (3) directly, but hoped that iterating through the elements in the document tree would return many small strings that could be re-assembled, thus providing (2) and (3). However, requesting the .text of the root node only returns the text between the root's start tag and the first child element, e.g. "The capital of ".

Doing (1) with SAX could involve implementing a lot that's already been written many times over, in e.g. minidom and etree. Using lxml isn't an option for the package that this code is to go into. Can anybody help?

Orson answered 13/11, 2011 at 12:36 Comment(0)
C
5

The iterparse() function is available in xml.etree:

import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

# file is a filename or a file object opened in binary mode
for event, elem in etree.iterparse(file, events=('start', 'end')):
    if event == 'start':
        print(elem.tag)  # use only the tag name and attributes here
    elif event == 'end':
        # elem's children and elem.text are available; elem.tail may not be set yet
        if elem.text is not None and elem.tail is not None:
            print(repr(elem.tail))

Another option is to override start(), data(), end() methods of etree.TreeBuilder():

from xml.etree.ElementTree import XMLParser, TreeBuilder

class MyTreeBuilder(TreeBuilder):

    def start(self, tag, attrs):
        print("<%s>" % tag)
        return TreeBuilder.start(self, tag, attrs)

    def data(self, data):
        print(repr(data))
        TreeBuilder.data(self, data)

    def end(self, tag):
        return TreeBuilder.end(self, tag)

text = """<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""

# ElementTree.fromstring()
parser = XMLParser(target=MyTreeBuilder())
parser.feed(text)
root = parser.close() # return an ordinary Element

Output

<xml>
'\nThe captial of '
<place>
'South Africa'
' is '
<place>
'Pretoria'
'.\n'
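
Building on the same idea, requirements (2) and (3) can be covered by letting the builder remember how much character data it has seen so far. This is only a sketch: the class name OffsetTreeBuilder and its chunks, length and starts attributes are made up for illustration, and the offsets are counted in the plain text (markup excluded), not in the raw XML source.

```python
from xml.etree.ElementTree import XMLParser, TreeBuilder

class OffsetTreeBuilder(TreeBuilder):
    """Hypothetical helper: builds the usual tree and tracks plain-text offsets."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # pieces of character data, in document order
        self.length = 0    # number of plain-text characters seen so far
        self.starts = []   # (element, offset) pairs, one per start tag

    def start(self, tag, attrs):
        elem = super().start(tag, attrs)
        # The element starts where the plain text collected so far ends.
        self.starts.append((elem, self.length))
        return elem

    def data(self, data):
        # data() may be called several times for one run of text,
        # so accumulate lengths instead of assuming one call per run.
        self.chunks.append(data)
        self.length += len(data)
        return super().data(data)

text = """<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""

builder = OffsetTreeBuilder()
parser = XMLParser(target=builder)
parser.feed(text)
root = parser.close()                # (1) an ordinary Element tree
plain = ''.join(builder.chunks)      # (2) the full plain text
for elem, offset in builder.starts:  # (3) offset of each start tag in the plain text
    print(elem.tag, offset)
```

For the sample document, plain[16:28] is 'South Africa', so the first place element starts at offset 16 of the plain text.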
Connected answered 14/11, 2011 at 2:22 Comment(0)
D
4

You need to look at the .tail property as well as .text: .text gives you the text directly after a start tag, .tail gives you the text directly after the end tag. This will provide you with your "many small strings".

Tip: you can use etree.iterwalk(elem) (it does the same thing as etree.iterparse(), but over an existing tree) to iterate over the start and end tags. To give you the idea:

for event, elem in etree.iterwalk(xml_elem, events=('start', 'end')):
    if event == 'start':
        # it's a start tag
        print('starting element', elem.tag)
        print(elem.text)
    elif event == 'end':
        # it's an end tag
        print('ending element', elem.tag)
        if elem is not xml_elem:
            # don't want the text trailing xml_elem
            print(elem.tail)

I guess you can complete the rest for yourself? Warning: .text and .tail can be None, so if you want to concatenate them you will have to guard against that (use (elem.text or '') for example).
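
The .text/.tail bookkeeping can also be done with nothing but the ElementTree that ships with Python. The following is a rough sketch, not a polished implementation: the function name collect and its parts/spans arguments are invented here, and the offsets refer to the concatenated plain text rather than to the XML source.

```python
import xml.etree.ElementTree as ET

def collect(elem, parts, spans, pos=0):
    """Append elem's text pieces to parts and record (elem, start, end) in spans.

    Returns the plain-text position reached after elem's content.
    """
    index = len(spans)
    spans.append(None)          # reserve a slot so spans stay in document order
    start = pos
    if elem.text:               # text right after the start tag (may be None)
        parts.append(elem.text)
        pos += len(elem.text)
    for child in elem:
        pos = collect(child, parts, spans, pos)
        if child.tail:          # text right after the child's end tag (may be None)
            parts.append(child.tail)
            pos += len(child.tail)
    spans[index] = (elem, start, pos)
    return pos

text = """<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""

root = ET.fromstring(text)
parts, spans = [], []
collect(root, parts, spans)
plain = ''.join(parts)
for elem, start, end in spans:
    print(elem.tag, start, end, repr(plain[start:end]))
```

The root's own tail is deliberately ignored, matching the "don't want the text trailing xml_elem" check in the loop above.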

If you are familiar with sax (or have existing sax code that does what you need), lxml lets you produce sax events from an element or tree:

lxml.sax.saxify(elem, handler)

Some other things to look for when extracting all the text from an element: the .itertext() method, the xpath expression .//text() (lxml lets you return "smart strings" from xpath expressions: they allow you to check which element they belong to etc...).
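
For requirement (2) on its own, itertext() is also available in the ElementTree bundled with Python, not only in lxml. A minimal sketch (the variable names are arbitrary):

```python
import xml.etree.ElementTree as ET

text = """<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""

root = ET.fromstring(text)
# itertext() yields the inner text of root and its descendants in document
# order, which includes the tails of the child elements.
plain_text = ''.join(root.itertext())
print(repr(plain_text))
```

This gives the text only; it does not say which element each piece belongs to, so it does not cover (3) by itself.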

Distinguish answered 13/11, 2011 at 14:58 Comment(6)
Thanks! This looks perfect - although, I can only find iterwalk in lxml, and not the ElementTree that's bundled with Python. Am I looking in the wrong place? - Orson
You are correct, it's only in lxml. Sorry, I'm so used to using lxml that I assumed you were too (give it a try, it's great). But you should be able to make something yourself with the iter() method. - Distinguish
elem.text might be unavailable at event == 'start'. - Connected
@Leon Derczynski: iterparse() is available on all Python versions with xml.etree. - Connected
@J.F. Sebastian: iterparse() is, but iterwalk() is only in lxml. - Distinguish
Does itertext() not simply give you all text, in texts and tails? - Brubaker
S
1

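(3) can be done with the CurrentByteIndex attribute of the underlying expat parser (XMLParser.parser), like this:

import sys
# On Python 3 the C accelerator does not expose the expat parser, so force the
# pure-Python XMLParser; this must run before xml.etree.ElementTree is first imported.
sys.modules['_elementtree'] = None
import xml.etree.ElementTree as ET

class MyTreeBuilder(ET.TreeBuilder):
    def start(self, tag, attrs):
        # byte offset of the "<" that opens the current start tag
        print(self.parser.parser.CurrentByteIndex)
        return ET.TreeBuilder.start(self, tag, attrs)

builder = MyTreeBuilder()
parser = ET.XMLParser(target=builder)
builder.parser = parser
tree = ET.parse('test.xml', parser=parser)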

See also this answer for a SAX alternative. Take note however that the byte index is not the same as character index, and there may not be an efficient way to translate byte to character index in Python. (See also here.)

An (admittedly ugly) workaround to get character offsets instead of byte offsets is to recode bytes as characters. Assuming the actual encoding is utf8:

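import sys
# Force the pure-Python XMLParser (needed on Python 3 to reach .parser);
# this must run before xml.etree.ElementTree is first imported.
sys.modules['_elementtree'] = None
import xml.etree.ElementTree as ET

class MyTreeBuilder(ET.TreeBuilder):
    def start(self, tag, attrs):
        print(self.parser.parser.CurrentByteIndex)
        return ET.TreeBuilder.start(self, tag, attrs)

builder = MyTreeBuilder()
parser = ET.XMLParser(target=builder)
builder.parser = parser
with open('test.xml', 'rb') as f:
    parser.feed(f.read().decode('latin1').encode('utf8'))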
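
A different way to turn a byte offset into a character offset, offered only as a sketch, is to decode the bytes that precede it and count the resulting characters. It assumes the raw bytes of the document are at hand, that the encoding is known (UTF-8 here), and that the byte offset falls on a character boundary, which holds for the "<" of a start tag. The helper name byte_to_char_offset is made up for this example.

```python
def byte_to_char_offset(raw, byte_index, encoding='utf-8'):
    """Return the character offset that corresponds to byte_index in raw.

    Decodes the prefix on every call, so each lookup costs O(byte_index).
    """
    return len(raw[:byte_index].decode(encoding))

# A small document with non-ASCII characters before the tag of interest.
raw = '<xml>C\u00f4te d\u2019Ivoire: <place>Abidjan</place></xml>'.encode('utf-8')

byte_index = raw.index(b'<place>')   # stands in for a CurrentByteIndex value
char_index = byte_to_char_offset(raw, byte_index)
print(byte_index, char_index)
```

Here the two offsets differ by three, because the two non-ASCII characters occupy two and three bytes in UTF-8.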
Sullen answered 10/8, 2016 at 14:19 Comment(0)
P
0

(2) is easy with SAX; see this snippet:

from xml.sax.handler import ContentHandler
import xml.sax
import sys

class textHandler(ContentHandler):
    def characters(self, ch):
        # ch is already a str in Python 3, so write it out directly
        sys.stdout.write(ch)

parser = xml.sax.make_parser()
handler = textHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")

or Example 1-1 (bookhandler.py) in this book: http://oreilly.com/catalog/pythonxml/chapter/ch01.html

(3) is trickier; see this thread. It is about Java, but there should be a similar facility in the Python SAX API: How do I get the correct starting/ending locations of a xml tag with SAX?
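
One candidate for that "similar thing" in Python is the document locator that xml.sax passes to the handler through setDocumentLocator(). The sketch below records the line and column at which each start tag begins. The handler name PositionHandler and its positions list are made up for the example; with the default expat-based reader, lines are 1-based and columns 0-based, and whether columns count characters or bytes for non-ASCII input is something to verify before relying on it.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class PositionHandler(ContentHandler):
    """Hypothetical handler that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.positions = []   # (tag name, line, column) triples

    def setDocumentLocator(self, locator):
        # Called by the parser before any other event is reported.
        self.locator = locator

    def startElement(self, name, attrs):
        self.positions.append(
            (name, self.locator.getLineNumber(), self.locator.getColumnNumber())
        )

text = """<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""

handler = PositionHandler()
xml.sax.parseString(text.encode('utf-8'), handler)
for name, line, column in handler.positions:
    print(name, line, column)
```

A line/column pair still has to be converted into a single character offset, for example by summing the lengths of the preceding lines of the source.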

Pomcroy answered 13/11, 2011 at 12:56 Comment(1)
Thanks! (2) and (3) are certainly easier with SAX. The last time I had this problem I used both SAX and minidom, but aligning the results of these two is a problem not worth approaching. I would move to SAX if I could do (1) easily enough. Do you know of any approaches for that? - Orson
S
0

You can easily do all of this using Pawpaw:

Code:

import sys
sys.modules['_elementtree'] = None
import xml.etree.ElementTree as ET
from pawpaw import Ito, visualization, xml
text = """<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""
root = ET.fromstring(text, parser=xml.XmlParser())

print('1. ET elements:\n')
print(elements := root.findall('.//'))
print()

print('2. Full plain text of document between start and end tags:\n')
start_tag = root.ito.find('*[d:start_tag]')
end_tag = root.ito.find('*[d:end_tag]')
ito = Ito(text, start_tag.stop, end_tag.start)
print(f'{ito:%substr!r}')
print()

print('3. Character offsets of plain text of each element:\n')
for e in elements:
    plain_text = e.ito.find('*[d:text]')
    print(f'{plain_text:%span: "%substr"}')
print()

Output:

1. ET elements:

[<Element 'place' at 0x1b0ffx203a0>, <Element 'place' at 0x1b0ffx21240>]

2. Full plain text of document between start and end tags:

'\nThe captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.\n'

3. Character offsets of plain text of each element:

(36, 48) "South Africa"
(67, 75) "Pretoria"

Bonus: using Pawpaw, you can get the character offset of any xml segment, such as:

  • elements
  • attributes
  • namespaces
  • tags
  • etc.

Example:

v_tree = visualization.pepo.Tree()
print(v_tree.dumps(root.ito))

Output:

(0, 91) 'element' : '<xml>\nThe captial o…ia</place>.\n</xml>'
├──(0, 5) 'start_tag' : '<xml>'
│  └──(1, 4) 'tag' : 'xml'
│     └──(1, 4) 'name' : 'xml'
├──(5, 21) 'text' : '\nThe captial of '
├──(21, 56) 'element' : '<place pid="1">South Africa</place>'
│  ├──(21, 36) 'start_tag' : '<place pid="1">'
│  │  ├──(22, 27) 'tag' : 'place'
│  │  │  └──(22, 27) 'name' : 'place'
│  │  └──(28, 35) 'attributes' : 'pid="1"'
│  │     └──(28, 31) 'attribute' : 'pid="1"'
│  │        ├──(28, 31) 'tag' : 'pid'
│  │        │  └──(28, 31) 'name' : 'pid'
│  │        └──(33, 34) 'value' : '1'
│  ├──(36, 48) 'text' : 'South Africa'
│  └──(48, 56) 'end_tag' : '</place>'
│     └──(50, 55) 'tag' : 'place'
│        └──(50, 55) 'name' : 'place'
├──(56, 60) 'text': ' is '
├──(60, 83) 'element' : '<place>Pretoria</place>'
│  ├──(60, 67) 'start_tag' : '<place>'
│  │  └──(61, 66) 'tag' : 'place'
│  │     └──(61, 66) 'name' : 'place'
│  ├──(67, 75) 'text' : 'Pretoria'
│  └──(75, 83) 'end_tag' : '</place>'
│     └──(77, 82) 'tag' : 'place'
│        └──(77, 82) 'name' : 'place'
├──(83, 85) 'text': '.\n'
└──(85, 91) 'end_tag' : '</xml>'
   └──(87, 90) 'tag' : 'xml'
      └──(87, 90) 'name' : 'xml'
Spoke answered 7/2, 2023 at 20:2 Comment(0)
