Capture all XML element paths using xml.etree.ElementTree
Asked Answered
M

1

6

Using python import lxml I am able to print a list of the path for every element recursively:

from lxml import etree
root = etree.parse(xml_file)
for e in root.iter():
    path = root.getelementpath(e)
    print(path)

Results:

TreatmentEpisodes
TreatmentEpisodes/TreatmentEpisode
TreatmentEpisodes/TreatmentEpisode/SourceRecordIdentifier
TreatmentEpisodes/TreatmentEpisode/FederalTaxIdentifier
TreatmentEpisodes/TreatmentEpisode/ClientSourceRecordIdentifier
etc.

Note: I am working with this XSD: https://www.myflfamilies.com/service-programs/samh/155-2/155-2-v14/schemas/TreatmentEpisodeDataset.xsd

I want to do the same thing using import xml.etree.ElementTree as ET ...but ElementTree does not seem to have an equivalent function to lxml getelementpath().

I've read the docs. I've googled for days. I've experimented with XPath. I've guessed using iter() and tried "getpath()", "Element.getpath()", etc. hoping to discover an undocumented feature. Fail.

Perhaps I am experiencing an extreme case of "user error" and please forgive me if this is a duplicate.

I thought I found the answer here: Get Xpath dynamically using ElementTree getpath() but the XPathEvaluator only seems to operate on a 'known' element - it doesn't have an option for "give me everything".

Here is what I tried:

import xml.etree.ElementTree as ET
tree = etree.parse(xml_file)
for entry in tree.xpath('//TreatmentEpisode'):
    print(entry)

Results:

<Element TreatmentEpisode at 0xffff8f8c8a00>

What I was hoping for:

TreatmentEpisodes/TreatmentEpisode

...however, even if I received what I hoped for, I am still not sure how to obtain the full path for every element. As I understand the XPath docs, they only operate on 'known' element names. i.e. tree.xpath() seems to require the element name to be known beforehand.

Mercy answered 1/7, 2021 at 18:35 Comment(10)
It sounds like you did research and attempted code to solve it ... now provide at least what you "experimented with XPath", your code - even if failing: minimal reproducible example. So we can see how to adjust.Robespierre
Fair. I can say this: I thought I found the answer here: #13136834 but the XPathEvaluator only seems to operate on a 'known' element - it doesn't have an option for "give me everything". However, I will put together more examples of what I tried and edit my question.Mercy
@Robespierre question updated with example attempt.Mercy
Shouldn’t be too hard to write a function to build the path of an element. Did you try that?Tenebrae
I take that back - hadn’t realised there isn’t a parent attribute :-( However this shows a way to build a child->parent dictionary which can easily be the basis of getting the element path #2171110Tenebrae
@barny - your first comment was the trick. I never tried the function example in the later half of the SO link I quoted in my own question! Definitely severe user error going on here as I tried that function just now on your suggestion and it worked! Perhaps my eyes failed from reading too many forums but that was the answer right in front of me. Davide Brunato's post should be super flagged or something.Mercy
You should probably upvote that answerTenebrae
@barny Tried. Won't let me until I have "15 reputation". I've been lurking here for 15 years - this was my first post. :(Mercy
Ok - try not to forget to upvote once you have the repTenebrae
@Mercy you should have enough reputation to mark the correct answer and upvote it now :) - I think.Venose
S
7

Start from:

import xml.etree.ElementTree as et

An interesting way to solve your problem is to use iterparse - an iterative parser contained in ElementTree.

It is able to report e.g. each start and end event, for each element parsed. For details search the Web for documentation / examples of iterparse.

The idea is to:

  1. Start with an empty list as the path.
  2. At the start event, append the element name to path and print the full path gathered so far.
  3. At the end event, drop the last element from path.

You can even wrap this code in a generator function:

def pathGen(fn):
    path = []
    it = et.iterparse(fn, events=('start', 'end'))
    for evt, el in it:
        if evt == 'start':
            path.append(el.tag)
            yield '/'.join(path)
        else:
            path.pop()

Now, when you run:

for pth in pathGen('Input.xml'):
    print(pth)

you will get a printout of full paths of all elements in your source file, something like:

TreatmentEpisodes
TreatmentEpisodes/TreatmentEpisode
TreatmentEpisodes/TreatmentEpisode/SourceRecordIdentifier
TreatmentEpisodes/TreatmentEpisode/FederalTaxIdentifier
TreatmentEpisodes/TreatmentEpisode/ClientSourceRecordIdentifier
TreatmentEpisodes/TreatmentEpisode
TreatmentEpisodes/TreatmentEpisode/SourceRecordIdentifier
TreatmentEpisodes/TreatmentEpisode/FederalTaxIdentifier
TreatmentEpisodes/TreatmentEpisode/ClientSourceRecordIdentifier
...
Saltworks answered 3/7, 2021 at 19:42 Comment(2)
I would upvote your answer but I still don't have the minimum reputation to do so!Mercy
For future readers of this post, Valdi_Bo's answer works and also Davide Brunato's answer in this post: #13136834Mercy

© 2022 - 2024 — McMap. All rights reserved.