Parse XML with Python Etree and Return Specified Tag Regardless of Namespace

I am working with some XML data that, in some locations in each file, redefines the namespace. I'm trying to pull all tags of a specific type from the document regardless of the namespace that's active at the point where the tag resides in the XML.

I'm using findall('.//{namespace}Tag') to find the elements I'm looking for. But because I never know what the {namespace} will be at any given point in the file, it's hit or miss whether I get all the requested Tags returned.

Is there a way to return all the Tag elements regardless of the {namespace} they fall under? Something along the lines of findall('.//{wildcard}Tag')?

Glabrescent answered 20/11, 2011 at 5:49 Comment(1)
Given this question hasn't gotten an answer in some time now, here are some suggestions. If you have already solved your problem, great! But be sure to also post it here so we don't get a Fermat thread on our hands. If not, a code and XML example of the problem would be handy. - Chip

The xpath function of lxml supports local-name(), which lets you match on the tag name without its namespace!

Here is a Python 3 example:

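import io
from lxml import etree

# Two prefixes bound to different namespaces, each containing a <name> element.
xmlstring = '''<root
xmlns:m="http://www.w3.org/html4/"
xmlns:n="http://www.w3.org/html5/">
<m:table>
  <m:tr>
    <m:name>Sometext</m:name>
  </m:tr>
</m:table>
<n:table>
  <n:name>Othertext</n:name>
</n:table>
</root>'''

# In Python 3, xmlstring is a text string, which is what io.StringIO expects.
root = etree.parse(io.StringIO(xmlstring))
# local-name() is the tag name with the namespace stripped, so this
# matches <name> elements in any namespace.
names = root.xpath("//*[local-name() = 'name']")
for name in names:
    print(name.text)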

Your question might have been answered before at: lxml etree xmlparser namespace problem
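If you would rather stay with the standard library, something close to the findall('.//{wildcard}Tag') idea from the question exists there too: since Python 3.8, xml.etree.ElementTree accepts {*} as a namespace wildcard in findall paths. This is only a sketch, reusing the sample XML from above and assuming Python 3.8 or newer; the variable name texts is mine.

```python
import xml.etree.ElementTree as ET

xmlstring = '''<root
xmlns:m="http://www.w3.org/html4/"
xmlns:n="http://www.w3.org/html5/">
<m:table>
  <m:tr>
    <m:name>Sometext</m:name>
  </m:tr>
</m:table>
<n:table>
  <n:name>Othertext</n:name>
</n:table>
</root>'''

root = ET.fromstring(xmlstring)

# "{*}name" matches <name> in any namespace, or in none (Python 3.8+).
names = root.findall('.//{*}name')

# Collect the text of each match, in document order.
texts = [name.text for name in names]
print(texts)
```

Note that the standard library's limited XPath support does not include local-name(), so the lxml approach above and this wildcard are two separate routes to the same result.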

Parkins answered 16/4, 2012 at 21:5 Comment(3)
What does this output? Did you run it? Not sure if this actually works. - Parallelism
I get this error: Traceback (most recent call last): File "xml_test.py", line 15, in <module> root = etree.parse(io.StringIO(xmlstring)) TypeError: initial_value must be unicode or None, not str - Parallelism
The actual output is "Sometext\nOthertext\n" - Parkins
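On the TypeError in the comments: that message is what Python 2's io.StringIO raises when it is handed a byte str instead of unicode, which would explain why the Python 3 example fails there. One way around the text-versus-bytes question is to parse from a byte stream and compare local names by hand. The sketch below uses the standard library and is written for Python 3; the helpers local_name and find_by_local_name are hypothetical names made up for illustration, not part of lxml or ElementTree.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree-style tag."""
    return tag.rsplit('}', 1)[-1]


def find_by_local_name(root, wanted):
    """Return all elements under root whose local tag name equals wanted."""
    return [
        el for el in root.iter()
        # Skip nodes whose tag is not a string (e.g. comments in some parsers).
        if isinstance(el.tag, str) and local_name(el.tag) == wanted
    ]


# Bytes input, so the parser handles decoding itself.
xml_bytes = b'''<root
xmlns:m="http://www.w3.org/html4/"
xmlns:n="http://www.w3.org/html5/">
<m:table>
  <m:tr>
    <m:name>Sometext</m:name>
  </m:tr>
</m:table>
<n:table>
  <n:name>Othertext</n:name>
</n:table>
</root>'''

tree = ET.parse(io.BytesIO(xml_bytes))
found = [el.text for el in find_by_local_name(tree.getroot(), 'name')]
print(found)
```

Because it only relies on iter() and the '{namespace}tag' naming, this does not need the {*} wildcard or local-name() support.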
