Parsing XML with namespace in Python via 'ElementTree'
O

8

200

I have the following XML which I want to parse using Python's ElementTree:

<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">

    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
        <rdfs:comment xml:lang="en">
          a group of sports teams that compete against each other
          in Basketball
        </rdfs:comment>
    </owl:Class>

</rdf:RDF>

I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:

import xml.etree.ElementTree as ET

tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')

Because of the namespace, I am getting the following error.

SyntaxError: prefix 'owl' not found in prefix map

I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.

Kindly let me know how to change the code to find all the owl:Class tags.

Odontoblast answered 13/2, 2013 at 12:8 Comment(1)
Since Python 3.8, a namespace wildcard can be used with find(), findall() and findtext(). See https://mcmap.net/q/122218/-how-to-use-python-xml-findall-to-find-39-lt-v-imagedata-r-id-quot-rid7-quot-o-title-quot-1-ren-quot-gt-39.Reinhardt
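The wildcard mentioned in this comment might look like the sketch below. It assumes Python 3.8 or later, and the inline XML string is a trimmed stand-in for the question's file rather than the real data.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document (an assumption for illustration).
XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    <rdfs:label xml:lang="en">basketball league</rdfs:label>
  </owl:Class>
</rdf:RDF>"""

root = ET.fromstring(XML)

# "{*}Class" matches an element named Class in any namespace (Python 3.8+).
classes = root.findall("{*}Class")
labels = [cls.findtext("{*}label") for cls in classes]
print(labels)
```

The wildcard ignores the namespace entirely, so it will also match a Class element from an unrelated vocabulary if the document happens to contain one.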
A
265

You need to give the .find(), .findall() and .iterfind() methods an explicit namespace dictionary:

namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed

root.findall('owl:Class', namespaces)

Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl#}Class instead. You can of course use the same syntax yourself:

root.findall('{http://www.w3.org/2002/07/owl#}Class')

Also see the Parsing XML with Namespaces section of the ElementTree documentation.

If you can switch to the lxml library, things are better; that library supports the same ElementTree API, but collects namespaces for you in the .nsmap attribute on elements and generally has superior namespace support.
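Putting the pieces together for the question's document, a minimal sketch could look like this. The inline XML string stands in for the file, and the names NS, XML_LANG and results are chosen here for illustration; the xml: prefix is reserved and always maps to the URI used in XML_LANG.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document (an assumption for illustration).
XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    <rdfs:label xml:lang="en">basketball league</rdfs:label>
  </owl:Class>
</rdf:RDF>"""

# The prefixes are our own choice; only the URIs have to match the document.
NS = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
# Attributes with the reserved xml: prefix are stored under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(XML)

results = []
for cls in root.findall("owl:Class", NS):
    for label in cls.findall("rdfs:label", NS):
        # Collect the label text together with its xml:lang attribute.
        results.append((label.text, label.get(XML_LANG)))

print(results)
```

Passing the dictionary positionally, as here, should also work on older versions where the keyword form was not accepted.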

Alewife answered 13/2, 2013 at 12:18 Comment(11)
Thanks. Especially for the second part, where you can give the namespace directly.Odontoblast
Thank you. Any idea how I can get the namespace directly from the XML, without hard-coding it? Or how can I ignore it? I've tried findall('{*}Class') but it won't work in my case.Egomania
You'd have to scan the tree for xmlns attributes yourself; as stated in the answer, lxml does this for you, the xml.etree.ElementTree module does not. But if you are trying to match a specific (already hardcoded) element, then you are also trying to match a specific element in a specific namespace. That namespace is not going to change between documents any more than the element name is. You may as well hardcode that with the element name.Alewife
@Jon: register_namespace only influences serialisation, not search.Alewife
Small addition that may be useful: when using cElementTree instead of ElementTree, findall will not take namespaces as a keyword argument, but rather simply as a normal argument, i.e. use ctree.findall('owl:Class', namespaces).Anteater
@egpbos: adjusted to be cElementTree compatible.Alewife
Thanks a lot Martijn, where did you find that findall() takes an extra argument? docs.python.org does not mention it.Biretta
@Bludwarf: The docs do mention it (now, if not when you wrote that), but you have to read them verrrry carefully. See the Parsing XML with Namespaces section: there's an example contrasting the use of findall without and then with the namespace argument, but the argument is not mentioned as one of the arguments to the method in the Element object section.Alumna
@MartijnPieters, how do I get the value of the attribute xml:lang of the rdfs:label element?Equidistant
Just a reminder. It took me hours of debugging to find that the second parameter of findtext() is not namespaces. So it needs to be written as findtext('./prefix:tag', namespaces=prefix_map)Alphonsa
@Alphonsa more recent Python 3 versions use the Argument Clinic to handle argument parsing for most cElementTree methods and thus find and findall now support namespaces as a keyword argument.Alewife
L
68

Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):

from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)

UPDATE:

5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes to merging documents, but I think most people are having difficulty simply searching documents.

Here's another case and how I handled it:

<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>

xmlns without a prefix means that unprefixed tags get this default namespace. This means that when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So I created a new namespace dictionary like this:

namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k, v in root.nsmap.items():
    if not k:
        namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)
Lipophilic answered 7/11, 2014 at 18:22 Comment(4)
The full namespace URL is the namespace identifier you're supposed to hard-code. The local prefix (owl) can change from file to file. Therefore doing what this answer suggests is a really bad idea.Tether
@MattiVirkkunen exactly: if the owl definition can change from file to file, shouldn't we use the definition given in each file instead of hardcoding it?Wayzgoose
@LoïcFaure-Lacroix: Usually XML libraries will let you abstract that part out. You don't need to even know or care about the prefix used in the file itself, you just define your own prefix for the purpose of parsing or just use the full namespace name.Tether
This answer helped me to at least be able to use the find function. No need to create your own prefix. I just did key = list(root.nsmap.keys())[0] and then added the key as prefix: root.find(f'{key}:Tag2', root.nsmap)Retiform
A
46

Note: This is an answer useful for Python's ElementTree standard library without using hardcoded namespaces.

To extract the namespace prefixes and URIs from XML data, you can use the ElementTree.iterparse function, parsing only namespace start events (start-ns):

>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
...     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
...     xmlns:owl="http://www.w3.org/2002/07/owl#"
...     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
...     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
...     xmlns="http://dbpedia.org/ontology/">
... 
...     <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
...         <rdfs:label xml:lang="en">basketball league</rdfs:label>
...         <rdfs:comment xml:lang="en">
...           a group of sports teams that compete against each other
...           in Basketball
...         </rdfs:comment>
...     </owl:Class>
... 
... </rdf:RDF>'''
>>> my_namespaces = dict([
...     node for _, node in ElementTree.iterparse(
...         StringIO(my_schema), events=['start-ns']
...     )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
 'xsd': 'http://www.w3.org/2001/XMLSchema#'}

Then the dictionary can be passed as argument to the search functions:

root.findall('owl:Class', my_namespaces)
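The snippet above does not show where root comes from. One way to get it, sketched here under the assumption that a single pass over the input is wanted, is the root attribute that the iterparse iterator exposes once the source has been read completely. The inline XML is a trimmed stand-in for the question's document.

```python
from io import StringIO
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document (an assumption for illustration).
XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    <rdfs:label xml:lang="en">basketball league</rdfs:label>
  </owl:Class>
</rdf:RDF>"""

# iterparse builds the tree while reporting namespace declarations, so one
# pass can yield both the prefix map and the root element.
events = ET.iterparse(StringIO(XML), events=["start-ns"])
my_namespaces = dict(node for _event, node in events)
root = events.root  # set only after the iterator has been exhausted

classes = root.findall("owl:Class", my_namespaces)
print(my_namespaces["owl"], len(classes))
```

If the same prefix is declared more than once with different URIs, the dictionary keeps only the last declaration, so this sketch suits documents that declare each prefix once.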
Abshire answered 24/5, 2016 at 9:9 Comment(8)
This is useful for those of us without access to lxml and without wanting to hardcode namespace.Markland
I got the error: ValueError: write to closed file for this line: my_namespaces = dict([node for _, node in ET.iterparse(StringIO(my_schema), events=['start-ns'])]). Any idea what's wrong?Sivie
The error is probably related to the class io.StringIO, which refuses ASCII strings. I had tested my recipe with Python 3. With the unicode string prefix 'u' added to the sample string, it also works with Python 2 (2.7).Abshire
Instead of dict([...]) you can also use dict comprehension.Incogitable
Instead of StringIO(my_schema) you can also put the filename of the XML file.Col
This is exactly what I was looking for! Thank you!Chesty
Where is root defined, that calls findall()?Danell
No, iterparse() is not related to find/findall/iterfind. It uses the XML parser to iterate over tree nodes, including the start and the end (so the scope) of namespace declarations.Abshire
S
7

I've been using similar code to this and have found it's always worth reading the documentation... as usual!

findall() will only find elements which are direct children of the current tag. So, not really ALL.

It might be worth your while trying to get your code working with the following, especially if you're dealing with big and complex XML files, so that sub-sub-elements (etc.) are also included. If you know where the elements are in your XML, then I suppose it'll be fine! Just thought this was worth remembering.

root.iter()

ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements "Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:"
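The difference between direct children and descendants can be checked with a small sketch like the one below. The XML string is a trimmed stand-in for the question's document and the variable names are illustrative. Note that iter() takes no namespace dictionary, so it needs the fully qualified {uri}tag form.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document (an assumption for illustration).
XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    <rdfs:label xml:lang="en">basketball league</rdfs:label>
  </owl:Class>
</rdf:RDF>"""

NS = {"rdfs": "http://www.w3.org/2000/01/rdf-schema#"}
root = ET.fromstring(XML)

# A bare tag name only looks at direct children; the labels sit one level down.
direct = root.findall("rdfs:label", NS)

# The ".//" prefix searches the whole subtree below the current element.
nested = root.findall(".//rdfs:label", NS)

# iter() also walks all descendants, but needs the qualified tag name.
via_iter = list(root.iter("{http://www.w3.org/2000/01/rdf-schema#}label"))

print(len(direct), len(nested), len(via_iter))
```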

Simmonds answered 16/8, 2016 at 9:51 Comment(1)
The ElementTree documentation is a bit unclear and easy to misunderstand, IMHO. It is possible to get all descendants. Instead of elem.findall("X"), use elem.findall(".//X").Reinhardt
I
7

To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:

import re

root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)

This way, you can use it later on in your code to find nodes, e.g. using string interpolation (Python 3).

link = root.find(f"{ns}link")
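A slightly more defensive variant of this idea is sketched below. The helper name namespace_of and the Atom-style sample document are assumptions made for illustration. The helper returns an empty string when the tag carries no namespace, so the match never fails with an AttributeError.

```python
import re
import xml.etree.ElementTree as ET


def namespace_of(tag):
    """Return the '{uri}' part of a tag, or '' when the tag has no namespace."""
    match = re.match(r"\{[^}]*\}", tag)
    return match.group(0) if match else ""


# Hypothetical sample document that uses a default namespace.
root = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<link href="http://example.com/"/>'
    "</feed>"
)

ns = namespace_of(root.tag)
link = root.find(f"{ns}link")
print(ns, link.get("href"))
```

This only helps for elements that share the root's namespace; children in a different namespace still need their own URI.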
Isopropyl answered 1/10, 2018 at 12:25 Comment(0)
A
3

This is basically Davide Brunato's answer; however, I found that his answer had serious problems with the default namespace being the empty string, at least on my Python 3.6 installation. The function I distilled from his code, and that worked for me, is the following:

from io import StringIO
from xml.etree import ElementTree
def get_namespaces(xml_string):
    namespaces = dict([
            node for _, node in ElementTree.iterparse(
                StringIO(xml_string), events=['start-ns']
            )
    ])
    namespaces["ns0"] = namespaces[""]
    return namespaces

where ns0 is just a placeholder for the empty namespace and you can replace it by any random string you like.

If I then do:

my_namespaces = get_namespaces(my_schema)
root.findall('ns0:SomeTagWithDefaultNamespace', my_namespaces)

It produces the correct answer for tags using the default namespace as well.
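One caveat with the function above is that it raises a KeyError when the document declares no default namespace. A possible variation, sketched here with an assumed default_prefix parameter, moves the empty key to a named prefix only when it exists. The sample document is the one from the lxml answer earlier on this page.

```python
from io import StringIO
import xml.etree.ElementTree as ET


def get_namespaces(xml_string, default_prefix="ns0"):
    """Collect prefix -> URI pairs, renaming the default namespace if present."""
    namespaces = dict(
        node
        for _event, node in ET.iterparse(StringIO(xml_string), events=["start-ns"])
    )
    # Move the '' entry (the default namespace) to a usable prefix.
    default_uri = namespaces.pop("", None)
    if default_uri is not None:
        namespaces[default_prefix] = default_uri
    return namespaces


doc = '<Tag1 xmlns="http://www.mynamespace.com/prefix"><Tag2>content</Tag2></Tag1>'
ns = get_namespaces(doc)
tag2 = ET.fromstring(doc).find("ns0:Tag2", ns)
print(ns, tag2.text)
```

Removing the empty key rather than keeping it alongside the alias is a design choice of this sketch: it keeps the dictionary's behaviour the same across Python versions, since newer versions give the empty key a special meaning in searches.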

Atworth answered 7/4, 2021 at 16:13 Comment(0)
H
1

My solution is based on @Martijn Pieters' comment:

register_namespace only influences serialisation, not search.

So the trick here is to use different dictionaries for serialization and for searching.

namespaces = {
    '': 'http://www.example.com/default-schema',
    'spec': 'http://www.example.com/specialized-schema',
}

Now, register all namespaces for parsing and writing:

for name, value in namespaces.items():
    ET.register_namespace(name, value)

For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be done only after the namespaces are registered).

namespaces['default'] = namespaces['']

Now, the functions from the find() family can be used with the default prefix:

print(root.find('default:myelem', namespaces))

but

tree.write(destination)

does not use any prefixes for elements in the default namespace.
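The split between serialisation and searching can be sketched as a self-contained example. The element names and the DEFAULT_URI constant are illustrative, and ET.tostring is used instead of tree.write so that the result can be inspected as a string. Keep in mind that register_namespace changes module-wide state.

```python
import xml.etree.ElementTree as ET

DEFAULT_URI = "http://www.example.com/default-schema"

root = ET.fromstring(
    f'<root xmlns="{DEFAULT_URI}"><myelem>x</myelem></root>'
)

# Serialisation: registering an empty prefix makes this URI the default
# namespace on output, so no generated ns0: prefixes appear.
ET.register_namespace("", DEFAULT_URI)
serialized = ET.tostring(root, encoding="unicode")

# Searching: the prefix map passed to find() is independent of the registry,
# so any non-empty prefix of our choosing works here.
found = root.find("default:myelem", {"default": DEFAULT_URI})

print(serialized)
print(found.text)
```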

Hinshaw answered 30/5, 2019 at 11:0 Comment(3)
For python version 3.11 use namespaces.items() instead of namespaces.iteritems().Perice
@Perice More generally it applies to any version 3.0 or higher.Granulite
@Hermann12: Yes, I've updated my answer accordingly. Thank you for pointing it out. Yet, .find() and .iterfind() are ElementTree's methods and they behave differently.Hinshaw
G
0

A slightly longer alternative is to create another class, ElementNS, which inherits from ET.Element and includes the namespaces, then create a factory for this class, which is passed on to the parser:

import xml.etree.ElementTree as ET


def parse_namespaces(source):
    return dict(node for _e, node in ET.iterparse(source, events=['start-ns']))


def create_element_factory(namespaces):
    def element_factory(tag, attrib):
        el = ElementNS(tag, attrib)
        el.namespaces = namespaces
        return el
    return element_factory


class ElementNS(ET.Element):
    namespaces = None

    # Patch methods to include namespaces
    def find(self, path):
        return super().find(path, self.namespaces)

    def findtext(self, path, default=None):
        return super().findtext(path, default, self.namespaces)

    def findall(self, path):
        return super().findall(path, self.namespaces)

    def iterfind(self, path):
        return super().iterfind(path, self.namespaces)


def parse(source):
    # Set up parser with namespaced element factory
    namespaces = parse_namespaces(source)
    element_factory = create_element_factory(namespaces)
    tree_builder = ET.TreeBuilder(element_factory=element_factory)
    parser = ET.XMLParser(target=tree_builder)
    element_tree = ET.ElementTree()

    return element_tree.parse(source, parser=parser)

Then findall can be used without passing namespaces:

document = parse("filename")
document.findall("owl:Class")
Granulite answered 15/12, 2023 at 18:38 Comment(2)
Very complicated to reach the same result as described above.Perice
@Perice if you've written e.find(..., namespaces) a couple dozen times, it makes sense to make a class for it, so you only have to write e.find(...). However note that this is likely slower, as it can't rely on the C implementation of ET.Element.Granulite
