Accessing XMLNS attribute with Python ElementTree?

How can one access namespace (xmlns) attributes using ElementTree?

With the following:

<data xmlns="http://www.foo.net/a" xmlns:a="http://www.foo.net/a" book="1" category="ABS" date="2009-12-22">

When I call root.get('xmlns') I get back None, while category and date come back fine. Any help appreciated.

Gutsy answered 23/12, 2009 at 16:19 Comment(2)
I can't answer your question, but having struggled against this shortcoming for a couple of days I'm prepared to claim that it isn't possible with the current ElementTree API. In my application I needed to detect whether an xmlns:xlink attribute already existed on the root element and, if not, add it. It's not possible to test whether an xmlns attribute already exists and, what is more, ElementTree is happy to add it twice if you try. Since either zero or two identical xmlns attributes in the same element cause an error in most XML consumers, this makes ElementTree very difficult to use. Isbella
This is a very relevant answer now: from 2017 timeframeTroll
18

I think element.tag is what you're looking for. Note that your example is missing a trailing slash, so it's unbalanced and won't parse. I've added one in my example.

>>> from xml.etree import ElementTree as ET
>>> data = '''<data xmlns="http://www.example.net/a"
...                 xmlns:a="http://www.example.net/a"
...                 book="1" category="ABS" date="2009-12-22"/>'''
>>> element = ET.fromstring(data)
>>> element
<Element {http://www.example.net/a}data at 1013b74d0>
>>> element.tag
'{http://www.example.net/a}data'
>>> element.attrib
{'category': 'ABS', 'date': '2009-12-22', 'book': '1'}

If you just want to know the xmlns URI, you can split it out with a function like:

def tag_uri_and_name(elem):
    if elem.tag[0] == "{":
        uri, ignore, tag = elem.tag[1:].partition("}")
    else:
        uri = None
        tag = elem.tag
    return uri, tag

For much more on namespaces and qualified names in ElementTree, see effbot's examples.
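As a quick check of the helper above, here is a minimal sketch that applies it to every element of a small document. The sample XML strings are made up for illustration; the helper itself is the same logic as in the answer.

```python
import xml.etree.ElementTree as ET

def tag_uri_and_name(elem):
    # Split ElementTree's '{uri}local' tag form into (uri, local).
    if elem.tag[0] == "{":
        uri, _, tag = elem.tag[1:].partition("}")
    else:
        uri, tag = None, elem.tag
    return uri, tag

# Hypothetical sample document with a default namespace.
XML = '<data xmlns="http://www.example.net/a" book="1"><item/></data>'
root = ET.fromstring(XML)

# Child elements inherit the default namespace, so both report the same URI.
results = [tag_uri_and_name(e) for e in root.iter()]
print(results)

# An element with no namespace yields None for the URI.
print(tag_uri_and_name(ET.fromstring("<plain/>")))
```

Note that the xmlns declaration still does not show up in root.attrib; the URI is only recoverable from the tag.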

Huckleberry answered 23/12, 2009 at 18:3 Comment(3)
Why is there not a function like this in the library? It seems like every xml file with a namespace would need it. Am I missing it?Chlorothiazide
@clutch I am wondering the same thing. Anyone know a reason why?Gallimaufry
@rednaw, I'm not convinced split is better. Partition is guaranteed to return a tuple of exactly three elements, split can return an arbitrary number of elements. In practice it would be syntactically invalid to have anything but one closing curly brace, but still. I think partition is better.Huckleberry
15

Look at the effbot namespaces documentation/examples; specifically the parse_map function. It shows you how to add an ns_map attribute to each element which contains the prefix/URI mapping that applies to that specific element.

However, that adds the ns_map attribute to all the elements. For my needs, I wanted a single global map of all the namespaces in use, to make element lookup easier and avoid hardcoding URIs.

Here's what I came up with:

import xml.etree.ElementTree as ET

def parse_and_get_ns(file):
    events = "start", "start-ns"
    root = None
    ns = {}
    for event, elem in ET.iterparse(file, events):
        if event == "start-ns":
            if elem[0] in ns and ns[elem[0]] != elem[1]:
                # NOTE: It is perfectly valid to have the same prefix refer
                #     to different URI namespaces in different parts of the
                #     document. This exception serves as a reminder that this
                #     solution is not robust.    Use at your own peril.
                raise KeyError("Duplicate prefix with different URI found.")
            ns[elem[0]] = "{%s}" % elem[1]
        elif event == "start":
            if root is None:
                root = elem
    return ET.ElementTree(root), ns

With this you can parse an xml file and obtain a dict with the namespace mappings. So, if you have an xml file like the following ("my.xml"):

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
>
<feed>
  <item>
    <title>Foo</title>
    <dc:creator>Joe McGroin</dc:creator>
    <description>etc...</description>
  </item>
</feed>
</rss>

You will be able to use the XML namespaces and get info for elements like dc:creator:

>>> tree, ns = parse_and_get_ns("my.xml")
>>> ns
{u'content': '{http://purl.org/rss/1.0/modules/content/}',
u'dc': '{http://purl.org/dc/elements/1.1/}'}
>>> item = tree.find("/feed/item")
>>> item.findtext(ns['dc']+"creator")
'Joe McGroin'
Ligniform answered 14/4, 2012 at 6:33 Comment(2)
You answered my post at #13018524Former
I found a small bug in your code. I fixed it by setting ns[elem[0]] to elem[1] inside the for loop, because ET namespace dicts don't need the braces.Coop
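The variant described in the last comment can be sketched as follows, assuming the standard-library xml.etree.ElementTree: the map stores plain URIs (no braces) and is passed through the namespaces argument of find/findtext. The function name parse_and_get_ns_map and the inline sample feed are hypothetical, modelled on "my.xml" above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample feed, modelled on "my.xml" above.
XML = b"""<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
<feed>
  <item>
    <title>Foo</title>
    <dc:creator>Joe McGroin</dc:creator>
  </item>
</feed>
</rss>"""

def parse_and_get_ns_map(source):
    """Parse source and return (tree, {prefix: uri}) with brace-free URIs."""
    ns = {}
    root = None
    for event, elem in ET.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = elem  # start-ns yields a (prefix, uri) tuple
            ns[prefix] = uri    # later declarations of a prefix overwrite earlier ones
        elif root is None:
            root = elem         # the first "start" event is the root element
    return ET.ElementTree(root), ns

tree, ns = parse_and_get_ns_map(io.BytesIO(XML))
# Prefixed paths are resolved through the namespaces mapping.
creator = tree.find("feed/item").findtext("dc:creator", namespaces=ns)
print(creator)
```

Like the original, this keeps only one URI per prefix, so it is still not robust for documents that rebind a prefix in different subtrees.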
1

Try this:

import xml.etree.ElementTree as ET
import re
import sys

with open(sys.argv[1]) as f:
    root = ET.fromstring(f.read())
    xmlns = ''
    m = re.match(r'\{[^}]*\}', root.tag)  # leading '{uri}' part of the tag, if any
    if m:
        xmlns = m.group(0)
    print(root.find(xmlns + 'the_tag_you_want').text)
Warfore answered 19/10, 2018 at 8:17 Comment(0)
0
import xml.etree.ElementTree as ET
from io import BytesIO

# Assuming xml data comes from the web and saved to `response.content`

# Name spaces `xmlns` extracted from xml
namespaces = {
    node[0] if node[0] else 'atom': node[1] 
    for _, node in ET.iterparse(BytesIO(response.content), events=['start-ns'])
}

In my case, it was necessary to map the empty string (the default namespace) to "atom"; otherwise I'd get an error stating that "atom" wasn't found in the namespaces. You may need to use a different string: as far as I can tell there's no rhyme or reason to the name chosen for the default namespace, it just has to match the prefix used in your queries.
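A minimal end-to-end sketch of this approach, assuming the standard-library ElementTree. The content bytes below are a made-up stand-in for response.content, and the "atom" alias is just the prefix this example chooses to use in its queries.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Hypothetical stand-in for response.content.
content = b"""<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <entry><title>First</title></entry>
  <entry><title>Second</title></entry>
</feed>"""

# Collect prefix -> URI pairs; the default namespace arrives with an
# empty prefix, so give it an alias that queries can refer to.
namespaces = {
    prefix or "atom": uri
    for _, (prefix, uri) in ET.iterparse(BytesIO(content), events=["start-ns"])
}

root = ET.fromstring(content)
titles = [
    entry.findtext("atom:title", namespaces=namespaces)
    for entry in root.findall("atom:entry", namespaces)
]
print(titles)

# Any alias works as long as the queries use the same prefix.
alt = {"a": namespaces["atom"]}
print(len(root.findall("a:entry", alt)))
```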

Callahan answered 16/5 at 20:58 Comment(0)
