How can I make my code more readable and DRYer when working with XML namespaces in Python?
Asked Answered
F

2

9

Python's built-in xml.etree package supports parsing XML files with namespaces, but namespace prefixes get expanded to the full URI enclosed in brackets. So in the example file in the official documentation:

<actors xmlns:fictional="http://characters.example.com"
    xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    ...

The actor tag gets expanded to {http://people.example.com}actor and fictional:character to {http://characters.example.com}character.

I can see how this makes everything very explicit and reduces ambiguity (the file could have the same namespace with a different prefix, etc.) but it is very cumbersome to work with. The Element.find() method and others allow passing a dict mapping prefixes to namespace URIs so I can still do element.find('fictional:character', nsmap) but to my knowledge there is nothing similar for tag attributes. This leads to annoying stuff like element.attrib['{{{}}}attrname'.format(nsmap['prefix'])].

The popular lxml package provides the same API with a few extensions, one of which is an nsmap property on the elements that they inherit from the document. However none of the methods seem to actually make use of it, so I still have to do element.find('fictional:character', element.nsmap) which is just unnecessarily repetitive to type out every time. It also still doesn't work with attributes.

Luckily lxml supports subclassing BaseElement, so I just made one with a p (for prefix) property that has the same API but automatically uses namespace prefixes using the element's nsmap (Edit: likely best to assign a custom nsmap defined in code). So I just do element.p.find('fictional:character') or element.p.attrib['prefix:attrname'], which much less repetitive and I think way more readable.

I just feel like I'm really missing something though, it really feels like this should really already be a feature of lxml if not the builtin etree package. Am I somehow doing this wrong?

Foudroyant answered 16/5, 2016 at 21:14 Comment(8)
That lxml doesn't use the element's nsmap by default follows from the design principles underlying XML: The semantic meaning of a document shouldn't depend on what its prefixes are named, so if code manipulating a document is depending on the prefixes rather than the namespaces they map to, that code is Doing It Wrong.Wife
...using a nsmap defined in your code, rather than in your document, thus ensures that your prefixes actually map to the targets you expect (not to mention avoids a dependency on a document using the precise prefixes you expect!).Wife
BTW, a title that keeps further afield of the "rant in disguise" class of off-topic questions might be something like "How can I avoid repeating nsmap parameters using lxml in Python?"Wife
@CharlesDuffy I did mention how this approach avoids the problem of the XML using different prefixes, but even if I defined my own nsmap I would still need to pass it each time because applying it to an lxml.Etree element object still doesn't actually do anything. Passing an nsmap to find() each time isn't too bad, but element.attrib['{{{}}}attrname'.format(nsmap['prefix'])] all over the place is pretty awful regardless of whether I have defined a static nsmap in code or not.Foudroyant
I'm not sure that any of the above argues effectively against changing the title to something less rant-y and more closely tied to your actual question. (That you have a long answer from a moderately-high-rep user that doesn't actually meet your needs is also an experimental data point, while not conclusive, supporting the argument that the present title is leading folks down the wrong path re: understanding your actual intent).Wife
@CharlesDuffy I also do not see this as a rant in any way, I clearly asked a constructive question (see the last paragraph). Python is advertised as a "batteries-included" language, so when I find myself having to write utility functions to make some very common operations bearable (even using the most popular independent library for the job) it seems like something is wrong. The problem could be a. the library was badly/lazily designed or not a priority, b. there are technical reasons preventing what I want to do, or c. I have missed something. Those last two would be good to know.Foudroyant
@Jumpy89, there's a reason I'm suggesting changes to the title, not the question. The title sets the tone and interpretation for other content -- a rant-y title followed by a constructive question gets the body text read in a less-constructive light. "X seems difficult" is inherently subjective, even though "X requires a great deal of repetitive code" is not. Anyhow, "How can I do thing-X without repeating myself?" is an unambiguously constructive and useful StackOverflow question, whereas "why is this difficult?" or "am I missing something?" is less so.Wife
@CharlesDuffy I guess I interpreted that as you attacking/not understanding my question - yes I do see how the title alone appears rant-y. I have changed it to something I think is clearer.Foudroyant
O
7

Is it possible to get rid of the namespace mapping?

Do you need to pass it as a parameter into each function call? An option would be to set the prefixes to be used at the XML document in a property.

That's fine until you pass the XML document into a 3rd party function. That function wants to use prefixes as well, so it sets the property to something else, because it does not know what you set it to.

As soon as you get the XML document back, it was modified, so your prefixes don't work any more.

All in all: no, it's not safe and therefore it's good as it is.

This design does not only exist in Python, it also exists in .NET. The SelectNodes() [MSDN] can be used if you don't need prefixes. But as soon as there's a prefix present, it'll throw an exception. Therefore, you have to use the overloaded SelectNodes() [MSDN] which uses an XmlNamespaceManager as a parameter.

XPath as a solution

I suggest to learn XPath (lxml specific link), where you can use prefixes. Since this may be version specific, let me say I ran this code with Python 2.7 x64 and lxml 3.6.0 (I'm not too familiar with Python, so this may not be the cleanest code, but it serves well as a demonstration):

from lxml import etree as ET
from pprint import pprint
data = """<?xml version="1.0"?>
<d:data xmlns:d="dns">
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor d:name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</d:data>"""
root = ET.fromstring(data)
my_namespaces = {'x':'dns'}
xp=root.xpath("/x:data/country/neighbor/@x:name", namespaces=my_namespaces)
pprint(xp)
xp=root.xpath("//@x:name", namespaces=my_namespaces)
pprint(xp)
xp=root.xpath("/x:data/country/neighbor/@name", namespaces=my_namespaces)
pprint(xp)

The output is

C:\Python27x64\python.exe E:/xpath.py
['Austria']
['Austria']
['Switzerland', 'Malaysia']

Process finished with exit code 0

Note how well XPath solved the mapping from x prefix in the namespace table to the d prefix in the XML document.

This eliminates the really awful to read element.attrib['{{{}}}attrname'.format(nsmap['prefix'])].

Short (and incomplete) XPath introduction

To select an element, write /element, optionally use a prefix.

xp=root.xpath("/x:data", namespaces=my_namespaces)

To select an attribute, write /@attribute, optionally use a prefix.

#See example above

To navigate down, concatenate several elements. Use // if you don't know items in between. To move up, use /... Attributes must be last if not followed by /...

xp=root.xpath("/x:data/country/neighbor/@x:name/..", namespaces=my_namespaces)

To use a condition, write it in square brackets. /element[@attribute] means: select all elements that have this attribute. /element[@attribute='value'] means: select all elements that have this attribute and the attribute has a specific value. /element[./subelement] means: select all elements that have a subelement with a specific name. Optionally use prefixes anywhere.

xp=root.xpath("/x:data/country[./neighbor[@name='Switzerland']]/@name", namespaces=my_namespaces)

There's much more to discover, like text(), various ways of sibling selection and even functions.

About the 'why'

The original question title which was

Why does working with XML namespaces seem so difficult in Python?

For some users, they just don't understand the concept. If the user understands the concept, maybe the developer didn't. And perhaps it was just one option out of many and the decision was to go that direction. The only person who could give an answer on the "why" part in such a case would be the developer himself.

References

Octroi answered 2/6, 2016 at 21:11 Comment(5)
If you read the OP's text more closely, they aren't actually asking a "why" question; rather, their real question might be better posed as "How can I avoid repeating nsmap parameters using ElementTree in Python?"Wife
Thanks Thomas, I was aware of xpath but mostly avoid it for a few reasons - I still have to pass my nsmap each time (that's ok, I can just use functools.partial), it always returns a list where I frequently want a single result, and mostly it seems to take significantly longer on large files than a few lines of find() and other methods (despite being less verbose). It is extremely useful for more complicated queries, but for the simple stuff I think I will stick to using my own utility functions.Foudroyant
Since there doesn't quite seem to be a good alternative to what I have been doing I'm going to mark this as the accepted answer because I think an intro to xpath would likely be good information to others coming here with similar problems.Foudroyant
@Jumpy89: it'll only work if you pass the namespace mapping. That's similar in .NET.. If you use namespaces, you have to use that method. If you use namespaces and omit the namespace manager, it'll throw an exception. So I guess it's normal...Octroi
Thanks Thomas. Both you and @Parfait had excellent answers that I think will be useful for others. I chose to award Thomas's answer because it was less ugly than the OP's code but always correct, whereas Parfait's answer was very clean but could lead to ambiguity in some cases if not used with care. Both answers require learning a something new. XPath for this one and XLST for Parfait's. Again, both are really good answers.Osis
F
3

If you need to avoid repeating nsmap parameters using ElementTree in Python, consider transforming your XML with XSLT to remove namespaces and return local element names. And Python's lxml can run XSLT 1.0 scripts.

As information, XSLT is a special-purpose declarative language (same family as XPath but interacts with whole documents) used specifically to transform XML sources. In fact, XSLT scripts are well-formed XML documents! And removing namespaces is an often used task for end user needs.

Consider the following with XML and XSLT embedded as strings (but each can be parsed from file). Once transformed, you can run .findall(), iter(), and .xpath() on the transformed new tree object without need of defining namespace prefixes:

Script

import lxml.etree as ET

# LOAD XML AND XSL STRINGS
xmlStr = '''
         <actors xmlns:fictional="http://characters.example.com"
                 xmlns="http://people.example.com">
             <actor>
                 <name>John Cleese</name>
                 <fictional:character>Lancelot</fictional:character>
                 <fictional:character>Archie Leach</fictional:character>
             </actor>
         </actors>
         '''
dom = ET.fromstring(xmlStr)

xslStr = '''
        <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
        <xsl:strip-space elements="*"/>

          <xsl:template match="@*|node()">
            <xsl:element name="{local-name()}">
              <xsl:apply-templates select="@*|node()"/>
            </xsl:element>
          </xsl:template>

          <xsl:template match="text()">
            <xsl:copy/>
          </xsl:template>

        </xsl:transform>
        '''
xslt = ET.fromstring(xslStr)

# TRANSFORM XML
transform = ET.XSLT(xslt)
newdom = transform(dom)

# OUTPUT AND PARSE
print(str(newdom))

for i in newdom.findall('//character'):
    print(i.text)

for i in newdom.iter('character'):
    print(i.text)

for i in newdom.xpath('//character'):
    print(i.text)

Output

<?xml version="1.0"?>
<actors>
  <actor>
    <name>John Cleese</name>
    <character>Lancelot</character>
    <character>Archie Leach</character>
  </actor>
</actors>

Lancelot
Archie Leach
Lancelot
Archie Leach
Lancelot
Archie Leach
Frequency answered 4/6, 2016 at 15:11 Comment(2)
There are cases where it is not possible or at least not trivial to remove namespaces. Consider an excel:table of furniture:tables. In your case, selecting a table in the resulting XML will result in a different node count.Octroi
Indeed @ThomasWeller. XSLT can even modify element names to concatenate namespaces and local names or by other uniqueness. Or use result but run xpath/find parsing by <table> levels or position() for distinction. I offer OP an alternative if namespaces are too much a challenge.Frequency

© 2022 - 2024 — McMap. All rights reserved.