Python: ignoring namespaces in xml.etree.ElementTree?
How can I tell ElementTree to ignore namespaces in an XML file?

For example, I would prefer to query modelVersion (as in statement 1) rather than {http://maven.apache.org/POM/4.0.0}modelVersion (as in statement 2).

pom="""
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
</project>
"""

from xml.etree import ElementTree
ElementTree.register_namespace("","http://maven.apache.org/POM/4.0.0")
root = ElementTree.fromstring(pom)

print(1, root.findall('modelVersion'))
print(2, root.findall('{http://maven.apache.org/POM/4.0.0}modelVersion'))

1 []
2 [<Element '{http://maven.apache.org/POM/4.0.0}modelVersion' at 0x1006bff10>]
Sachet answered 4/12, 2015 at 6:55 Comment(4)
AFAIK there isn't an easy+clean way to do so, especially not if you're potentially dealing with multiple namespaces. There appears to be a duplicate question here, but I won't wield my dupehammer if you say that those approaches won't work for you (they kind of look like dirty hacks to me). - Theresa
Also, lxml might be worth looking into, but it's not part of the standard library. - Theresa
sadly I'm sending this to someone who can't install lxml. I hope the standard library incorporates it some day. I posted my current solution which makes me very sad coz one time I told my mom I was a professional programmer. :-/ - Sachet
see also: Python ElementTree module: How to ignore the namespace of XML files - Weaverbird
A
2

There appears to be no straightforward way to do this, so I'd simply wrap the find calls, e.g.

from xml.etree import ElementTree as ET

POM = """
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
         xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
</project>
"""

NSPS = {'foo' : "http://maven.apache.org/POM/4.0.0"}

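# Prepend the dummy 'foo' prefix so callers can pass plain tag names.
def findall(node, tag):
    return node.findall('foo:' + tag, NSPS)

root = ET.fromstring(POM)
# encoding='unicode' makes tostring() return str rather than bytes on Python 3.
print([ET.tostring(elem, encoding='unicode') for elem in findall(root, 'modelVersion')])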

output:

['<ns0:modelVersion xmlns:ns0="http://maven.apache.org/POM/4.0.0">4.0.0</ns0:modelVersion>\n']
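
If Python 3.8 or later is available, the wrapper may not need a dummy prefix at all. The sketch below assumes that version and relies on two documented ElementTree features: the `{*}` wildcard in tag names, and an empty-string key in the namespaces mapping acting as the default namespace. The names `findall_any_ns` and `MAVEN_NS` are made up for this illustration.

```python
from xml.etree import ElementTree as ET

POM = """
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
</project>
"""

MAVEN_NS = "http://maven.apache.org/POM/4.0.0"  # illustrative constant


def findall_any_ns(node, tag):
    """Hypothetical helper: match `tag` in any namespace (assumes Python 3.8+)."""
    return node.findall('{*}' + tag)


root = ET.fromstring(POM)

# Option 1: wildcard namespace, no knowledge of the URI needed.
wildcard_hits = findall_any_ns(root, 'modelVersion')

# Option 2: map the empty prefix to the URI so unprefixed names resolve to it.
default_ns_hits = root.findall('modelVersion', {'': MAVEN_NS})

print([e.text for e in wildcard_hits])
print([e.text for e in default_ns_hits])
```

The wildcard form is closest to "ignore the namespace"; the empty-prefix form still checks that the element really lives in the Maven namespace, which may be preferable when a document mixes several.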
Artistry answered 4/12, 2015 at 7:56 Comment(0)
S
1

Here's what I'm presently doing, which makes me incredibly confident that there's a better way.

$ cat pom.xml |
   tr '\n' ' ' |
   sed 's/<project [^>]*>/<project>/' |
   myprogram |
   sed 's/<project>/<project xmlns="http:\/\/maven.apache.org\/POM\/4.0.0" xmlns:xsi="http:\/\/www.w3.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/maven.apache.org\/POM\/4.0.0 http:\/\/maven.apache.org\/maven-v4_0_0.xsd">/'
Sachet answered 4/12, 2015 at 7:57 Comment(5)
Instead of sed'ing it in a pipe, you could patch the XML string in the Python script, or create a dummy namespace and a wrapper function (please see my answer). - Artistry
I like fixing it in the pipe coz then my actual program is tidy. If I can switch to a better xml package in the future I'll just be able to drop the stuff in the wrapper. - Sachet
Well - if you're already quite happy with your pipe - what exactly are we talking about then :)? - Artistry
lol, good question! I was hoping for an answer like "you dummy here's how to turn off the namespace weirdness" but in the absence of that I'm just hoping for the least bad alternative. For my case, that's keeping the python code clean and hiding the horrible horrible horrible code in the filter step. Although I'm trying hard to figure out how to deliver an lxml solution to my downstream peeps!! - Sachet
But again, if you want it both as clean as possible now and as invariant as possible with regard to replacing the xml module you import in the future, creating an adaptation layer like the one I sketched in my answer is the most natural, if not the only, method. It is best if the layer uses the xml module but does not inherit from it in any way: if it inherits, you build your app around the to-be-replaced interface, whereas if it only wraps, you populate an invariant interface tailored to your app. - Artistry
S
1

Here's the equivalent solution without using the shell. Basic idea:

  • translate <project junk...> to <project>
  • perform "clean" processing without worrying about the namespace
  • translate <project> back to <project junk...>

with the new code:

pom="""
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
</project>
"""
short_project="""<project>"""
long_project="""<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">"""

import re
from xml.etree import ElementTree

# eliminate namespace specs
pom = re.sub(r'<project [^>]*>', short_project, pom)

root = ElementTree.fromstring(pom)
ElementTree.dump(root)
print(1, root.findall('modelVersion'))  # matches now that the namespace is gone
print(2, root.findall('{http://maven.apache.org/POM/4.0.0}modelVersion'))  # empty now
mv = root.findall('modelVersion')

# restore the namespace specs; encoding='unicode' yields str (not bytes) on Python 3
pom = ElementTree.tostring(root, encoding='unicode')
pom = pom.replace(short_project, long_project)
Sachet answered 4/12, 2015 at 16:28 Comment(0)
M
0

Rather than ignoring namespaces, another approach is to remove them from the tree, so that there is nothing left to ignore. See nonagon's answer to this question (and my extension of it to cover namespaces on attributes): Python ElementTree module: How to ignore the namespace of XML files to locate matching element when using the method "find", "findall"
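
A minimal sketch of that idea, written from the description above rather than copied from the linked answers: walk the parsed tree and cut the `{uri}` part off every tag and attribute name. The helper name `strip_namespaces` is hypothetical.

```python
from xml.etree import ElementTree as ET

POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
</project>
"""


def strip_namespaces(root):
    """Hypothetical helper: drop the '{uri}' prefix from tags and attributes, in place."""
    for elem in root.iter():
        # ElementTree stores qualified names as '{uri}local'; keep only 'local'.
        if isinstance(elem.tag, str) and '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]
        # Copy the keys first, because the dict is modified inside the loop.
        for name in list(elem.attrib):
            if name.startswith('{'):
                elem.attrib[name.split('}', 1)[1]] = elem.attrib.pop(name)
    return root


root = strip_namespaces(ET.fromstring(POM))
print(root.tag)
print([e.text for e in root.findall('modelVersion')])
```

Note that this is lossy: serializing the stripped tree writes plain `<project>` without the original namespace declarations, and two attributes that differ only by namespace would collide.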

Mate answered 4/12, 2015 at 8:41 Comment(0)
