Simple DOM traversal in Python using xml.etree.ElementTree

Asked 15/1, 2014 at 19:26 Answered 15/1, 2014 at 19:45

E.g. consider parsing a pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

    <parent>
        <groupId>com.parent</groupId>
        <artifactId>parent</artifactId>
        <version>1.0-SNAPSHOT</version>
        <relativePath>../pom.xml</relativePath>
    </parent>

    <modelVersion>2.0.0</modelVersion>
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>Some Module</name>
    ...

Code:

import xml.etree.ElementTree as ET

tree = ET.parse(pom)  # pom: path to the pom.xml file
root = tree.getroot()

groupId = root.find("groupId")
artifactId = root.find("artifactId")

Both groupId and artifactId are None. Why, when they are direct children of the root? I also tried calling find on the tree instead of the root (groupId = tree.find("groupId")), but that didn't change anything.

Beecham answered 15/1, 2014 at 19:26 Comment(1)

possible duplicate of Parsing XML with namespace in Python ElementTree – Selfpity 15/1, 2014 at 19:38

The problem is that you don't have a child named groupId; you have a child named {http://maven.apache.org/POM/4.0.0}groupId. etree doesn't ignore XML namespaces; it uses "universal names", in which the namespace URI is part of the tag. See Working with Namespaces and Qualified Names in the effbot docs.
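
A minimal sketch of what this looks like in practice, assuming a trimmed-down copy of the POM from the question is held in a string. POM_XML, NS and the prefix "m" are names invented for this illustration, not anything from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed-down POM from the question, inlined so the sketch is self-contained.
POM_XML = """<project xmlns="http://maven.apache.org/POM/4.0.0">
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
</project>"""

root = ET.fromstring(POM_XML)

# The tag carries the namespace URI in braces (the "universal name").
print(root.tag)  # {http://maven.apache.org/POM/4.0.0}project

# A bare name does not match, which is what the question ran into.
assert root.find("groupId") is None

# Option 1: spell out the universal name.
group_id = root.find("{http://maven.apache.org/POM/4.0.0}groupId")

# Option 2: pass a prefix-to-URI mapping; the prefix "m" is arbitrary.
NS = {"m": "http://maven.apache.org/POM/4.0.0"}
artifact_id = root.find("m:artifactId", NS)

print(group_id.text, artifact_id.text)
```

The prefix mapping tends to read better once more than a couple of lookups are involved, since the URI is written only once.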

Eelpout answered 15/1, 2014 at 19:30 Comment(6)

Can I somehow make it ignore the namespace? – Beecham 15/1, 2014 at 19:31

@amphibient: Not directly, no. If you read the doc page I linked, it shows you the various ways of dealing with this correctly. – Eelpout 15/1, 2014 at 19:32

@amphibient: It's not retarded; XML that uses namespaces to resolve ambiguity problems would be broken if you ignored them. (XML as a whole is kind of retarded, but that's a different story…) For quick&dirty scripts, you want a quick&dirty parser like BeautifulSoup, not a parser that tries to be correct. – Eelpout 15/1, 2014 at 19:35

@amphibient: Anyway, I could give you code to solve your problem, but if you don't actually understand namespaces and universal names, that code won't do you any good, so you pretty much have to read that document. If you have any questions afterward, I can help. – Eelpout 15/1, 2014 at 19:35

What I consider "retarded" is the inability to disregard the namespace and use it as though the root were simply <project> and not <project xmlns="...">. Why wouldn't there be a feature to ignore it for simpler processing? – Beecham 15/1, 2014 at 19:38

@amphibient: Because that would be incorrect as often as it would be useful. It's like saying Python is retarded for not letting you write 'answer: ' + 42. Sure, that would sometimes be useful, but it would also be an attractive nuisance (as languages like PHP and Tcl prove). – Eelpout 15/1, 2014 at 19:50

Just to expand on abarnert's comment about BeautifulSoup: if you DO just want a quick and dirty solution to the problem, matching on any namespace is probably the fastest way to go about it. I have implemented this for a personal script. Note that the call below is a DOM method (provided by the standard library's xml.dom.minidom, for example) rather than part of the bs4 API. With a parsed DOM document you can look elements up with

elements = dom.getElementsByTagNameNS('*','elementname')

This matches elements named elementname in ANY namespace and returns a list of them, which is handy if you know you've only got one namespace in the file so there's no ambiguity.
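
A runnable sketch of the any-namespace lookup, assuming the standard library's xml.dom.minidom as the DOM implementation and a shortened copy of the POM from the question (POM_XML is a name made up for the example). It also shows ElementTree's own "{*}" wildcard, which is available from Python 3.8 onward.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Shortened POM from the question, inlined so the sketch is self-contained.
POM_XML = """<project xmlns="http://maven.apache.org/POM/4.0.0">
    <parent>
        <groupId>com.parent</groupId>
    </parent>
    <groupId>com.parent.somemodule</groupId>
</project>"""

# DOM API: "*" as the namespace URI matches elements in any namespace.
# The search covers all descendants, so the <parent> groupId is found too.
dom = xml.dom.minidom.parseString(POM_XML)
elements = dom.getElementsByTagNameNS("*", "groupId")
dom_values = [el.firstChild.data for el in elements]

# ElementTree (Python 3.8+): "{*}" matches any namespace.
# find() only looks at direct children of the element it is called on.
root = ET.fromstring(POM_XML)
group_id = root.find("{*}groupId")

print(dom_values)
print(group_id.text)
```

The two calls differ in scope: the DOM lookup walks the whole document, while root.find only inspects direct children, so the latter is closer to what the question was trying to do.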

Breath answered 15/1, 2014 at 19:45 Comment(0)
