Simple DOM traversal in Python using xml.etree.ElementTree

For example, consider parsing a pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

    <parent>
        <groupId>com.parent</groupId>
        <artifactId>parent</artifactId>
        <version>1.0-SNAPSHOT</version>
        <relativePath>../pom.xml</relativePath>
    </parent>

    <modelVersion>2.0.0</modelVersion>
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>Some Module</name>
    ...

Code:

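import xml.etree.ElementTree as ET

pom = "pom.xml"  # path to the POM file shown above
tree = ET.parse(pom)
root = tree.getroot()

# Look up direct children of <project> by their bare tag names
groupId = root.find("groupId")
artifactId = root.find("artifactId")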

Both groupId and artifactId are None. Why, when they are direct children of the root? I tried replacing root with tree (groupId = tree.find("groupId")), but that didn't change anything.

Beecham answered 15/1, 2014 at 19:26 Comment(1)
Possible duplicate of "Parsing XML with namespace in Python ElementTree" - Selfpity

The problem is that you don't have a child named groupId; you have a child named {http://maven.apache.org/POM/4.0.0}groupId. etree doesn't ignore XML namespaces. It uses "universal names" of the form {namespace-uri}localname, so a default namespace declared on the root applies to every unprefixed element below it. See "Working with Namespaces and Qualified Names" in the effbot docs.
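A minimal sketch of the two usual ways to spell the qualified name with ElementTree. It assumes a trimmed-down copy of the POM from the question, held inline as a string; the names SAMPLE_POM, POM_NS and the prefix m are illustrative and not part of the original post.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns attribute of <project> in the question.
POM_NS = "http://maven.apache.org/POM/4.0.0"

# Hypothetical, shortened version of the POM, inlined so the sketch runs as-is.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
    <parent>
        <groupId>com.parent</groupId>
        <artifactId>parent</artifactId>
    </parent>
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
</project>"""

root = ET.fromstring(SAMPLE_POM)

# The tag carries the namespace in "universal name" form.
print(root.tag)  # {http://maven.apache.org/POM/4.0.0}project

# A bare tag name matches nothing, exactly as in the question.
unqualified = root.find("groupId")

# Option 1: write the universal name out in full.
group_id = root.find("{%s}groupId" % POM_NS)

# Option 2: pass a prefix-to-URI mapping and use the prefix in the path.
ns = {"m": POM_NS}
artifact_id = root.find("m:artifactId", ns)

print(unqualified, group_id.text, artifact_id.text)
```

The prefix in the mapping only exists for the duration of the call; it does not have to match any prefix used in the file itself.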

Eelpout answered 15/1, 2014 at 19:30 Comment(6)
Can I somehow make it ignore the namespace? - Beecham
@amphibient: Not directly, no. If you read the doc page I linked, it shows you the various ways of dealing with this correctly. - Eelpout
@amphibient: It's not retarded; XML that uses namespaces to resolve ambiguity problems would be broken if you ignored them. (XML as a whole is kind of retarded, but that's a different story…) For quick&dirty scripts, you want a quick&dirty parser like BeautifulSoup, not a parser that tries to be correct. - Eelpout
@amphibient: Anyway, I could give you code to solve your problem, but if you don't actually understand namespaces and universal names, that code won't do you any good, so you pretty much have to read that document. If you have any questions afterward, I can help. - Eelpout
What I consider "retarded" is the inability to disregard the namespace and use it as though the root were simply <project> and not <project xmlns="...">. Why wouldn't there be a feature to ignore it for simpler processing? - Beecham
@amphibient: Because that would be incorrect as often as it would be useful. It's like saying Python is retarded for not letting you write 'answer: ' + 42. Sure, that would sometimes be useful, but it would also be an attractive nuisance (as languages like PHP and Tcl prove). - Eelpout
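For a throwaway script where only one namespace is in play, one possible workaround is to strip the namespace part from every tag after parsing. This is a rough sketch, not something the commenters proposed: strip_namespaces and SAMPLE_POM are hypothetical names, and the approach silently merges elements that differ only by namespace, which is exactly the ambiguity the comments above warn about.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the POM from the question.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
    <parent>
        <groupId>com.parent</groupId>
    </parent>
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
</project>"""


def strip_namespaces(root):
    """Rewrite every '{uri}local' element tag in place to plain 'local'.

    Attribute names are left untouched; only element tags are rewritten.
    """
    for element in root.iter():
        tag = element.tag
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(tag, str) and tag.startswith("{"):
            element.tag = tag.split("}", 1)[1]
    return root


root = strip_namespaces(ET.fromstring(SAMPLE_POM))

# Bare names now work, because the tags no longer carry a namespace.
group_id = root.find("groupId")
artifact_id = root.find("artifactId")
print(root.tag, group_id.text, artifact_id.text)
```

On Python 3.8 and later, find() and findall() also accept a wildcard such as "{*}groupId", which matches the tag in any namespace without modifying the tree.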

Just to expand on abarnert's comment about BeautifulSoup: if you DO just want a quick and dirty solution to the problem, this is probably the fastest way to go about it. I have implemented this for a personal script. Note that getElementsByTagNameNS is a DOM method (available, for example, on xml.dom.minidom documents) rather than part of bs4 itself. With a DOM object you can search the tree with

element = dom.getElementsByTagNameNS('*','elementname')

This returns a list of all elements with that local name in ANY namespace, anywhere in the document. That is handy if you know the file only uses one namespace, so there's no ambiguity.
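A small sketch of that call using the standard library's xml.dom.minidom, which is an assumption about the DOM implementation, since the answer does not say how dom was built. SAMPLE_POM is a hypothetical, shortened copy of the POM from the question.

```python
from xml.dom import minidom

# Hypothetical, shortened version of the POM from the question.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
    <parent>
        <groupId>com.parent</groupId>
    </parent>
    <groupId>com.parent.somemodule</groupId>
</project>"""

dom = minidom.parseString(SAMPLE_POM)

# '*' as the namespace URI matches any namespace; the result is a list of
# every matching descendant in document order, not only direct children.
matches = dom.getElementsByTagNameNS("*", "groupId")
all_group_ids = [node.firstChild.data for node in matches]

# To mimic root.find("groupId"), keep only direct children of the root element.
top_level = [
    node.firstChild.data
    for node in matches
    if node.parentNode is dom.documentElement
]

print(all_group_ids)
print(top_level)
```

Because the search covers all descendants, the parent's groupId is returned as well, so filtering on parentNode matters when only the project's own value is wanted.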

Breath answered 15/1, 2014 at 19:45 Comment(0)
