How to preserve namespaces when parsing xml via ElementTree in Python
Suppose I have the following XML, which I want to modify using Python's ElementTree:

<root xmlns:prefix="URI">
  <child prefix:name="***"/>
  ...
</root> 

I modify the XML file like this:

import xml.etree.ElementTree as ET
tree = ET.parse('filename.xml')
# XML modification here
# save the modifications
tree.write('filename.xml')

Afterwards, the XML file looks like this:

<root xmlns:ns0="URI">
  <child ns0:name="***"/>
  ...
</root>

As you can see, the namespace prefix changed to ns0. I'm aware that ET.register_namespace() can be used to control the prefix.

The problems with ET.register_namespace() are:

  1. You need to know the prefix and URI in advance.
  2. It cannot be used with a default namespace.

For example, if the XML looks like:

<root xmlns="http://uri">
    <child name="name">
    ...
    </child>
</root>

It will be transformed into something like:

<ns0:root xmlns:ns0="http://uri">
    <ns0:child name="name">
    ...
    </ns0:child>
</ns0:root>

As you can see, the default namespace is changed to ns0.

Is there any way to solve this problem with ElementTree?

Underplot answered 30/1, 2019 at 11:15 Comment(8)
Possible duplicate of xml.etree.ElementTree - Trouble setting xmlns = '...' - Coon
The duplicate link clearly uses ET.register_namespace(.... Edit your question to include a minimal reproducible example showing how you use it. - Coon
@Coon It's not about preserving the namespace and didn't help me. The namespace should not be hard-coded; it can be xmlns:prefix="URI" with any prefix and URI. - Underplot
The only way to preserve the namespace prefix with ElementTree is by using register_namespace(). If you don't like that, try lxml instead. - Abrasive
@mzin You need to know the prefix and URI when using register_namespace(). As I said, I don't want to hard-code the namespace. Is there any way to do this with ElementTree? - Underplot
@Coon Edited the question to clarify the problem. - Underplot
@AmirRezazadeh: Read the lxml documentation on namespaces: "lxml.etree allows you to look up the current namespaces defined for a node through the .nsmap property." - Coon
See https://mcmap.net/q/137836/-get-the-namespaces-from-xml-with-python-elementtree for a way to get the namespaces in the document. - Abrasive

ElementTree replaces the prefix of any namespace that has not been registered with ET.register_namespace. To preserve a namespace prefix, register it before writing your modifications to a file. The following function does the job by registering, globally, every namespace declared in the file:

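def register_all_namespaces(filename):
    # Collect every (prefix, URI) pair declared in the file.
    namespaces = dict(node for _, node in ET.iterparse(filename, events=['start-ns']))
    # Register each pair globally so that serialization reuses the original prefixes.
    for prefix, uri in namespaces.items():
        ET.register_namespace(prefix, uri)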

Call this function before writing the tree, so that the namespaces remain unchanged. Because register_namespace only affects serialization, it can run either before or after ET.parse:

import xml.etree.ElementTree as ET
register_all_namespaces('filename.xml')
tree = ET.parse('filename.xml')
# XML modification here
# save the modifications
tree.write('filename.xml')
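The same approach also appears to cover the default-namespace case from the question, because a declaration like xmlns="http://uri" is reported by the start-ns event with an empty prefix, and an empty prefix can be passed to ET.register_namespace. Below is a minimal sketch that can be run without a file on disk; the SAMPLE document and the use of io.BytesIO are assumptions made for illustration only.

```python
import io
import xml.etree.ElementTree as ET

def register_all_namespaces(source):
    # `source` may be a filename or a binary file object.
    for _, (prefix, uri) in ET.iterparse(source, events=["start-ns"]):
        # The default namespace is reported with an empty prefix ("").
        ET.register_namespace(prefix, uri)

# Hypothetical sample document with a default namespace (illustration only).
SAMPLE = b'<root xmlns="http://uri"><child name="name"/></root>'

register_all_namespaces(io.BytesIO(SAMPLE))
root = ET.fromstring(SAMPLE)
root[0].set("name", "changed")  # XML modification here
output = ET.tostring(root, encoding="unicode")
print(output)
```

Keep in mind that the registration is global to the process, so it also affects any other document serialized afterwards.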
Underplot answered 2/2, 2019 at 7:51 Comment(5)
This solution is much better than the ones I have seen on many other questions on the same topic. Thanks for sharing it. - Vachel
Does this mean the XML needs to be parsed twice? Or can I somehow get the ElementTree out of this process while I do it? - Miffy
@Starwarswii Yes. If you want more control over that, I think you can use XMLPullParser with the start-ns event, fetching the namespaces and then calling ET.register_namespace. - Underplot
Thank you for this answer. I was pulling my hair out with my namespaces getting replaced after a simple tweak to the XML. - Ikkela
It does not matter if register_namespace comes before or after ET.parse. register_namespace only affects serialization, not parsing. - Abrasive
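On the question of parsing twice: one possible way to avoid the second pass is to let ET.iterparse build the tree while it reports the start-ns events, and then take the finished tree from the iterator's root attribute. This is only a sketch under that assumption; parse_preserving_prefixes and SAMPLE are hypothetical names introduced here, not part of the answer above.

```python
import io
import xml.etree.ElementTree as ET

def parse_preserving_prefixes(source):
    # Hypothetical helper: parse once, registering namespaces along the way.
    events = ET.iterparse(source, events=("start-ns",))
    for _, (prefix, uri) in events:
        ET.register_namespace(prefix, uri)
    # After the iterator is exhausted, `root` refers to the root element.
    return ET.ElementTree(events.root)

# Hypothetical sample document with a prefixed namespace (illustration only).
SAMPLE = b'<root xmlns:prefix="URI"><child prefix:name="***"/></root>'

tree = parse_preserving_prefixes(io.BytesIO(SAMPLE))
out = ET.tostring(tree.getroot(), encoding="unicode")
print(out)
```

The returned object is a regular ElementTree, so tree.write('filename.xml') can be used on it as in the answer.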
