Saving XML with ElementTree in Python does not retain namespaces: it adds ns0/ns1 prefixes and removes xmlns declarations
I see there are similar questions here, but none of them has fully helped me. I've also looked at the official documentation on namespaces but can't find anything that really helps; perhaps I'm just too new to XML. I understand that I may need to create my own namespace dictionary. Either way, here is my situation:

I am getting a result from an API call; it returns XML that is stored as a string in my Python application.

What I'm trying to accomplish is to take this XML, swap out one small value (the b:string value under ConditionValue/Default, but that's irrelevant to this question), and then save it as a string to send later in a REST POST call.

The source XML looks like this:

<Context xmlns="http://Test.the.Sdk/2010/07" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<xmlns i:nil="true" xmlns="http://schema.test.org/2004/07/Test.Soa.Vocab" xmlns:a="http://schema.test.org/2004/07/System.Xml.Serialize"/>
<Conditions xmlns:a="http://schema.test.org/2004/07/Test.Soa.Vocab">
    <a:Condition>
        <a:xmlns i:nil="true" xmlns:b="http://schema.test.org/2004/07/System.Xml.Serialize"/>
        <Identifier>a23aacaf-9b6b-424f-92bb-5ab71505e3bc</Identifier>
        <Name>Code</Name>
        <ParameterSelections/>
        <ParameterSetCollections/>
        <Parameters/>
        <Summary i:nil="true"/>
        <Instance>25486d6c-36ba-4ab2-9fa6-0dbafbcf0389</Instance>
        <ConditionValue>
            <ComplexValue i:nil="true"/>
            <Text i:nil="true" xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays"/>
            <Default>
                <ComplexValue i:nil="true"/>
                <Text xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
                    <b:string>NULLCODE</b:string>
                </Text>
            </Default>
        </ConditionValue>
        <TypeCode>String</TypeCode>
    </a:Condition>
    <a:Condition>
        <a:xmlns i:nil="true" xmlns:b="http://schema.test.org/2004/07/System.Xml.Serialize"/>
        <Identifier>0af860f6-5611-4a23-96dc-eb3863975529</Identifier>
        <Name>Content Type</Name>
        <ParameterSelections/>
        <ParameterSetCollections/>
        <Parameters/>
        <Summary i:nil="true"/>
        <Instance>6364ec20-306a-4cab-aabc-8ec65c0903c9</Instance>
        <ConditionValue>
            <ComplexValue i:nil="true"/>
            <Text i:nil="true" xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays"/>
            <Default>
                <ComplexValue i:nil="true"/>
                <Text xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
                    <b:string>Standard</b:string>
                </Text>
            </Default>
        </ConditionValue>
        <TypeCode>String</TypeCode>
    </a:Condition>
</Conditions>
</Context>

My job is to swap out one of the values, retaining the entire structure of the source, and use this to submit a POST later on in the application.

The problem that I am having is that when it saves to a string or to a file, it totally messes up the namespaces:

<ns0:Context xmlns:ns0="http://Test.the.Sdk/2010/07" xmlns:ns1="http://schema.test.org/2004/07/Test.Soa.Vocab" xmlns:ns3="http://schemas.microsoft.com/2003/10/Serialization/Arrays" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:xmlns xsi:nil="true" />
<ns0:Conditions>
<ns1:Condition>
<ns1:xmlns xsi:nil="true" />
<ns0:Identifier>a23aacaf-9b6b-424f-92bb-5ab71505e3bc</ns0:Identifier>
<ns0:Name>Code</ns0:Name>
<ns0:ParameterSelections />
<ns0:ParameterSetCollections />
<ns0:Parameters />
<ns0:Summary xsi:nil="true" />
<ns0:Instance>25486d6c-36ba-4ab2-9fa6-0dbafbcf0389</ns0:Instance>
<ns0:ConditionValue>
<ns0:ComplexValue xsi:nil="true" />
<ns0:Text xsi:nil="true" />
<ns0:Default>
<ns0:ComplexValue xsi:nil="true" />
<ns0:Text>
<ns3:string>NULLCODE</ns3:string>
</ns0:Text>
</ns0:Default>
</ns0:ConditionValue>
<ns0:TypeCode>String</ns0:TypeCode>
</ns1:Condition>
<ns1:Condition>
<ns1:xmlns xsi:nil="true" />
<ns0:Identifier>0af860f6-5611-4a23-96dc-eb3863975529</ns0:Identifier>
<ns0:Name>Content Type</ns0:Name>
<ns0:ParameterSelections />
<ns0:ParameterSetCollections />
<ns0:Parameters />
<ns0:Summary xsi:nil="true" />
<ns0:Instance>6364ec20-306a-4cab-aabc-8ec65c0903c9</ns0:Instance>
<ns0:ConditionValue>
<ns0:ComplexValue xsi:nil="true" />
<ns0:Text xsi:nil="true" />
<ns0:Default>
<ns0:ComplexValue xsi:nil="true" />
<ns0:Text>
<ns3:string>Standard</ns3:string>
</ns0:Text>
</ns0:Default>
</ns0:ConditionValue>
<ns0:TypeCode>String</ns0:TypeCode>
</ns1:Condition>
</ns0:Conditions>
</ns0:Context>

I've narrowed the code down to the most basic form and I'm still getting the same results so it's not anything to do with how I'm manipulating the file normally:

import xml.etree.ElementTree as ET
import requests

get_context_xml = 'http://localhost/testapi/returnxml'  # returns the first XML example above
source_context_xml = requests.get(get_context_xml)

# fromstring() needs the response body, not the Response object.
Tree = ET.fromstring(source_context_xml.content)

# Ensure the original namespaces are intact.
for Conditions in Tree.iter('{http://schema.test.org/2004/07/Test.Soa.Vocab}Condition'):
    print("success")

# tostring() returns bytes by default on Python 3, so open the file in binary mode.
with open('/home/memyself/output.xml', 'wb') as f:
    f.write(ET.tostring(Tree))
Susann answered 4/8, 2015 at 9:31 Comment(1)
You tagged the question with "lxml". Did you try it? I think most if not all of the problems will go away if you do. lxml is similar to ElementTree, but leaves your namespaces alone.Aerogram
L
21

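You need to register the prefix and the namespace to avoid the generated default prefixes (ns0, ns1, etc.). The registry is consulted when the tree is serialized, so registering before you call fromstring() (reading the XML) is the simplest way to make sure it is in place in time.

You can use the ET.register_namespace() function for that. Example: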

ET.register_namespace('<prefix>','http://Test.the.Sdk/2010/07')
ET.register_namespace('a','http://schema.test.org/2004/07/Test.Soa.Vocab')

You can leave the <prefix> empty if you do not want a prefix.


Example/Demo -

>>> import xml.etree.ElementTree as ET
>>> r = ET.fromstring('<a xmlns="blah">a</a>')
>>> ET.tostring(r)
b'<ns0:a xmlns:ns0="blah">a</ns0:a>'
>>> ET.register_namespace('','blah')
>>> r = ET.fromstring('<a xmlns="blah">a</a>')
>>> ET.tostring(r)
b'<a xmlns="blah">a</a>'
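
Applied to a document shaped like the one in the question, the same idea might look like the sketch below. The XML is a cut-down stand-in for the API response, the prefixes sdk, a, b and i are arbitrary choices, and the replacement value NEWCODE is made up. The default namespace gets a real prefix here because the comments under this answer report that an empty prefix did not work for the asker's document.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the API response in the question (assumption).
SOURCE = (
    '<Context xmlns="http://Test.the.Sdk/2010/07" '
    'xmlns:i="http://www.w3.org/2001/XMLSchema-instance">'
    '<Conditions xmlns:a="http://schema.test.org/2004/07/Test.Soa.Vocab">'
    '<a:Condition><Name>Code</Name><ConditionValue><Default>'
    '<Text xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays">'
    '<b:string>NULLCODE</b:string></Text>'
    '</Default></ConditionValue></a:Condition>'
    '</Conditions></Context>'
)

# One prefix per URI; the names are arbitrary as long as they are unique.
NS = {
    'sdk': 'http://Test.the.Sdk/2010/07',
    'a': 'http://schema.test.org/2004/07/Test.Soa.Vocab',
    'b': 'http://schemas.microsoft.com/2003/10/Serialization/Arrays',
    'i': 'http://www.w3.org/2001/XMLSchema-instance',
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

root = ET.fromstring(SOURCE)

# The same mapping doubles as the namespaces argument for findall().
path = ('sdk:Conditions/a:Condition/sdk:ConditionValue/'
        'sdk:Default/sdk:Text/b:string')
for node in root.findall(path, NS):
    node.text = 'NEWCODE'  # hypothetical replacement value

result = ET.tostring(root, encoding='unicode')
print(result)
```

The output should be equivalent to the input as far as a namespace-aware consumer is concerned, but it is not byte-identical: ElementTree writes all namespace declarations on the root element and uses the registered prefixes throughout.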
Lagoon answered 4/8, 2015 at 10:10 Comment(9)
Thanks. I'm confused about which values to set for the prefixes. Looking at all the declarations throughout the original XML, how can I work out which prefix to assign to which namespace? xmlns="http://Test.the.Sdk/2010/07" xmlns="http://schema.test.org/2004/07/Test.Soa.Vocab" xmlns:a="http://schema.test.org/2004/07/System.Xml.Serialize" xmlns:a="http://schema.test.org/2004/07/Test.Soa.Vocab" xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays" xmlns:b="http://schema.test.org/2004/07/System.Xml.Serialize" xmlns:i="http://www.w3.org/2001/XMLSchema-instance"Susann
Assign the prefix that follows the colon in each xmlns declaration to that namespace; if a declaration has no prefix, register the prefix as empty. For example, b for http://schemas.microsoft.com/2003/10/Serialization/Arrays and b for http://schema.test.org/2004/07/System.Xml.Serialize. You can also specify your own prefixes, which may be more readable (the source XML seems to use the same prefix for multiple namespaces, which, though valid, may not be good for readability).Lagoon
Unfortunately I can't get it to save in the exact same format it was opened in. Now it adds a larger declaration of prefixes and still keeps the ns0. Is there no way to make ETree just keep the formatting the way it was opened?Susann
Is it still ns0 and ns1? And you did add the namespaces before reading the XML, right? As suggested: before you do fromstring() (reading the XML).Lagoon
Correct, the first lines in my script are: ET.register_namespace('', 'http://Telestream.Vantage.Sdk/2010/07') ET.register_namespace('i', 'http://www.w3.org/2001/XMLSchema-instance') ET.register_namespace('', 'http://schemas.datacontract.org/2004/07/Telestream.Soa.Vocabulary') ET.register_namespace('b', 'http://schemas.datacontract.org/2004/07/System.Xml.Serialization') ET.register_namespace('a', 'http://schemas.datacontract.org/2004/07/System.Xml.Serialization') ET.register_namespace('a', 'http://schemas.datacontract.org/2004/07/Telestream.Soa.Vocabulary') Susann
I wish I could paste the full XML for analysis; the character limit is hurting. It's adding the namespaces now but kept the ns0: <ns0:Context xmlns:a="http://schemas.datacontract.org/2004/07/Telestream.Soa.Vocabulary" xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays" xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns0="http://Telestream.Vantage.Sdk/2010/07">Susann
I tested it; for some reason the empty prefix is not working for your XML. Try putting some meaningful names in its place.Lagoon
Thank you, that seemed to clear it up. There are a few spots that still don't look right, but I'll continue to experiment and see if I can get them working. This whole prefix thing, and a random name in there fixing it, just confuses me even more though, haha.Susann
I will be sure to do that once I get a final working method. I just realized that you posted a demo/sample above, so I'll explore that too. I'm still working on it, as there is more to this component that needs to be in place before I can test the live XMLs. I'll be sure to mark the proper answer once all is resolved. Thanks so much for your help so far.Susann
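
A possible explanation for the leftover ns0 in the comments above, offered as a sketch and not a diagnosis: register_namespace() keeps a one-to-one mapping between prefixes and URIs, so registering a prefix a second time (in the comments, '' and 'a' were each used for two URIs) replaces the earlier entry, and the namespace that lost its prefix falls back to a generated nsN. The URIs below are placeholders, not the asker's.

```python
import xml.etree.ElementTree as ET

# Placeholder URIs (assumption), standing in for two namespaces that are
# both registered under the empty prefix.
FIRST = 'urn:example:first'
SECOND = 'urn:example:second'

ET.register_namespace('', FIRST)
ET.register_namespace('', SECOND)  # replaces the registration for FIRST

root = ET.fromstring(
    '<outer xmlns="urn:example:first">'
    '<inner xmlns="urn:example:second"/></outer>'
)
clashing = ET.tostring(root, encoding='unicode')
print(clashing)  # FIRST no longer has a prefix, so a generated one is used

# Distinct prefixes keep both registrations alive. The tree does not need
# to be re-parsed, because the registry is read at serialization time.
ET.register_namespace('first', FIRST)
ET.register_namespace('second', SECOND)
distinct = ET.tostring(root, encoding='unicode')
print(distinct)
```

This would also be consistent with the observation that replacing the empty prefixes with distinct, meaningful names cleared things up.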
F
1

First off, welcome to the StackOverflow network! Technically @anand-s-kumar is correct. However, there was a minor misuse of the tostring() function, and the namespaces might not always be known to the code or be the same between tags or XML files. Also, inconsistencies between the lxml and xml.etree libraries and between Python 2.x and 3.x make handling this difficult.

This function iterates through all of the elements in the XML tree that is passed in and edits the tags to remove the namespaces. Note that by doing this, some data may be lost.

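import re

def remove_namespaces(tree):
    # Strip a leading "{namespace-uri}" from every element tag, in place.
    # Attribute names are left untouched.
    for el in tree.iter():  # getiterator() was removed in Python 3.9
        match = re.match(r"^(?:\{.*?\})?(.*)$", el.tag)
        if match:
            el.tag = match.group(1)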
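
The function above only rewrites element tags, so a namespaced attribute such as i:nil would still be serialized with a namespace prefix and declaration. A variant that also strips attribute names might look like the sketch below; strip_all_namespaces is a hypothetical helper name and the URIs are placeholders.

```python
import re
import xml.etree.ElementTree as ET

_NS = re.compile(r'^\{[^}]*\}')  # matches a leading "{namespace-uri}"

def strip_all_namespaces(root):
    """Remove the {uri} part from tags and attribute names, in place."""
    for el in root.iter():
        if isinstance(el.tag, str):  # comments and PIs have non-string tags
            el.tag = _NS.sub('', el.tag)
        for name in list(el.attrib):
            bare = _NS.sub('', name)
            if bare != name:
                el.attrib[bare] = el.attrib.pop(name)

root = ET.fromstring(
    '<Context xmlns="urn:example:sdk" xmlns:i="urn:example:xsi">'
    '<Summary i:nil="true"/></Context>'
)
strip_all_namespaces(root)
stripped = ET.tostring(root, encoding='unicode')
print(stripped)
```

This is lossy in the same way as the original: two attributes that differ only by namespace would collapse into one, and a receiver that validates namespaces may reject the result.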

I myself just ran into this problem, and hacked together a quick solution. I tested this on about 81,000 XML files (averaging around 150 MB each) that had this problem, and all of them were fixed. Note that this isn't exactly an optimal solution, but it is relatively efficient and worked quite well for me.

CREDIT: Idea and code structure originally from Jochen Kupperschmidt.
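
If dropping namespaces is not acceptable and the prefixes are not known in advance, one alternative is to read the declarations from the document itself and register them before serializing. This is a minimal sketch under assumptions: register_declared_namespaces is a hypothetical helper, the sample XML and its URIs are placeholders, and the input is assumed to be a str that can be encoded as UTF-8.

```python
import io
import xml.etree.ElementTree as ET

def register_declared_namespaces(xml_text):
    """Register every prefix declared in xml_text; return (prefix, uri) pairs."""
    source = io.BytesIO(xml_text.encode('utf-8'))
    # 'start-ns' events carry a (prefix, uri) tuple per xmlns declaration;
    # the default namespace is reported with an empty prefix.
    declared = [ns for _, ns in ET.iterparse(source, events=['start-ns'])]
    for prefix, uri in declared:
        ET.register_namespace(prefix, uri)
    return declared

SAMPLE = (
    '<Context xmlns="urn:example:sdk" xmlns:a="urn:example:vocab">'
    '<a:Condition><Name>Code</Name></a:Condition></Context>'
)
pairs = register_declared_namespaces(SAMPLE)
roundtrip = ET.tostring(ET.fromstring(SAMPLE), encoding='unicode')
print(roundtrip)
```

This only helps when each prefix is bound to a single URI. A document that rebinds a prefix, as the one in the question does with a and b, will still end up with generated prefixes for the bindings that were overwritten.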

Font answered 7/8, 2015 at 1:51 Comment(2)
Thanks and very interesting. I am going to submit a POST through a REST API and I am not sure if the receiving node will accept it without namespaces. That would be ideal if it ignores them. I'll see what I can whip up. Thanks.Susann
Unfortunately this is the only solution that worked for me. Not sure if this library is buggy, or maybe I just don't understand it, but it's really painful when working with complex xml files.Labiche
