Why does the xml package modify my XML file in Python 3?

I use the xml library in Python 3.5 to read and write an XML file. I don't modify the file; I just open it and write it back out. But the library modifies the file.

  1. Why is it modified?
  2. How can I prevent this? E.g., I just want to replace a specific tag or its value in a quite complex XML file without losing any other information.

This is the example file

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<movie>
    <title>Der Eisbär</title>
    <ids>
        <entry>
            <key>tmdb</key>
            <value xsi:type="xs:int" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">9321</value>
        </entry>
        <entry>
            <key>imdb</key>
            <value xsi:type="xs:string" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">tt0167132</value>
        </entry>
    </ids>
</movie>

This is the code

import xml.etree.ElementTree as ET
tree = ET.parse('x.nfo')
tree.write('y.nfo', encoding='utf-8')

And the xml-file becomes this

<movie xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <title>Der Eisbär</title>
    <ids>
        <entry>
            <key>tmdb</key>
            <value xsi:type="xs:int">9321</value>
        </entry>
        <entry>
            <key>imdb</key>
            <value xsi:type="xs:string">tt0167132</value>
        </entry>
    </ids>
</movie>
  • Line 1 is gone.
  • The <movie>-tag in line 2 has attributes now.
  • The <value>-tags in lines 7 and 11 now have fewer attributes.
Olney answered 31/8, 2017 at 22:8 Comment(5)
In general, the short names (and where they are specified) for XML namespaces cannot be expected to be stable. But why aren't you using lxml anyway? - Scevour
lxml manages to preserve the namespaces by default, although you still have to pass a flag to get the XML declaration up top. - Scevour
@Scevour You mean the Python package lxml? I hadn't noticed it. I was just using xml as a search term in the Python docs and found ElementTree. - Olney
@Scevour lxml doesn't help either. It also does some transformations of the file. - Olney
Well, nobody tries to preserve attribute order. So what lxml does is sort them. So even if they are changed on the first write, they will be consistent on all subsequent writes. - Scevour

Note that "xml package" and "the xml library" are ambiguous. There are several XML-related modules in the standard library: https://docs.python.org/3/library/xml.html.

Why is it modified?

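ElementTree moves namespace declarations to the root element, and it removes declarations for namespaces that aren't actually used in element or attribute names. In the example, the xs prefix appears only inside an attribute value (xsi:type="xs:int"), which ElementTree treats as plain text, so the xmlns:xs declaration is dropped.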
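
The behaviour can be reproduced without any files. This is a minimal sketch using a shortened version of the example from the question; the SOURCE string and the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the file from the question (illustrative only).
SOURCE = (
    '<movie><ids><entry>'
    '<value xsi:type="xs:int" '
    'xmlns:xs="http://www.w3.org/2001/XMLSchema" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">9321</value>'
    '</entry></ids></movie>'
)

root = ET.fromstring(SOURCE)
value = root.find('./ids/entry/value')

# After parsing, the prefix is gone: the attribute name is stored with the
# full namespace URI, and "xs:int" is just an ordinary string value.
print(value.attrib)

# On serialization, ElementTree collects the namespaces used in element and
# attribute names and declares them once, on the root element.
serialized = ET.tostring(root, encoding='unicode')
print(serialized)
```

In the printed result the xmlns:xsi declaration should sit on <movie>, and no xmlns:xs declaration is written at all, which matches the output shown in the question.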

Why does ElementTree do this? I don't know, but perhaps it is a way to make the implementation simpler.

How can I prevent this? E.g., I just want to replace a specific tag or its value in a quite complex XML file without losing any other information.

I don't think there is a way to prevent this. The issue has been brought up before. Here are two very similar questions:

My suggestion is to use lxml instead of ElementTree. With lxml, the namespace declarations will remain where they occur in the original file.

Line 1 is gone.

That line is the XML declaration. It is recommended but not mandatory to have one.

If you always want an XML declaration, use xml_declaration=True in the write() method call.
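
A minimal sketch of that call, writing to an in-memory buffer instead of y.nfo so that it runs on its own; the buffer and the shortened document are assumptions for illustration. Note that ElementTree writes its own form of the declaration, so the standalone="yes" part of the original line is not reproduced.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the parsed x.nfo (illustrative only).
root = ET.fromstring('<movie><title>Der Eisbär</title></movie>')
tree = ET.ElementTree(root)

# With encoding='utf-8' alone the declaration is omitted;
# xml_declaration=True forces it to be written.
buffer = io.BytesIO()
tree.write(buffer, encoding='utf-8', xml_declaration=True)

data = buffer.getvalue()
print(data.decode('utf-8'))
```

With a real file, the same keyword arguments can be passed to tree.write('y.nfo', ...).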

Gantlet answered 8/9, 2017 at 6:44 Comment(0)
