Creating a Unicode XML from scratch with Python 3.2

Asked 13/12, 2012 at 1:45 Answered 13/12, 2012 at 11:58

Solved python xml unicode python-3.x xml.etree

I want to generate an XML document from data in a Python dictionary: the dictionary's keys become the tag names and its values become the element text. I have no need to give attributes to the items, and my desired output would look something like this:

<AllItems>

  <Item>
    <some_tag> Hello World </some_tag>
    ...
    <another_tag />
  </Item>

  <Item> ... </Item>
  ...

</AllItems>

I have tried using the xml.etree.ElementTree package, by creating a tree, setting an Element "AllItems" as the root like so:

from xml.etree import ElementTree as et

def dict_to_elem(dictionary):
    item = et.Element('Item')
    for key in dictionary:
        field = et.Element(key.replace(' ',''))
        field.text = dictionary[key]
        item.append(field)
    return item

newtree = et.ElementTree()
root = et.Element('AllItems')
newtree._setroot(root)

root.append(dict_to_elem(  {'some_tag':'Hello World', ...}  ))
# Lather, rinse, repeat this append step as needed

with open(  filename  , 'w', encoding='utf-8') as file:
    newtree.write(file, encoding='unicode')

In the last two lines, I have tried omitting the encoding in the open() statement, and both omitting the encoding in the write() method and changing it to 'UTF-8'. Each time I get an error, such as one saying that type str is not serializable.

So my question: how should I go about creating a UTF-8 XML file from scratch in the format above, and is there a more robust solution, using another package, that will let me handle UTF-8 characters properly? I'm not married to ElementTree for a solution, but I would prefer not to have to create a schema. Thanks in advance for any advice/solutions!

Tophus answered 13/12, 2012 at 1:45 Comment(0)

In my opinion, ElementTree is a good choice. If you need a more capable package in the future, you can switch to the third-party lxml module, which uses the same interface.

The answer to your problem can be found in the doc http://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.ElementTree.write

The output is either a string (str) or binary (bytes). This is controlled by the encoding argument. If encoding is "unicode", the output is a string; otherwise, it’s binary. Note that this may conflict with the type of file if it’s an open file object; make sure you do not try to write a string to a binary stream and vice versa.

Basically, you are doing it correctly. You open() the file in text mode, so the file accepts strings, and you need to pass the 'unicode' argument to tree.write(). Alternatively, you could open the file in binary mode ('wb', with no encoding argument to open()) and pass 'utf-8' to tree.write().

A slightly cleaned-up version that runs on its own:

#!python3
from xml.etree import ElementTree as et

def dict_to_elem(dictionary):
    item = et.Element('Item')
    for key in dictionary:
        field = et.Element(key.replace(' ',''))
        field.text = dictionary[key]
        item.append(field)
    return item

root = et.Element('AllItems')     # create the element first...
tree = et.ElementTree(root)       # and pass it to the created tree

root.append(dict_to_elem(  {'some_tag':'Hello World', 'xxx': 'yyy'}  ))
# Lather, rinse, repeat this append step as needed

filename = 'a.xml'
with open(filename, 'w', encoding='utf-8') as file:
    tree.write(file, encoding='unicode')

# The alternative is...    
fname = 'b.xml'
with open(fname, 'wb') as f:
    tree.write(f, encoding='utf-8')

It depends on the purpose. Of the two, I personally prefer the first solution. It clearly says that you write a text file (and the XML is a text file).

But the simplest alternative, where you do not have to open the file yourself, is to pass the file name to tree.write() like this:

tree.write('c.xml', encoding='utf-8')

It opens the file, writes the content using the given encoding (updated after Sebastian's comment below), and closes the file. It is easy to read and leaves little room for mistakes.
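The pairing rule described in this answer (text stream with encoding='unicode', binary stream with encoding='utf-8') can also be checked without touching the disk. The following is only a minimal sketch, assuming in-memory io.StringIO and io.BytesIO buffers as stand-ins for the files opened with 'w' and 'wb'; the buffer names are made up for illustration.

```python
import io
from xml.etree import ElementTree as et

root = et.Element('AllItems')
item = et.SubElement(root, 'Item')
et.SubElement(item, 'some_tag').text = 'Hello World ☺'
tree = et.ElementTree(root)

# Text stream + encoding='unicode': the serializer writes str objects.
text_buffer = io.StringIO()          # stand-in for open(..., 'w', encoding='utf-8')
tree.write(text_buffer, encoding='unicode')
as_text = text_buffer.getvalue()

# Binary stream + encoding='utf-8': the serializer writes bytes.
byte_buffer = io.BytesIO()           # stand-in for open(..., 'wb')
tree.write(byte_buffer, encoding='utf-8')
as_bytes = byte_buffer.getvalue()

print(type(as_text).__name__, type(as_bytes).__name__)
print(as_bytes.decode('utf-8') == as_text)
```

Both routes should yield the same document; the only difference is whether the encoding step happens inside tree.write() or in the file object.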

Immanent answered 13/12, 2012 at 8:45 Comment(2)

note: tree.write() uses ascii when no explicit encoding parameter is given. It converts all non-ascii characters to xml character references, e.g., '☺' -> '&#9786;'. – Liesa 13/12, 2012 at 12:6

+1. Thanks for the info, Sebastian. I did not check that. Updated. – Immanent 13/12, 2012 at 20:25
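The default-encoding behaviour mentioned in the comment above can be seen in a small sketch. It uses et.tostring(), which goes through the same serializer as tree.write(); the variable names are chosen only for illustration.

```python
from xml.etree import ElementTree as et

elem = et.Element('some_tag')
elem.text = 'Hello World ☺'

# No encoding given: the output is ASCII bytes, and the smiley is
# replaced by a numeric character reference.
ascii_bytes = et.tostring(elem)

# Explicit utf-8: the character itself is kept, encoded as UTF-8 bytes.
utf8_bytes = et.tostring(elem, encoding='utf-8')

print(ascii_bytes)
print(utf8_bytes.decode('utf-8'))

# Both forms parse back to the same text, so the reference is lossless.
print(et.fromstring(ascii_bytes).text == et.fromstring(utf8_bytes).text)
```

The ASCII form is still valid XML; it is just less readable when the file contains many non-ASCII characters.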

It shouldn't be necessary, but you could add the XML declaration explicitly if your tool doesn't understand the generated XML file:

#!/usr/bin/env python3
from xml.etree import ElementTree as etree

your_dict = {'some_tag': 'Hello World ☺'}

def add_items(root, items):
    for name, text in items:
        elem = etree.SubElement(root, name)
        elem.text = text

root = etree.Element('AllItems')
add_items(etree.SubElement(root, 'Item'),
          ((key.replace(' ', ''), value) for key, value in your_dict.items()))
tree = etree.ElementTree(root)
tree.write('output.xml', xml_declaration=True, encoding='utf-8')

output.xml:

<?xml version='1.0' encoding='utf-8'?>
<AllItems><Item><some_tag>Hello World ☺</some_tag></Item></AllItems>
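One way to confirm that a file written this way really is UTF-8 XML is to read it back. The sketch below is not part of the original answer; it assumes a temporary directory for the output and rebuilds the same small tree, then checks the declaration and the parsed text.

```python
import os
import tempfile
from xml.etree import ElementTree as etree

root = etree.Element('AllItems')
item = etree.SubElement(root, 'Item')
etree.SubElement(item, 'some_tag').text = 'Hello World ☺'

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'output.xml')   # temporary location, for illustration
    etree.ElementTree(root).write(path, xml_declaration=True, encoding='utf-8')

    # Raw bytes, to inspect the declaration as written on disk.
    with open(path, 'rb') as f:
        raw = f.read()

    # Parse the file again and pull out the element text.
    parsed_text = etree.parse(path).getroot().findtext('Item/some_tag')

print(raw.startswith(b"<?xml version='1.0' encoding='utf-8'?>"))
print(parsed_text == 'Hello World ☺')
```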

Liesa answered 13/12, 2012 at 11:58 Comment(3)

trying to run the exact same snippet, but getting an error TypeError: escape_cdata_carriage_return() missing 1 required positional argument: 'encoding' – Navicert 13/11, 2017 at 22:55

@RohithYeravothula the code in the answer works as is on Python 3.5.1 too. – Liesa 14/11, 2017 at 5:9

yeah my bad, another script was overwriting a core function of ElementTree and that caused the issue. Above snippet works absolutely fine on python 3.6 too – Navicert 14/11, 2017 at 6:9
