Write xml utf-8 file with utf-8 data with ElementTree

I'm trying to write an xml file with utf-8 encoded data using ElementTree like this:

#!/usr/bin/python                                                                       
# -*- coding: utf-8 -*-                                                                   

import xml.etree.ElementTree as ET
import codecs

testtag = ET.Element('unicodetag')
testtag.text = u'Töreboda'  # the second letter is ö (o with a diaeresis), a non-ASCII character
expfile = codecs.open('testunicode.xml',"w","utf-8-sig")
ET.ElementTree(testtag).write(expfile,encoding="UTF-8",xml_declaration=True)
expfile.close()

This blows up with the error

Traceback (most recent call last):
  File "unicodetest.py", line 10, in <module>
    ET.ElementTree(testtag).write(expfile,encoding="UTF-8",xml_declaration=True)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 815, in write
    serialize(write, self._root, encoding, qnames, namespaces)    
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 932, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

Using the "us-ascii" encoding instead works fine, but it doesn't preserve the Unicode characters in the data. What is happening?

Forgotten answered 6/4, 2012 at 17:12 Comment(0)

codecs.open returns a file object that expects Unicode strings and handles the encoding to UTF-8 itself. ElementTree's write, however, already encodes the Unicode strings to UTF-8 byte strings before sending them to the file object. Since the file object wants Unicode strings, Python 2 implicitly coerces each byte string back to Unicode using the default ascii codec, and that coercion fails on the first non-ASCII byte with the UnicodeDecodeError.
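The hidden coercion can be made explicit. This is a minimal sketch, assuming the implicit step behaves like calling .decode('ascii') on the already-encoded bytes; the variable names are illustrative and not part of the original script:

```python
# -*- coding: utf-8 -*-
# Reproduce, explicitly, the step that the codecs stream performs implicitly.
text = u'T\xf6reboda'            # u'Töreboda'
encoded = text.encode('utf-8')   # what ElementTree hands to the file object

try:
    # Assumption: this mirrors the implicit bytes -> unicode coercion.
    encoded.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The failing byte is the first byte of the UTF-8 encoding of ö, which sits right after the T.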

Just do this:

#expfile = codecs.open('testunicode.xml',"w","utf-8-sig")
ET.ElementTree(testtag).write('testunicode.xml',encoding="UTF-8",xml_declaration=True)
#expfile.close()
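If a file object is needed rather than a filename, a binary stream should work just as well, because ElementTree then does the only encoding pass. The sketch below uses io.BytesIO as a stand-in for a file opened with mode "wb". The write_xml helper and its bom parameter are hypothetical names introduced here; the optional BOM is only an assumption about why "utf-8-sig" was used in the question.

```python
# -*- coding: utf-8 -*-
import codecs
import io
import xml.etree.ElementTree as ET


def write_xml(tree, stream, bom=False):
    """Write tree to a *binary* stream; ElementTree does the UTF-8 encoding.

    Hypothetical helper for illustration only.
    """
    if bom:
        # Emit the UTF-8 byte order mark by hand, as raw bytes.
        stream.write(codecs.BOM_UTF8)
    tree.write(stream, encoding="UTF-8", xml_declaration=True)


testtag = ET.Element('unicodetag')
testtag.text = u'T\xf6reboda'  # u'Töreboda'
tree = ET.ElementTree(testtag)

plain = io.BytesIO()           # stands in for open('testunicode.xml', 'wb')
write_xml(tree, plain)

with_bom = io.BytesIO()
write_xml(tree, with_bom, bom=True)

# Round trip: the parsed text is the original Unicode string.
roundtrip = ET.fromstring(plain.getvalue()).text
```

Passing the filename, as above, remains the simplest option; the stream variant only matters when the destination is not a plain path.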
Pepperandsalt answered 6/4, 2012 at 20:9 Comment(3)
+1. Just to clarify this: the problem is that you're trying to encode unicode->utf-8 twice: ElementTree does it once, and then the codec-enabled stream tries to do it again. But this second pass gets confused since its input is already encoded (it expects a unicode string, but gets a utf-8 encoded byte string instead). - Drusy
Here I go derping along thinking I'm helping by providing a unicode file... Can I just say that I LOVE stackoverflow? A perfect answer within 3 hours! Mark's elaboration explains a lot too. - Forgotten
I've been dealing with utf-8 data and received similar errors in ElementTree._serialize_text() or _serialize_xml() when attempting to write to an xml file. I was able to solve it by converting my strings to unicode using myString.decode('utf-8') before adding them to my ET.Element object. It seems ET.ElementTree.write() is not happy with byte strings that contain non-ASCII data. - Contention
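The decode-first approach from the last comment might look like the following sketch. The raw value is made-up sample data standing in for bytes read from some external source:

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

raw = b'T\xc3\xb6reboda'        # UTF-8 encoded bytes (sample data)
elem = ET.Element('unicodetag')
elem.text = raw.decode('utf-8')  # decode to Unicode before handing it to ElementTree

# ElementTree encodes exactly once on the way out.
serialized = ET.tostring(elem, encoding='UTF-8')
```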
