How to correctly parse UTF-8 XML with ElementTree?

Asked 11/2, 2014 at 9:36 Answered 11/2, 2014 at 9:41

Solved python xml python-2.7 xml-parsing elementtree

I need help understanding why parsing my XML file* with xml.etree.ElementTree produces the following errors.

*My test XML file contains Arabic characters.

Task: Open and parse the utf8_file.xml file.

My first try:

import codecs
import xml.etree.ElementTree as etree
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_tree = etree.parse(utf8_file)

Result 1:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 236-238: ordinal not in range(128)

My second try:

import codecs
import xml.etree.ElementTree as etree
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_string = etree.tostring(utf8_file, encoding='utf-8', method='xml')
    xml_tree  = etree.fromstring(xml_string)

Result 2:

AttributeError: 'file' object has no attribute 'getiterator'

Please explain the errors above and comment on the possible solution.

Faceplate answered 11/2, 2014 at 9:36 Comment(0)

Leave decoding the bytes to the parser; do not decode first:

import xml.etree.ElementTree as etree
with open('utf8_file.xml', 'rb') as xml_file:
    xml_tree = etree.parse(xml_file)

An XML file must contain enough information in its first line (the XML declaration) for the parser to handle decoding. If the declaration is missing and there is no byte order mark, the parser must assume UTF-8.

Because it is the XML header that holds this information, it is the responsibility of the parser to do all decoding.
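To illustrate, here is a minimal sketch that feeds the parser raw bytes in two different encodings, each declared in its own XML header. The sample word and the variable names are made up for the example; the point is only that the parser picks the codec from the declaration, so both inputs should yield the same text.

```python
import xml.etree.ElementTree as etree

# Hypothetical sample text containing one non-ASCII character.
text = u"caf\u00e9"

# The same document serialized twice, each time with a matching declaration.
utf8_doc = (
    u'<?xml version="1.0" encoding="utf-8"?><word>%s</word>' % text
).encode("utf-8")
latin1_doc = (
    u'<?xml version="1.0" encoding="iso-8859-1"?><word>%s</word>' % text
).encode("iso-8859-1")

# Pass the bytes straight to the parser; it decodes based on the declaration.
from_utf8 = etree.fromstring(utf8_doc).text
from_latin1 = etree.fromstring(latin1_doc).text
```

The two byte strings differ, yet neither needs to be decoded by hand before parsing.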

Your first attempt failed because you handed the parser text that was already decoded to Unicode, so Python tried to encode the Unicode values again (with the default ASCII codec) so that the parser could handle byte strings as it expected. The second attempt failed because etree.tostring() expects a parsed tree (an element) as its first argument, not a file object.
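If you want to try this without your own data, the following is a rough, self-contained sketch. It assumes a throwaway temporary file and an invented <greeting> element holding a short Arabic word; neither comes from your file. It writes UTF-8 bytes to disk and parses them back, once through a file object opened in binary mode and once by passing the filename.

```python
import os
import tempfile
import xml.etree.ElementTree as etree

# Hypothetical sample content: a short Arabic word, written with escapes.
arabic = u"\u0645\u0631\u062d\u0628\u0627"
xml_bytes = (
    u'<?xml version="1.0" encoding="utf-8"?><greeting>%s</greeting>' % arabic
).encode("utf-8")

fd, path = tempfile.mkstemp(suffix=".xml")
try:
    # Write the raw UTF-8 bytes to a temporary file.
    with os.fdopen(fd, "wb") as out:
        out.write(xml_bytes)

    # Open in binary mode so the parser receives undecoded bytes.
    with open(path, "rb") as xml_file:
        root_from_file = etree.parse(xml_file).getroot()

    # Passing the filename lets ElementTree open the file itself.
    root_from_name = etree.parse(path).getroot()
finally:
    os.remove(path)
```

In both cases the element text comes back as a Unicode string, with no codecs.open() involved.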

Argenteuil answered 11/2, 2014 at 9:41 Comment(11)

Excellent, it appeared to be easier than I thought. Even "utf-8 without BOM" files get parsed correctly. – Faceplate 11/2, 2014 at 9:48

UTF-8 without BOM is the standard; with BOM is mostly Microsoft wanting to make it easier to autodetect 8-bit encodings other than UTF-8. – Argenteuil 11/2, 2014 at 9:53

etree.parse(a_file) handles Unicode by default. However etree.fromstring(a_string) doesn't until Python 3.x (see bugs.python.org/issue11033) so to parse a string, you have to encode it manually, like etree.fromstring(a_string.encode('utf-8')). – Elviselvish 15/8, 2016 at 12:3

@ChrisJohnson: This question is about Python 2, where file objects produce byte strings, not Unicode. The question concerns the user reading data from a file and manually decoding, which is entirely pointless. – Argenteuil 15/8, 2016 at 12:5

@MartijnPieters I agree. This comment is meant to point out a non-obvious behavior for anyone looking into the string-based approach. It's non-obvious that the file-based method handles encoding by default but the string-based method requires pre-encoding. – Elviselvish 15/8, 2016 at 12:12

You can make it simpler and skip opening it as a file, I have code that does root = et.parse(sys.stdin).getroot() and it works just fine. Tested in Py3.6 – Insomniac 18/10, 2017 at 16:9

@Marcin: but that requires piping in the XML file. That's a different use case. – Argenteuil 18/10, 2017 at 16:11

Also works with sys.argv[1], I just used stdin as an example. – Insomniac 18/10, 2017 at 16:16

@Marcin: right, that's what you mean. Yes, you can pass in an open file object or a filename. – Argenteuil 18/10, 2017 at 16:20

@MartijnPieters I can see cElementTree.iterparse() also tries to decode, which in my case generates UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 293: ordinal not in range(128). I am simply passing the file object. Can I help it to decode somehow? – Kaunas 10/12, 2018 at 16:15

@TomHemmes: no, not without a traceback and example input, sorry. – Argenteuil 10/12, 2018 at 17:0
