How to correctly parse utf-8 xml with ElementTree?

I need help understanding why parsing my XML file* with xml.etree.ElementTree produces the errors below.

*My test XML file contains Arabic characters.

Task: open and parse the file utf8_file.xml.

My first try:

import codecs
import xml.etree.ElementTree as etree
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_tree = etree.parse(utf8_file)

Result 1:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 236-238: ordinal not in range(128)

My second try:

import codecs
import xml.etree.ElementTree as etree
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_string = etree.tostring(utf8_file, encoding='utf-8', method='xml')
    xml_tree = etree.fromstring(xml_string)

Result 2:

AttributeError: 'file' object has no attribute 'getiterator'

Please explain the errors above and comment on the possible solution.

Faceplate answered 11/2, 2014 at 9:36

Leave decoding the bytes to the parser; do not decode first:

import xml.etree.ElementTree as etree
# Binary mode hands the parser raw bytes on both Python 2 and Python 3.
with open('utf8_file.xml', 'rb') as xml_file:
    xml_tree = etree.parse(xml_file)

An XML file must carry enough information in its first line (the XML declaration) for the parser to handle decoding. If that declaration is missing, the parser must assume UTF-8.

Because it is the XML header that holds this information, it is the responsibility of the parser to do all decoding.
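As a minimal sketch of that point, the snippet below writes two small documents with different declared encodings and parses both from raw bytes, without decoding anything by hand. The helper name write_and_parse, the sample strings, and the temporary files are assumptions made for illustration; they are not part of the question.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as etree

ARABIC = u'\u0645\u0631\u062d\u0628\u0627'  # Arabic sample text ("marhaba")


def write_and_parse(text, encoding):
    """Hypothetical helper: write a tiny XML file in `encoding`, parse it back."""
    document = (u'<?xml version="1.0" encoding="%s"?>'
                u'<greeting>%s</greeting>') % (encoding, text)
    fd, path = tempfile.mkstemp(suffix='.xml')
    os.close(fd)
    try:
        # Write the document as bytes in the encoding named in its declaration.
        with io.open(path, 'wb') as out:
            out.write(document.encode(encoding))
        # Hand the parser raw bytes; it reads the declaration and decodes.
        with io.open(path, 'rb') as xml_file:
            return etree.parse(xml_file).getroot().text
    finally:
        os.remove(path)


utf8_text = write_and_parse(ARABIC, 'utf-8')
latin1_text = write_and_parse(u'caf\u00e9', 'iso-8859-1')
```

Both calls return the original text as a Unicode string, even though the caller never names an encoding when reading.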

Your first attempt failed because the parser expects byte strings: codecs.open() had already decoded the file to Unicode, so Python tried to encode those values again with the default ASCII codec, which cannot represent the Arabic characters. The second attempt failed because etree.tostring() expects a parsed tree (an Element) as its first argument, not a file object.
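For comparison, here is a rough sketch of how the calls from the second attempt fit together when used as intended: etree.fromstring() takes encoded bytes and returns an Element, and etree.tostring() takes that Element and returns bytes again. The sample document is an assumption made up for illustration.

```python
import xml.etree.ElementTree as etree

# Assumed sample input: UTF-8 encoded bytes, as they would come from a file.
source = u'<greeting>\u0645\u0631\u062d\u0628\u0627</greeting>'.encode('utf-8')

root = etree.fromstring(source)  # bytes in, Element out

# tostring() serialises a parsed Element; with encoding='utf-8' it returns bytes.
xml_bytes = etree.tostring(root, encoding='utf-8', method='xml')

# The serialised bytes can be parsed again without any manual decoding.
round_tripped = etree.fromstring(xml_bytes)
```

If the XML is already held in a Unicode string, encoding it first, as in etree.fromstring(a_string.encode('utf-8')), gives the parser bytes on both Python 2 and Python 3.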

Argenteuil answered 11/2, 2014 at 9:41 Comment(11)
Excellent, it appeared to be easier than I thought. Even "utf-8 without BOM" files get parsed correctly. - Faceplate
UTF-8 without BOM is the standard; with BOM is mostly Microsoft wanting to make it easier to autodetect 8-bit encodings other than UTF-8. - Argenteuil
etree.parse(a_file) handles Unicode by default. However etree.fromstring(a_string) doesn't until Python 3.x (see bugs.python.org/issue11033) so to parse a string, you have to encode it manually, like etree.fromstring(a_string.encode('utf-8')). - Elviselvish
@ChrisJohnson: This question is about Python 2, where file objects produce byte strings, not Unicode. The question concerns the user reading data from a file and manually decoding, which is entirely pointless. - Argenteuil
@MartijnPieters I agree. This comment is meant to point out a non-obvious behavior for anyone looking into the string-based approach. It's non-obvious that the file-based method handles encoding by default but the string-based method requires pre-encoding. - Elviselvish
You can make it simpler and skip opening it as a file, I have code that does root = et.parse(sys.stdin).getroot() and it works just fine. Tested in Py3.6. - Insomniac
@Marcin: but that requires piping in the XML file. That's a different use case. - Argenteuil
Also works with sys.argv[1], I just used stdin as an example. - Insomniac
@Marcin: right, that's what you mean. Yes, you can pass in an open file object or a filename. - Argenteuil
@MartijnPieters I can see cElementTree.iterparse() also tries to decode, which in my case generates UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 293: ordinal not in range(128). I am simply passing the file object. Can I help it to decode somehow? - Kaunas
@TomHemmes: no, not without a traceback and example input, sorry. - Argenteuil
