In Python 2.7, when I pass ElementTree's fromstring() a unicode string whose XML declaration specifies encoding="utf-16", I get a ParseError saying that the specified encoding is incorrect:
>>> from xml.etree import ElementTree
>>> data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
>>> ElementTree.fromstring(data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1300, in XML
parser.feed(text)
File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: encoding specified in XML declaration is incorrect: line 1, column 30
What does that mean? What makes ElementTree think so?
After all, I'm passing in unicode codepoints, not a byte string. There is no encoding involved here. How can it be incorrect?
Of course, one could argue that any declared encoding is incorrect, since these unicode codepoints are not encoded at all. But then why is UTF-8 not rejected as an "incorrect encoding" as well?
>>> ElementTree.fromstring(u'<?xml version="1.0" encoding="utf-8"?><root/>')
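Encoding the very same string to an actual UTF-16 byte string first (so that the bytes the parser sees really are UTF-16, BOM included, since encode('utf-16') prepends one) is accepted without complaint:
>>> ElementTree.fromstring(data.encode('utf-16'))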
I can solve the problem easily enough, either by encoding the unicode string to a UTF-16 byte string as shown above and passing that to fromstring(), or by replacing encoding="utf-16" with encoding="utf-8" in the unicode string, but I would like to understand why that exception is raised. The ElementTree documentation says nothing about accepting only byte strings.
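The second workaround would look roughly like this (just a sketch; it assumes the declaration in the real data is spelled exactly as in the toy string above):
>>> ElementTree.fromstring(data.replace(u'encoding="utf-16"', u'encoding="utf-8"', 1))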
Specifically, I would like to avoid these extra operations because my input data can get quite big, and I do not want to hold a second copy of it in memory or pay the CPU cost of processing it more often than absolutely necessary.