In Python 2.7, when I pass ElementTree's fromstring() a unicode string whose XML declaration specifies encoding="utf-16", I get a ParseError saying that the specified encoding is incorrect:
>>> from xml.etree import ElementTree
>>> data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
>>> ElementTree.fromstring(data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1300, in XML
parser.feed(text)
File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: encoding specified in XML declaration is incorrect: line 1, column 30
What does that mean? What makes ElementTree think so?
After all, I'm passing in unicode codepoints, not a byte string. There is no encoding involved here. How can it be incorrect?
Of course, one could argue that any declared encoding is incorrect, since these unicode codepoints are not encoded at all. But then why is UTF-8 not rejected as an "incorrect encoding" as well?
>>> ElementTree.fromstring(u'<?xml version="1.0" encoding="utf-8"?><root/>')
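Encoding the very same string to an actual UTF-16 byte string first (so that the bytes the parser sees really are UTF-16, BOM included, since encode('utf-16') prepends one) is accepted without complaint:
>>> ElementTree.fromstring(data.encode('utf-16'))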
I can solve the problem easily enough, either by encoding the unicode string to a UTF-16 byte string as shown above and passing that to fromstring(), or by replacing encoding="utf-16" with encoding="utf-8" in the unicode string, but I would like to understand why that exception is raised. The ElementTree documentation says nothing about accepting only byte strings.
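The second workaround would look roughly like this (just a sketch; it assumes the declaration in the real data is spelled exactly as in the toy string above):
>>> ElementTree.fromstring(data.replace(u'encoding="utf-16"', u'encoding="utf-8"', 1))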
Specifically, I would like to avoid these extra operations because my input data can get quite big, and I do not want to hold a second copy of it in memory or pay the CPU cost of processing it more often than absolutely necessary.