ElementTree and unicode
J

6

21

I have this char in an xml file:

<data>
  <products>
      <color>fumè</color>
  </product>
</data>

I try to generate an instance of ElementTree with the following code:

string_data = open('file.xml')
x = ElementTree.fromstring(unicode(string_data.encode('utf-8')))

and I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 185: ordinal not in range(128)

(NOTE: The position is not exact, I sampled the xml from a larger one).

How to solve it? Thanks

Jeane answered 10/9, 2012 at 10:23 Comment(1)
On a side note: your sample data is malformed; the opening tag is <products> but the closing tag is </product>.Drucill
S
12

You do not need to decode XML for ElementTree to work. XML carries its own encoding information (defaulting to UTF-8), and ElementTree does the work for you, outputting unicode:

>>> data = '''\
... <data>
...   <products>
...       <color>fumè</color>
...   </products>
... </data>
... '''
>>> x = ElementTree.fromstring(data)
>>> x[0][0].text
u'fum\xe8'

If your data is contained in a file(like) object, just pass the filename or file object directly to the ElementTree.parse() function:

x = ElementTree.parse('file.xml')
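
As a rough sketch of the point above (written for Python 3, which is an assumption; the session above is Python 2), the same bytes-in, text-out behaviour can be checked for an undeclared UTF-8 document and for one that declares ISO-8859-1 in its header. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

body = u"<data><products><color>fum\u00e8</color></products></data>"

# No XML declaration: the parser assumes UTF-8.
utf8_bytes = body.encode("utf-8")

# Explicit declaration: the parser reads the encoding from the header.
latin1_bytes = (
    u'<?xml version="1.0" encoding="iso-8859-1"?>' + body
).encode("iso-8859-1")

# In both cases we hand over raw bytes and get decoded text back.
utf8_color = ET.fromstring(utf8_bytes)[0][0].text
latin1_color = ET.fromstring(latin1_bytes)[0][0].text

print(utf8_color == latin1_color)
```

Neither call needs an explicit encode() or decode() on the XML itself; the encoding is only used here to build the sample bytes.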
Sande answered 10/9, 2012 at 10:35 Comment(9)
Sadly there are times when we have XML that does not have embedded encoding information and ElementTree gets it wrong, returning strs with broken characters in them.Thorne
@Kylotan: then those XML documents are at fault. The XML specification is very clear about this; the document is encoded as UTF8 unless specifically stated otherwise in the XML header.Sande
@Kylotan: you can override the XML declaration with an ElementTree.XMLParser() object passed in to the ElementTree.parse() function, use that for broken XML input.Sande
@Kylotan: that doesn't make my answer incorrect just because you have broken XML, however.Sande
Well, it's hard for me to know if there's anything wrong with the XML given that it seemed to render ok elsewhere but I have had problems with the output, that's all I know. (Would undo the downvote, but SO doesn't allow me to.)Thorne
@Kylotan: make sure you didn't make any errors with the Unicode output when extracting data from the XML file either; Unicode can be tricky and are not precise about encodings.Sande
@MartijnPieters FWIW I've just hit this in Python 2.6 on Windows. Have to pass a string into a function which then uses fromstring. Source is a UTF8 file read using codecs.open(). The solution does seem to be to force a conversion to utf8 in such situations.Prohibitionist
@CharlieClark: codecs.open() produces a Unicode value, so <type 'unicode'>, not a byte string. Yes, you'd have to encode back to UTF-8.Sande
I'm only experiencing problems with xml.etree.ElementTree.fromstring() in Python 2.6 on Windows. Other platforms (Mac, Linux) don't seem to need the conversion back to an encoded string. For future reference, just in case anyone else comes across the same issue.Prohibitionist
C
35

If you stumbled upon this problem while using Requests (HTTP for Humans): response.text decodes the response by default. Use response.content instead to get the undecoded bytes, so that ElementTree can decode them itself. Just remember to use the correct encoding.

More info: http://docs.python-requests.org/en/latest/user/quickstart/#response-content
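
A minimal sketch of that idea, assuming Python 3. The helper parse_xml_response and the FakeResponse class are hypothetical names invented for this example so that it runs without a network call; the only thing relied on from Requests is that a response exposes the undecoded body as bytes in .content.

```python
import xml.etree.ElementTree as ET


def parse_xml_response(response):
    """Parse the raw body of an HTTP response as XML.

    `response` is assumed to expose the undecoded body as bytes in
    `.content`, as a requests.Response does.
    """
    return ET.fromstring(response.content)


class FakeResponse(object):
    """Hypothetical stand-in for requests.Response, used only for this demo."""
    content = u"<data><products><color>fum\u00e8</color></products></data>".encode("utf-8")


root = parse_xml_response(FakeResponse())
color = root.find("products/color").text
print(color == u"fum\u00e8")
```

With a real response you would pass the object returned by requests.get(...) instead of FakeResponse().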

Colmar answered 29/12, 2013 at 13:0 Comment(1)
In general, you should pass XML data (as bytes) directly to an XML parser unless response.text takes into account that response.content is XML and follows the corresponding standards, e.g., reads the XML declaration, if any, to find out the character encoding (it seems unlikely that requests would do that, and it shouldn't).Jaguarundi
B
16

You need to decode utf-8 strings into a unicode object. So

string_data.encode('utf-8')

should be

string_data.decode('utf-8')

assuming string_data is actually a UTF-8 byte string.

So to summarize: to get a UTF-8 byte string from a unicode object, you encode the unicode object (using the UTF-8 encoding), and to turn a byte string into a unicode object, you decode the byte string using the respective encoding.
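
The two directions can be sketched as follows. This assumes Python 3, where the text type is str and the byte type is bytes (in Python 2 they are unicode and str respectively); the variable names are illustrative.

```python
# Text (a sequence of code points); U+00E8 is "è".
text = u"fum\u00e8"

# encode: text -> bytes. In UTF-8, "è" becomes the two bytes C3 A8.
encoded = text.encode("utf-8")

# decode: bytes -> text, using the encoding the bytes were written in.
decoded = encoded.decode("utf-8")

print(len(text), len(encoded), decoded == text)
```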

For more details on the concepts I suggest reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (not Python specific).

Biotic answered 10/9, 2012 at 10:30 Comment(4)
this doesn't change anything, unfortunatelyJeane
The OP's problem is that he/she is trying to handle decoding, instead of leaving it to ElementTree itself...Sande
@MartijnPieters: Absolutely, I wrote my answer while on the go, should've looked at the question a bit more carefully. While encoding a bytestring to get unicode is definitely wrong, it wasn't the (real) problem here.Biotic
You saved my life :).Triptolemus
D
2

Have you tried using the parse function, instead of opening the file... (which BTW would require a .read() after it for the .fromstring() to work...)

import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()
# etc...
Drucill answered 10/9, 2012 at 10:34 Comment(0)
H
1

Most likely your file is not UTF-8. The è character may come from some other encoding, Latin-1 for example.
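
If that turns out to be the case and the file has no XML declaration naming its encoding, one option is to tell the parser the encoding explicitly. The sketch below assumes Python 3 and assumes you know out of band that the bytes are ISO-8859-1 (Latin-1); the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Latin-1 bytes with no XML declaration; 0xE8 is "è" in ISO-8859-1.
latin1_bytes = b"<data><products><color>fum\xe8</color></products></data>"

# Assumption: the real encoding is known to be ISO-8859-1, so we
# override the parser's default of UTF-8.
parser = ET.XMLParser(encoding="iso-8859-1")
root = ET.fromstring(latin1_bytes, parser=parser)
color = root[0][0].text

print(color == u"fum\u00e8")
```

ElementTree.parse() accepts the same parser argument, so this also works when reading from a filename. The cleaner long-term fix is to add a correct encoding declaration to the file itself.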

Headsail answered 10/9, 2012 at 10:28 Comment(2)
I made sure that the file is saved with UTF-8 encoding.Jeane
Tried the encoding 'cp-1250', that didn't work. 'latin-1' did. Thanks!Megaron
J
1

Function open() does not return a string. Instead use open('file.xml').read().
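
A small sketch of the difference, assuming Python 3. It writes a throwaway temporary file (a stand-in for file.xml) so the example is self-contained, then reads it back in binary mode so that ElementTree receives the raw bytes.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

xml_bytes = u"<data><products><color>fum\u00e8</color></products></data>".encode("utf-8")

# Create a temporary stand-in for file.xml.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(xml_bytes)
    path = tmp.name

try:
    # open() returns a file object; .read() returns its contents.
    # Binary mode ("rb") leaves the decoding to the XML parser.
    with open(path, "rb") as f:
        root = ET.fromstring(f.read())

    # Equivalent shortcut: let ElementTree open and read the file.
    root_from_parse = ET.parse(path).getroot()
finally:
    os.remove(path)

color = root[0][0].text
print(color == root_from_parse[0][0].text)
```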

Jennette answered 10/3, 2014 at 10:7 Comment(0)
