Decode function tries to encode Python

I am trying to print a unicode string without the specific encoding hex in it. I'm grabbing this data from Facebook, whose HTML headers declare an encoding of UTF-8. When I print the type, it says it's unicode, but when I try to decode it with unicode-escape, I get an encoding error. Why is it trying to encode when I use the decode method?

Code

a='really long string of unicode html text that i wont reprint'
print type(a)
 >>> <type 'unicode'>   
print a.decode('unicode-escape')
 >>> Traceback (most recent call last):
  File "scfbp.py", line 203, in myFunctionPage
    print a.decode('unicode-escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 1945: ordinal not in range(128)
Prowess answered 25/1, 2011 at 23:14 Comment(0)

It's not the decode that's failing. It fails because you are trying to display the result on the console. When you use print, Python encodes the string using the default encoding, which is ASCII. Don't use print and it should work.

>>> a=u'really long string containing \\u20ac and some other text'
>>> type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')
u'really long string containing \u20ac and some other text'
>>> print a.decode('unicode-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 30: ordinal not in range(128)

I'd recommend using IDLE or some other interpreter that can output Unicode. Then you won't get this problem.
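A minimal sketch of the same two steps, written in Python 3 syntax so that the hidden encode has to be spelled out. The explicit `.encode('ascii')` call is an assumption standing in for what `print` does on an ASCII-only console in Python 2, and the variable names are illustrative only.

```python
# Text holding the six ASCII characters \u20ac, not the euro sign itself.
escaped = 'really long string containing \\u20ac and some other text'

# Step 1: decode the escape sequence. This succeeds and yields U+20AC.
decoded = escaped.encode('ascii').decode('unicode_escape')

# Step 2: imitate printing to an ASCII-only console by encoding to ASCII.
# This is the step that raises, not the decode above.
try:
    decoded.encode('ascii')
    failure = None
except UnicodeEncodeError as exc:
    failure = (exc.encoding, exc.start)
```

Here `failure` records the codec name and the index of the offending character, which corresponds to the "position 30" reported in the traceback above.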


Update: Note that this is not the same as the situation with one less backslash, where it fails during the decode itself, but with the same error message:

>>> a=u'really long string containing \u20ac and some other text'
>>> type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 30: ordinal not in range(128)
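The difference between the two strings can be checked directly. This is a small sketch in Python 3 syntax; the variable names are made up for illustration and the strings are shortened versions of the ones above.

```python
escaped = 'containing \\u20ac'   # backslash, u, 2, 0, a, c: six ASCII characters
literal = 'containing \u20ac'    # one non-ASCII character, the euro sign

# Only the escaped form survives a conversion through the ASCII codec.
escaped_is_ascii = all(ord(ch) < 128 for ch in escaped)
literal_is_ascii = all(ord(ch) < 128 for ch in literal)

# Six characters versus one, so the lengths differ by five.
length_difference = len(escaped) - len(literal)
```

The escaped form can pass through ASCII untouched and only causes trouble once it has been decoded and printed, while the literal form already contains a character that ASCII cannot represent.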
Devaney answered 25/1, 2011 at 23:17 Comment(8)
This is not the reason for his error. He is trying to decode a unicode object. Since you decode from binary data to unicode data, Python 2 will first encode it, which it does with the ascii codec. That's what is failing. - Mignonne
@Lennart Regebro: Actually I suspect that the actual type of his string is str, not unicode. Look at how he is initializing the string - notice there is no u. I think what he has is not a unicode string, but a unicode-escaped string (not the same!). It is this which he is trying to decode to unicode. If my theory is right then I think this answer is actually correct. - Devaney
@Mark Byers: True, that's inconsistent, but missing a u is easier than typing the wrong type. :) And the error is consistent with what he does. If you decode a unicode object, you get an encode error. - Mignonne
@Lennart: But he's also using decode instead of encode, which would imply he's starting with a str, not a unicode. And also the error message is consistent with my answer, isn't it? But I agree the question is incredibly confusing and missing important information. - Devaney
@Lennart Regebro: Also in his question he states: "I am trying to print a unicode string without the specific encoding hex in it.", which to me implies that he is trying to convert encoded data to a human-readable (unicode) string so that it displays correctly (as characters, not codes). Using decode is indeed the correct way to do this. If you notice, the "interactive console" output is clearly faked. - Devaney
@Lennart Regebro: Also, even if the string is in fact a unicode string, it could still contain only characters that are valid ASCII. For example it might be the string u'\\u2110' - this is a string containing six unicode characters (not one - notice the backslash is escaped). Then the automatic conversion from unicode to str will succeed, but the print will still fail with the error he gets. - Devaney
@Mark Byers: If the terminal encoding is ascii, then possibly he would get this error even if he is correctly printing a unicode string (not sure how to test that though, as I don't have an ascii-only terminal at hand). It seems a bit unlikely though, compared with him just being unicode-confused, which happens to most people. :) - Mignonne
@Lennart Regebro: I've added some more details to my answer. As you can see the error message can be the result of one of two entirely different situations depending on whether the original string contains unicode characters or unicode escape codes. I don't think you are more able to say which is the "correct" answer - only the OP knows that. But given all the information in his question I personally think the most likely explanation is the one I gave originally. And I don't think there is sufficient evidence to claim that my answer is wrong based on the information in the question alone. - Devaney

When you print to the console, Python tries to encode (convert) the string to the character set of your terminal. If that character set is not UTF-8, or another one that can represent every character in the string, it will whine and throw an exception.

This snags me every now and then when I do quick processing of data that contains, for example, Turkish characters.

If you are running python.exe through the Windows command prompt, you can find some solutions in the question "What encoding/code page is cmd.exe using". Basically you can change the code page with chcp, but it's quite cumbersome. I would follow Mark's advice and use something like IDLE.
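To see this behaviour without an ASCII-only terminal at hand, one can simulate one. The sketch below uses Python 3 syntax and wraps an in-memory buffer in a text stream; the stream names are hypothetical, and the `backslashreplace` error handler is just one possible fallback, not something the answers above prescribe. On a real console, the encoding in use can be inspected through `sys.stdout.encoding`.

```python
import io

text = 'price: \u20ac5'

# A stand-in for an ASCII-only terminal with strict error handling.
strict_console = io.TextIOWrapper(io.BytesIO(), encoding='ascii')
try:
    strict_console.write(text)
    strict_console.flush()
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The same stand-in terminal, but unencodable characters are written
# as backslash escapes instead of raising.
raw = io.BytesIO()
lenient_console = io.TextIOWrapper(raw, encoding='ascii',
                                   errors='backslashreplace')
lenient_console.write(text)
lenient_console.flush()
shown = raw.getvalue()
```

The lenient stream avoids the exception at the cost of showing the escape code rather than the character, so it is a workaround for logging rather than a way to display the euro sign.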

Athal answered 25/1, 2011 at 23:20 Comment(0)
>>> print type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')

Why is it trying to encode when I use the decode method?

Because you decode to Unicode, and you encode from Unicode. You just tried to decode a unicode string to unicode. The first thing Python 2 then does is try to convert it to a byte string (str) with the ascii codec. That's why you get:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2110' in position 3: ordinal not in range(128)

Remember: Unicode is not an encoding. ASCII, UTF-8, Latin-1 and the like are.

This implicit encoding is gone in Python 3, btw, because it confuses people.
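The implicit step can be imitated explicitly. The following is a rough sketch in Python 3 syntax; `py2_style_decode` is a hypothetical helper written for illustration, not a real library function, and it assumes the ascii default codec described above.

```python
def py2_style_decode(text, codec):
    """Hypothetical helper imitating Python 2's unicode.decode()."""
    # Step 1 (implicit in Python 2): unicode -> byte string via ASCII.
    raw = text.encode('ascii')
    # Step 2 (what was actually asked for): byte string -> unicode.
    return raw.decode(codec)

# Only ASCII characters (an escaped code), so step 1 succeeds.
ok = py2_style_decode('\\u20ac', 'unicode_escape')

# A real euro sign, so step 1 fails before any decoding happens.
try:
    py2_style_decode('\u20ac', 'unicode_escape')
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__

# In Python 3 the text type has no decode method at all.
has_decode = hasattr('', 'decode')
```

Spelling out both steps makes it visible why a call named decode can end in a UnicodeEncodeError.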

Mignonne answered 26/1, 2011 at 11:48 Comment(0)
