You are missing some important context; in that case the OP configured the terminal emulator (Gnome Terminal) to interpret output as Latin-1 but left the shell variables set to UTF-8. Python thus is told by the shell to use UTF-8 for Unicode output but the actual configuration of the terminal is to expect Latin-1 bytes.
The print
output clearly shows the terminal is interpreting output using Latin-1, and is not using UTF-8.
When a terminal is set to UTF-8, the \xe9
byte is not valid (incomplete) UTF-8 and your terminal usually prints a question mark instead:
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print '\xe9'
?
>>> print u'\xe9'
é
>>> print u'\xe9'.encode('utf8')
é
If you instruct Python to ignore such errors, it gives you the U+FFFD REPLACEMENT CHARACTER glyph �
instead:
>>> '\xe9'.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
>>> '\xe9'.decode('utf8', 'replace')
u'\ufffd'
>>> print '\xe9'.decode('utf8', 'replace')
�
That's because in UTF-8, \xe9
is the start byte of a 3-byte encoding, for the Unicode codepoints U+9000 through to U+9FFF, and if printed as just a single byte is invalid. This works:
>>> print '\xe9\x80\x80'
退
because that's the UTF-8 encoding of the U+9000 codepoint, a CJK Ideograph glyph.
If you want to understand the difference between encodings and Unicode, and how UTF-8 and other codecs work, I strongly recommend you read:
sys.stdout.encoding
value. – Evaporimeter