Latin-1 vs Unicode in Python
I was reading this highly rated post on SO about Unicode.

Here is an illustration given there:

$ python
>>> import sys
>>> print sys.stdout.encoding
UTF-8
>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

and the explanation given there was:

(1) Python outputs the binary string as is; the terminal receives it and tries to match its value against the latin-1 character map. In latin-1, 0xe9 or 233 yields the character "é", and so that's what the terminal displays.
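You can reproduce that byte-to-character lookup yourself; here is a minimal sketch (Python 2, any interpreter) of the Latin-1 mapping the terminal is described as doing:

>>> '\xe9'.decode('latin-1')  # byte 233 looked up in the Latin-1 table
u'\xe9'
>>> print '\xe9'.decode('latin-1')  # displays as é on a correctly configured terminal
é

Latin-1 maps bytes 0-255 straight onto codepoints U+0000 through U+00FF, so byte 0xE9 becomes codepoint U+00E9.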

My question is: why does the terminal match against the latin-1 character map when the encoding is set to UTF-8?

Also when I tried

>>> print '\xe9'
?
>>> print u'\xe9'
é

I get a different result for the first one than what is described above. Why is there this discrepancy, and where does latin-1 come into play in this picture?

Desouza answered 19/2, 2014 at 19:18 Comment(1)
I'm not sure how the OP managed to get that output, but that is incorrect, unless the OP changed the sys.stdout.encoding value. – Evaporimeter

You are missing some important context: in that case the OP configured the terminal emulator (Gnome Terminal) to interpret output as Latin-1 but left the shell variables set to UTF-8. Python is thus told by the shell to use UTF-8 for Unicode output, but the terminal's actual configuration is to expect Latin-1 bytes.

The print output clearly shows the terminal is interpreting output using Latin-1, and is not using UTF-8.
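Python derives sys.stdout.encoding for a terminal from the locale environment variables (LC_ALL / LC_CTYPE / LANG), not from what the terminal emulator itself is set to decode. A rough sketch of the mismatch (Python 2; exact encoding names vary by platform and installed locales):

$ LANG=en_US.UTF-8 python -c 'import sys; print sys.stdout.encoding'
UTF-8
$ LANG=en_US.ISO8859-1 python -c 'import sys; print sys.stdout.encoding'
ISO8859-1

Neither invocation changes what Gnome Terminal does with the bytes it receives; if its own character-encoding menu is set to Latin-1, it decodes as Latin-1 regardless.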

When a terminal is set to UTF-8, the \xe9 byte is not valid (incomplete) UTF-8 and your terminal usually prints a question mark instead:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print '\xe9'
?
>>> print u'\xe9'
é
>>> print u'\xe9'.encode('utf8')
é

If you instruct Python to replace bytes it cannot decode rather than raise an error, it gives you the U+FFFD REPLACEMENT CHARACTER glyph instead:

>>> '\xe9'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
>>> '\xe9'.decode('utf8', 'replace')
u'\ufffd'
>>> print '\xe9'.decode('utf8', 'replace')
�

That's because in UTF-8, \xe9 is the start byte of a 3-byte encoding, covering the Unicode codepoints U+9000 through U+9FFF, and printed as just a single byte it is invalid. This works:

>>> print '\xe9\x80\x80'
退

because that's the UTF-8 encoding of the U+9000 codepoint, a CJK ideograph.
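You can verify the arithmetic yourself. This is a minimal sketch (Python 2) of how a decoder assembles the codepoint from the payload bits: 4 from the lead byte and 6 from each continuation byte:

>>> b1, b2, b3 = 0xE9, 0x80, 0x80
>>> bin(b1)  # 1110xxxx marks the start of a 3-byte sequence
'0b11101001'
>>> codepoint = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
>>> hex(codepoint)
'0x9000'
>>> print unichr(codepoint)
退

With the lead byte fixed at \xe9, the 12 payload bits from the two continuation bytes range over 0x000-0xFFF, which is where the U+9000 through U+9FFF range comes from.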

If you want to understand the difference between encodings and Unicode, and how UTF-8 and other codecs work, I strongly recommend you read the Python Unicode HOWTO and Joel Spolsky's introduction to Unicode and character sets.

Evaporimeter answered 19/2, 2014 at 19:22 Comment(15)
So why is it that '\xe9' and u'\xe9' print differently? And what is the difference between latin-1 and unicode, since one of them uses one byte and the other two? (my vague understanding from that post) – Desouza
I was under the impression that ? is the unicode representation, but not for an invalid UTF-8 byte; that brings up the question of how ? is represented in UTF-8. On the same note, print chr(0xFF) prints ? – Tattoo
@user1988876: ? is just U+003F, or \x3f as a UTF-8, Latin-1 or ASCII byte. – Evaporimeter
@user1988876: any single byte outside the range 00-7F is invalid in UTF-8. Beyond U+007F, all codepoints require at least 2 bytes to encode. – Evaporimeter
@MartijnPieters: Thanks for pointing that out. So would that mean it is valid in UTF-16? – Tattoo
@user1988876: no, UTF-16 always uses multiples of two bytes. For most of the Unicode standard one such pair is enough; beyond U+FFFF, two pairs are used. – Evaporimeter
\xe9 - how do you tell that this is the start byte of a 3-byte encoding? Is there a table lookup, or some calculation? Sorry if this sounds stupid. – Desouza
@eagertoLearn: see the Wikipedia article on UTF-8; UTF-8 is a variable-byte encoding. – Evaporimeter
@eagertoLearn: If you look at the binary representation of hex E9, you'll see it starts with 3 bits set and one bit not set (1110 1001); this indicates it's the starting byte of a 3-byte character. A single-byte sequence starts with 0 (values 00-7F), a two-byte sequence starts with 110 (values C0-DF), 3 bytes with 1110 (E0-EF), etc.; see the sketch after these comments. – Evaporimeter
@MartijnPieters: I understand E9 is 1110 1001, but I do not see how it starts with 3 bits set and one bit not set. – Desouza
@eagertoLearn: 1110 is three set bits (each 1), followed by a 0 (not set). – Evaporimeter
@MartijnPieters: huh! I see what you mean, but how does that indicate it's the starting byte of a 3-byte character? This is the confusing part; the rest of your explanation is very clear. Thanks again – Desouza
That's how the standard was designed; it makes the work of decoders easier. – Evaporimeter
@MartijnPieters: can you give me an example of a 3-byte character? I added F to make it three bytes (chr(0xE9F)) but I get a ValueError – Desouza
@eagertoLearn: There is such a character in the answer. You want a Unicode codepoint (unichr() perhaps), or to produce 3 bytes, where chr() can only produce one byte. chr(0xE9) + chr(0x80) + chr(0x80), for example. – Evaporimeter
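To tie the thread together, here is a small sketch (Python 2; the helper name is my own) that classifies a byte by its leading bits and then builds the 3-byte character discussed above:

>>> def utf8_byte_role(b):
...     value = ord(b)
...     if value <= 0x7F:
...         return '1-byte (ASCII) character'    # 0xxxxxxx
...     if value <= 0xBF:
...         return 'continuation byte'           # 10xxxxxx
...     if value <= 0xDF:
...         return 'start of a 2-byte sequence'  # 110xxxxx
...     if value <= 0xEF:
...         return 'start of a 3-byte sequence'  # 1110xxxx
...     return 'start of a 4-byte sequence'      # 11110xxx (invalid above 0xF4)
...
>>> utf8_byte_role('\xe9')
'start of a 3-byte sequence'
>>> print chr(0xE9) + chr(0x80) + chr(0x80)  # three bytes, one character
退
>>> print unichr(0x9000)  # the same character, built from its codepoint
退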
