Python UnicodeDecodeError - Am I misunderstanding encode?

Any thoughts on why this isn't working? I really thought 'ignore' would do the right thing.

>>> 'add \x93Monitoring\x93 to list '.encode('latin-1','ignore')
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 4: ordinal not in range(128)
Whaley answered 15/12, 2008 at 15:57 Comment(1)
I also wrote a long blog about this subject: The Hassle of Unicode and Getting on With ItSkiles

… There's a reason they're called "encodings" …

A little preamble: think of Unicode as the norm, or the ideal state. Unicode is just a table of characters. №65 is the Latin capital A. №937 is the Greek capital omega. Just that.

In order for a computer to store and/or manipulate Unicode, it has to encode it into bytes. The most straightforward encoding of Unicode is UCS-4: every character occupies 4 bytes, and all of the roughly 1.1 million possible characters are available. The 4 bytes contain the number of the character in the Unicode tables as a 4-byte integer. Another very useful encoding is UTF-8, which can encode any Unicode character with one to four bytes. But there are also limited encodings, like "latin1", which cover a very small range of characters, mostly those used in Western countries. Such encodings use only one byte per character.
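
For example (an illustrative Python 2 session, not from the original answer), the number of bytes a character needs depends on both the character and the encoding:

>>> len(u'A'.encode('utf-8'))            # ASCII range: 1 byte in UTF-8
1
>>> len(u'\u03a9'.encode('utf-8'))       # Greek capital omega (№937 from above): 2 bytes in UTF-8
2
>>> len(u'\U0001d11e'.encode('utf-8'))   # MUSICAL SYMBOL G CLEF, outside the BMP: 4 bytes in UTF-8
4
>>> u'A'.encode('latin-1')               # latin-1: always one byte per character, tiny repertoire
'A'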

Basically, Unicode can be encoded with many encodings, and encoded strings can be decoded to Unicode. The thing is, Unicode came quite late, so all of us who grew up using an 8-bit character set learned too late that all this time we had been working with encoded strings. The encoding could be ISO8859-1, or Windows CP437, or CP850, and so on, depending on our system default.

So when, in your source code, you enter the string "add “Monitoring“ to list" (and I think you wanted the string "add “Monitoring” to list", note the second quote), you are actually using a string already encoded according to your system's default codepage (from the byte \x93 I assume you are using Windows codepage 1252, “Western”). If you want to get Unicode from that, you need to decode the string from the "cp1252" encoding.

So what you meant to do was:

"add \x93Monitoring\x94 to list".decode("cp1252", "ignore")

It's unfortunate that Python 2.x includes an .encode method for strings too; this is a convenience function for "special" encodings, like the "zip" or "rot13" or "base64" ones, which have nothing to do with Unicode.
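
To see why that convenience method bites here: when you call .encode with a character encoding such as latin-1 on a Python 2 byte string, Python first decodes the string with the default ASCII codec, and that implicit step is what raises the UnicodeDecodeError in the question. A rough illustrative session:

>>> 'hello'.encode('base64')        # a "special" str-to-str codec, no Unicode involved
'aGVsbG8=\n'
>>> 'add \x93Monitoring\x93 to list '.encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 4: ordinal not in range(128)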

Anyway, all you have to remember for your to-and-fro Unicode conversions is:

  • a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes)
  • a Python 2.x string gets decoded to a Unicode string

In both cases, you need to specify the encoding that will be used.
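
In Python 2 terms, a bare-bones round-trip sketch (my illustration, not part of the original answer):

>>> u = u'caf\xe9'                 # a Unicode string (the ideal form)
>>> b = u.encode('utf-8')          # encode: unicode -> bytes, encoding named explicitly
>>> b
'caf\xc3\xa9'
>>> b.decode('utf-8') == u         # decode: bytes -> unicode, same encoding named again
True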

I'm not being very clear, I'm sleepy, but I sure hope this helps.

PS A humorous side note: the Mayans didn't have Unicode; the ancient Romans, ancient Greeks, and ancient Egyptians didn't either. They all had their own "encodings", and had little to no respect for other cultures. All these civilizations crumbled to dust. Think about it, people! Make your apps Unicode-aware, for the good of mankind. :)

PS2 Please don't spoil the previous message by saying "But the Chinese…". If you feel inclined or obligated to do so, though, delay it by thinking that the Unicode BMP is populated mostly by Chinese ideograms, ergo Chinese is the basis of Unicode. I can go on inventing outrageous lies, as long as people develop Unicode-aware applications.

Saintsimonianism answered 16/12, 2008 at 0:45 Comment(15)
Unicode is not just a table of characters; e.g., a single abstract character may be represented by a sequence of code points: LATIN CAPITAL LETTER G WITH ACUTE (the corresponding coded character u"\u01F4", or 'Ǵ') is represented by the sequence u"\u0047\u0301" (or 'Ǵ'). is.gd/eTLi-Footgear
@J.F. Sebastian: no, Unicode isn't just a table of characters. I oversimplified things just for the purposes of this answer.Saintsimonianism
Nice answer guy with the Omega in his name. I just answered a similar question but hadn't seen your answer yet.Learn
Also, I believe UTF-8 uses 1 to 6 bytes. There are 2^32 characters possible, but the encoding itself has some overhead for tracking multibyte sequence length.Learn
@darkporter: yes, UTF-8 in theory could use up to 6 bytes, iff the Unicode standard used the complete 32 bit range for characters. Currently, though, the maximum Unicode character is U+10FFFF, and all Unicode characters need 4 bytes at the most when encoded as UTF-8.Saintsimonianism
@system: On most non-Windows operating systems, an ideal "Unicode string" is represented encoded as UTF-32, while on MS Windows systems it's represented encoded as UCS-2 (or maybe UTF-16, if there is any difference now and if MS currently supports surrogate pairs correctly.) What's your objection and where's the wrong concept?Saintsimonianism
@system: just like SQLite in that example, you think you work with Unicode strings, you think you store them as Unicode in the database, but there are many layers that you don't know about. So, again: a computer needs to encode a Unicode string into bytes before it does anything with it.Saintsimonianism
A very useful general UTF-8 related article is What Is UTF-8 And Why Is It Important?.Saintsimonianism
The description above is wrong. On the way into your program, you decode external UTF-anything byte sequences into internal Python Unicode strings comprising logical characters, and then later on the way out of your program, you encode those Python abstract Unicode strings into UTF-anything byte sequences. Not the other way around.Christabelle
@tchrist: I fail to see what is the disagreement between what you write and what I wrote. Care to pinpoint exactly where you think I described things “the other way around”?Saintsimonianism
Unicode strings do not “get encoded to a Python string”, unless just maybe you are meaning the str sense. External UTF-something byte strings get decoded into a Python Unicode string on input, and then Python Unicode strings get encoded into some UTF-something on output. env PYTHONIOENCODING=utf8 python -c 'print("Jose\u0301")' prints José.Christabelle
@tchrist: the two bulleted lines in my answer should be enough to clarify your “unless just maybe” (note they mention Python 2.x). Please refrain from judging any text wrong unless you've read it and grokked it first. Also, note that encodings are not restricted to UTF-something (e.g. Windows and their CP#### encodings) on input and output conversions.Saintsimonianism
The “unless just maybe” is hardly clear; hence my initial comment. I wish you had used code escapes so it was clear when you were talking about a Python internal type, not just the regular English word. Plus I use Python 3, not Python 2.Christabelle
@tchrist: In 2008 Python 3 was much less common than it is today, and yet I still made a note that my answer was about Python 2, even though it's implied by the exception reported in the original question.Saintsimonianism
Regarding ancient cultures and their lack of Unicode: the ancient Hebrews managed without Unicode, but were indeed scattered to all corners of the earth. We only returned home around the same period in which Unicode was invented!Incubation

encode is available on unicode strings, but the string you have there does not seem to be unicode (try with u'add \x93Monitoring\x93 to list ')

>>> u'add \x93Monitoring\x93 to list '.encode('latin-1','ignore')
'add \x93Monitoring\x93 to list '
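
Worth noting (an illustrative aside, not part of the original answer): inside a u'' literal, \x93 is the code point U+0093, a C1 control character rather than a curly quote, so the line above encodes cleanly but probably isn't the text you meant. Decoding the original bytes from cp1252, as the answer above explains, recovers the intended character:

>>> u'\x93'                  # in a unicode literal, \x93 means U+0093 (a control character)
u'\x93'
>>> '\x93'.decode('cp1252')  # the byte 0x93 in cp1252 is the left curly quote
u'\u201c'
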
Ardent answered 15/12, 2008 at 16:2 Comment(2)
Well, the string is coming in that way, as non-unicode, so I need to do something to the string.Whaley
This means that the string you get has already been encoded. In the example below, you simply decode and encode again - assuming a latin-1 encoding (and this may not always be true). I think you can simply go on with your string and let the output layer handle it correctly.Ardent

And the magic line is:

unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore')

The one-liner that won't raise exceptions when it is most needed (removing bad Unicode characters...)
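
Note that this only works if text is already a unicode object. A variant of the same one-liner that is often used to strip accents and other unmappable characters targets ASCII instead, so that whatever NFKD decomposition can't map to plain ASCII gets dropped (an illustrative Python 2 sketch, not the answerer's exact line):

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'caf\xe9').encode('ascii', 'ignore')
'cafe'
>>> unicodedata.normalize('NFKD', u'add \u201cMonitoring\u201d to list').encode('ascii', 'ignore')
'add Monitoring to list'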

Selfknowledge answered 15/12, 2008 at 15:57 Comment(0)

This seems to work:

'add \x93Monitoring\x93 to list '.decode('latin-1').encode('latin-1')

Any issues with that? I wonder when 'ignore', 'replace' and the other encode error handlers come into play?
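
'ignore', 'replace' and friends matter on the encode side, when the target encoding genuinely can't represent a character (a quick Python 2 illustration, separate from the line above):

>>> u = u'Omega: \u03a9'
>>> u.encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03a9' in position 7: ordinal not in range(256)
>>> u.encode('latin-1', 'ignore')     # drop what latin-1 can't express
'Omega: '
>>> u.encode('latin-1', 'replace')    # or substitute '?'
'Omega: ?'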

Whaley answered 15/12, 2008 at 16:10 Comment(5)
It comes in when you want to encode a unicode string that contains code points that are not representable in your chosen encoding, e.g. Chinese characters in latin1. You can then specify how the encoding should react to such code points.Anglian
As said above, this is doing nothing. You are passing the string through a function and then through its inverse. The final string is, in the best case, the very same as the original; in the worst, you have issues like those outlined by Heiko.Ardent
Seems to work?? str_object.decode('latin1').encode('latin1') == str_object FOR ALL STR OBJECTS. In other words, it does exactly nothing.Caren
It does nothing for Latin-1. It's different for encodings where arbitrary byte sequences aren't always valid, or where the same character has multiple encodings.Karsten
If you have to do a manual encode and/or decode, you’re doing something wrong.Christabelle
