Python: Sanitize a string for unicode? [duplicate]

Asked 11/7, 2010 at 19:54 Answered 11/7, 2010 at 22:6

Possible Duplicate:
Python UnicodeDecodeError - Am I misunderstanding encode?

I have a string that I'm trying to make safe for the unicode() function:

>>> s = " foo “bar bar ” weasel"
>>> s.encode('utf-8', 'ignore')

Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    s.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)
>>> unicode(s)

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    unicode(s)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)

I'm mostly flailing around here. What do I need to do to remove the unsafe characters from the string?

Somewhat related to this question, although I was unable to solve my problem from it.

This also fails:

>>> s
' foo \x93bar bar \x94 weasel'
>>> s.decode('utf-8')

Traceback (most recent call last):
  File "<pyshell#13>", line 1, in <module>
    s.decode('utf-8')
  File "C:\Python25\254\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 5: unexpected code byte

Hobnailed answered 11/7, 2010 at 19:54 Comment(3)

I'm wondering why str has an encode function at all, and whether the "encoding" parameter specifies the result's encoding, or the input's encoding. What exactly are you attempting to do here? – Eyecup 11/7, 2010 at 20:1

Please check this answer to a related question: “Python UnicodeDecodeError - Am I misunderstanding encode?” – Jazminejazz 11/7, 2010 at 22:37

For those hunting a solution to sanitizing unicode special characters into (X)HTML, try u'my unicode str'.encode('ascii','xmlcharrefreplace'). – Clancy 13/2, 2014 at 20:23

Good question. Encoding issues are tricky. Let's start with "I have a string." Strings in Python 2 aren't really "strings," they're byte arrays. So where did your string come from, and what encoding is it in? Your example shows curly quotes in the literal, and I'm not even sure how you did that. When I paste it into a Python interpreter, or type it on OS X with Option-[, it doesn't come through.

Looking at your second example though, you have a byte with hex value 93. That can't be UTF-8, because in UTF-8 any byte higher than 127 is part of a multibyte sequence, and 0x93 can only be a continuation byte, so it can't appear directly after an ASCII space. So I'm guessing it's supposed to be Latin-1. The problem is, 0x93 isn't a printable character in the Latin-1 character set. The range from 0x80 to 0x9f is set aside there for C1 control characters that almost nobody uses. However, Microsoft saw that effectively unused range and decided to put "curly quotes" in there. In doing so they created a similar encoding called "windows-1252", which is like Latin-1 with printable characters in that range.
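To make the difference concrete, here is a minimal sketch of how the single byte 0x93 reads under each of the three encodings mentioned. It is written for Python 3, where the raw bytes are an explicit `bytes` literal; in Python 2 the same calls work on a plain `'\x93'` string. The variable names are only illustrative.

```python
# Python 3 sketch: one byte, three interpretations.
raw = b"\x93"

# Latin-1 maps every byte to the code point with the same number,
# so 0x93 becomes the C1 control character U+0093.
as_latin1 = raw.decode("latin-1")

# windows-1252 puts LEFT DOUBLE QUOTATION MARK (U+201C) at 0x93.
as_cp1252 = raw.decode("windows-1252")

# UTF-8 rejects it: 0x93 is a continuation byte and cannot start a sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1), repr(as_cp1252), utf8_ok)
```

Note that the Latin-1 decode does not fail; it just silently produces a control character, which is why guessing the wrong single-byte encoding can go unnoticed.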

So, let's assume it is windows-1252. What now? str.decode converts bytes into Unicode, so that's the one you want. Your second example was on the right track, but it failed because the string wasn't UTF-8. Try:

>>> uni = 'foo \x93bar bar\x94 weasel'.decode("windows-1252")
>>> uni
u'foo \u201cbar bar\u201d weasel'
>>> print uni
foo “bar bar” weasel
>>> type(uni)
<type 'unicode'>

That's correct, because the opening curly quote is Unicode U+201C. Now that you have Unicode, you can serialize it to bytes in any encoding you choose (if you need to pass it across the wire) or just keep it as Unicode if it's staying within Python. If you want to convert to UTF-8, use the opposite method, unicode.encode.

>>> uni.encode("utf-8")
'foo \xe2\x80\x9cbar bar\xe2\x80\x9d weasel'

Curly quotes take 3 bytes to encode in UTF-8. You could use UTF-16 and they'd only be two bytes. You can't encode as ASCII or Latin-1 though, because those don't have curly quotes.
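The byte counts and the ASCII limitation can be checked directly. The following is a sketch in Python 3 (the starting value is `bytes` there rather than a Python 2 `str`), and it also shows two standard error handlers, `ignore` and `xmlcharrefreplace`, for the case where you really do need ASCII output. All variable names are made up for the example.

```python
# Python 3 sketch: decode once, then encode however you need.
raw = b"foo \x93bar bar\x94 weasel"
text = raw.decode("windows-1252")

# Size of one curly quote in each encoding ("utf-16-le" avoids the BOM).
utf8_len = len("\u201c".encode("utf-8"))
utf16_len = len("\u201c".encode("utf-16-le"))

# A strict ASCII encode fails because ASCII has no curly quotes.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Error handlers let you drop the characters or escape them instead.
dropped = text.encode("ascii", "ignore")
escaped = text.encode("ascii", "xmlcharrefreplace")

print(utf8_len, utf16_len, strict_ok)
print(dropped)
print(escaped)
```

Dropping characters loses information, so `ignore` is probably only appropriate when the output genuinely must be plain ASCII and the quotes are not needed.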

Gusgusba answered 11/7, 2010 at 22:6 Comment(9)

+1, but you should also mention that this answer is specific to Python 2.x. In 3.x, the str type gets renamed to bytes and unicode gets renamed to str. While confusing at first, this change makes this kind of thing less likely to happen. – Whitaker 11/7, 2010 at 22:42

+1 for "let's start with 'I have a string'" haha – Hobnailed 11/7, 2010 at 22:44

@Daniel Not to be incestuous but I just voted up your vote-up explanation. It's true: the above is Python 2.x specific. – Gusgusba 11/7, 2010 at 22:50

I'd also mention that this behaviour depends on what encoding your source file is in. If the source file was saved as utf-8, then you'd indeed want to decode it as utf-8. (darkporter's example bypassed this minor complication by using hex escapes directly). – Zeb 11/7, 2010 at 23:21

@A Good point. What encoding is used for the Python interactive console? – Gusgusba 11/7, 2010 at 23:31

@darkporter: On Windows for the en_US locale, it's IBM437. On Linux, it's usually UTF-8. – Marbling 12/7, 2010 at 3:1

'\x80' - '\x9F' are defined in Latin-1. They're the C1 control characters that nobody uses. '\x93' is "Set Transmit State". – Marbling 12/7, 2010 at 3:4

Wikipedia goes into plenty of detail about the sub-versions of Latin-1. If by Latin-1 you mean "iso-8859-1" then it appears you're right. But if not "defined" then certainly "unprintable." – Gusgusba 12/7, 2010 at 3:30

When you're running under Windows you can sometimes use 'mbcs' instead of an explicit code page. – Capelin 3/11, 2015 at 0:6

EDIT. Looks like your string is encoded in such a way that “ (LEFT DOUBLE QUOTATION MARK) becomes \x93 and ” (RIGHT DOUBLE QUOTATION MARK) becomes \x94. There are a number of code pages with such a mapping; CP1250 is one of them, so you may use this:

s = s.decode('cp1250')

For all the codepages which map “ to \x93 see here (all of them also map ” to \x94, which can be verified here).
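If the linked tables are not at hand, the same check can be done against Python's own codecs. This is a rough Python 3 sketch; the `candidates` list is an arbitrary sample chosen for illustration, not a complete list of code pages, and `maps_to_curly_quotes` is a hypothetical helper name.

```python
# Python 3 sketch: test a handful of codecs for the 0x93/0x94 mapping.
# The candidate list is an illustrative sample, not exhaustive.
candidates = ["latin-1", "cp437", "cp1250", "cp1251", "cp1252", "cp1253"]


def maps_to_curly_quotes(codec_name):
    # True if this codec decodes bytes 0x93 0x94 as U+201C U+201D.
    try:
        return b"\x93\x94".decode(codec_name) == "\u201c\u201d"
    except UnicodeDecodeError:
        # Some codecs leave certain bytes undefined.
        return False


matches = [name for name in candidates if maps_to_curly_quotes(name)]
print(matches)
```

Several codecs matching is expected, which is why the choice between them has to come from knowing where the data originated rather than from these two bytes alone.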

Tetraspore answered 11/7, 2010 at 20:8 Comment(3)

That call fails for me (see above) – Hobnailed 11/7, 2010 at 21:10

@Rosarch OK, now I see the original string. I've updated the answer (and in the meantime @darkporter had come up with the same solution). – Tetraspore 11/7, 2010 at 22:13

Nice link on the code pages. Looks like they're all variations on "windows" though. If you're "Western" I'd say just stick with 1252. – Gusgusba 11/7, 2010 at 22:23
