Converting Unicode objects with non-ASCII symbols in them into string objects (in Python)

I want to send Chinese characters to be translated by an online service, and have the resulting English string returned. I'm using simple JSON and urllib for this.

And yes, I am declaring

# -*- coding: utf-8 -*-

at the top of my code.

Now everything works fine if I feed urllib a string type object, even if that object contains what would be Unicode information. My function is called translate.

For example:

stringtest1 = '無與倫比的美麗'

print translate(stringtest1)

results in the proper translation and doing

type(stringtest1) 

confirms this to be a string object.

But if I do

stringtest1 = u'無與倫比的美麗'

and try to use my translation function I get this error:

  File "C:\Python27\lib\urllib.py", line 1275, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-8: ordinal not in range(128)

After researching a bit, it seems this is a common problem.

Now, if I type in a script

stringtest1 = '無與倫比的美麗' 
stringtest2 = u'無與倫比的美麗'
print 'stringtest1',stringtest1
print 'stringtest2',stringtest2

executing it prints:

stringtest1 無與倫比的美麗
stringtest2 無與倫比的美麗

But just typing the variables in the console:

>>> stringtest1
'\xe7\x84\xa1\xe8\x88\x87\xe5\x80\xab\xe6\xaf\x94\xe7\x9a\x84\xe7\xbe\x8e\xe9\xba\x97'
>>> stringtest2
u'\u7121\u8207\u502b\u6bd4\u7684\u7f8e\u9e97'

gets me that.

My problem is that I don't control how the information to be translated comes to my function. And it seems I have to bring it in the Unicode form, which is not accepted by the function.

So, how do I convert one thing into the other?

I've read Stack Overflow question Convert Unicode to a string in Python (containing extra symbols).

But this is not what I'm after. Urllib accepts the string object but not the Unicode object, even though both contain the same information.

Well, at least they are the same in the eyes of the web application I'm sending the unchanged information to; I'm not sure whether they are still equivalent things in Python.

Honkytonk answered 8/9, 2010 at 15:40 Comment(0)

When you get a unicode object and want to return a UTF-8 encoded byte string from it, use theobject.encode('utf8').
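As a small illustration using the two values from the question's console output: the sketch below checks that encoding the unicode object as UTF-8 yields exactly the byte string shown for stringtest1, and that an ASCII encode (which is what str() attempts implicitly on a unicode object in Python 2) fails. The variable names are mine, and the snippet is written so that it should run on both Python 2 and Python 3.

```python
# The code points shown for stringtest2 in the question.
text = u'\u7121\u8207\u502b\u6bd4\u7684\u7f8e\u9e97'

# The UTF-8 bytes shown for stringtest1 in the question.
raw = (b'\xe7\x84\xa1\xe8\x88\x87\xe5\x80\xab\xe6\xaf\x94'
       b'\xe7\x9a\x84\xe7\xbe\x8e\xe9\xba\x97')

# Text -> bytes and bytes -> text are inverse operations for UTF-8.
encoded = text.encode('utf8')
round_trip = raw.decode('utf8')

# ASCII cannot represent these characters, hence the UnicodeEncodeError
# raised inside urlencode when it is handed the unicode object.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print('%d code points, %d bytes' % (len(text), len(raw)))
```

So the two objects carry the same characters, but one is text (code points) and the other is a particular byte encoding of that text.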

It seems strange that you don't know whether the incoming object is a str or unicode -- surely you do control the call sites to that function, too?! But if that is indeed the case, for whatever weird reason, you may need something like:

def ensureutf8(s):
    if isinstance(s, unicode):
        s = s.encode('utf8')
    return s

which only encodes conditionally, that is, if it receives a unicode object, not if the object it receives is already a byte string. It returns a byte string in either case.
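A hedged usage sketch of the same idea combined with urlencode follows. Assumptions on my part: the query parameter name 'q' is made up (the question does not show the translation service's API), the helper is renamed ensure_utf8, and the import fallback and text_type alias are only there so the snippet runs on Python 3 as well as Python 2.

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

# `unicode` on Python 2, `str` on Python 3.
text_type = type(u'')


def ensure_utf8(s):
    """Return UTF-8 bytes for text input; pass byte strings through unchanged."""
    if isinstance(s, text_type):
        s = s.encode('utf8')
    return s


as_text = u'\u7121\u8207\u502b\u6bd4\u7684\u7f8e\u9e97'
as_bytes = as_text.encode('utf8')

# 'q' is a hypothetical parameter name, not the real service's API.
query_from_text = urlencode({'q': ensure_utf8(as_text)})
query_from_bytes = urlencode({'q': ensure_utf8(as_bytes)})

print(query_from_text)
```

Both calls produce the same percent-encoded query string, so the caller no longer has to care which of the two types it was given.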

BTW, part of your confusion seems to be due to the fact that you don't know that just entering an expression at the interpreter prompt will show you its repr, which is not the same effect you get with print;-).
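To make the repr/print distinction concrete, here is a minimal sketch. It deliberately uses an ASCII string containing a tab, an example of my own choosing, because that behaves identically on Python 2 and Python 3; in Python 2 the same mechanism is what turns non-ASCII characters into the \x.. and \u.... escapes seen in the question.

```python
s = 'tab\there'

# print writes the string itself, so the tab appears as whitespace.
print(s)

# The interactive prompt echoes repr(s): quotes included, escapes visible.
shown_at_prompt = repr(s)
print(shown_at_prompt)
```

The value is the same in both cases; only the way it is displayed differs.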

Ebracteate answered 8/9, 2010 at 15:52 Comment(2)
Thank you! Things are getting clearer now. And thanks so much for the extra tip. ;) - Honkytonk
A more robust example of a method that converts to unicode can be found at https://mcmap.net/q/103237/-convert-raw-byte-string-to-unicode-without-knowing-the-codepage-beforehand - Fenian
