How to read Unicode input and compare Unicode strings in Python?

Asked 25/1, 2009 at 2:19 Answered 25/1, 2009 at 10:25

I work in Python and would like to read user input (from the command line) in Unicode format. Is there a Unicode equivalent of raw_input?

Also, I would like to test Unicode strings for equality and it looks like a standard == does not work.

Toluol answered 25/1, 2009 at 2:19 Comment(0)

raw_input() returns strings as encoded by the OS or UI facilities. The difficulty is knowing which encoding that is. You might attempt the following:

import sys, locale
text= raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))

which should work correctly in most cases.
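The fallback above can be wrapped in a small helper. This is only a sketch, written for Python 3 rather than the Python 2 used in this answer; the name guess_input_encoding is hypothetical and not part of any library.

```python
import locale
import sys


def guess_input_encoding(stream=None):
    """Return the stream's declared encoding, else the locale's preferred one.

    Hypothetical helper: `stream` defaults to sys.stdin. Some streams
    (pipes, embedded interpreters) report no encoding, hence the fallback.
    """
    if stream is None:
        stream = sys.stdin
    declared = getattr(stream, "encoding", None)
    return declared or locale.getpreferredencoding(True)
```

Passing the stream as a parameter keeps the helper easy to exercise with a stand-in object whose encoding attribute is None.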

We need more details about the Unicode comparisons that are not working in order to help you. However, it might be a matter of normalization. Consider the following:

>>> a1= u'\xeatre'
>>> a2= u'e\u0302tre'

a1 and a2 are equivalent but not equal:

>>> print a1, a2
être être
>>> print a1 == a2
False

So you might want to use the unicodedata.normalize() function:

>>> import unicodedata as ud
>>> ud.normalize('NFC', a1)
u'\xeatre'
>>> ud.normalize('NFC', a2)
u'\xeatre'
>>> ud.normalize('NFC', a1) == ud.normalize('NFC', a2)
True
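If this comparison is needed in several places, it could be wrapped in a function. The following is a minimal sketch in Python 3 syntax (no u prefix, print as a function); nfc_equal is a hypothetical name chosen for illustration.

```python
import unicodedata


def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


a1 = "\xeatre"      # precomposed U+00EA
a2 = "e\u0302tre"   # "e" followed by U+0302 COMBINING CIRCUMFLEX ACCENT

print(a1 == a2)           # False
print(nfc_equal(a1, a2))  # True
```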

If you give us more information, we might be able to help you more, though.

Intelligibility answered 25/1, 2009 at 10:25 Comment(1)

text= raw_input().decode(sys.stdout.encoding) should be text= raw_input().decode(sys.stdin.encoding); it reads better ;) – Planer 28/8, 2011 at 12:38

It should work. raw_input returns a byte string which you must decode using the correct encoding to get your unicode object. For example, the following works for me under Python 2.5 / Terminal.app / OSX:

>>> bytes = raw_input()
日本語 Ελληνικά
>>> bytes
'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e \xce\x95\xce\xbb\xce\xbb\xce\xb7\xce\xbd\xce\xb9\xce\xba\xce\xac'

>>> uni = bytes.decode('utf-8') # substitute the encoding of your terminal if it's not utf-8
>>> uni
u'\u65e5\u672c\u8a9e \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac'

>>> print uni
日本語 Ελληνικά

As for comparing unicode strings: can you post an example where the comparison doesn't work?

Bradski answered 25/1, 2009 at 2:38 Comment(1)

how would you do this same thing in python3? – Shanel 19/4, 2018 at 2:21
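In Python 3, input() already returns a str (Unicode), so no explicit decode step is needed for ordinary terminal input. If you do want to start from raw bytes as in the answer above, one possible approach is to read from the binary layer of standard input. The sketch below assumes UTF-8 by default; read_unicode_line is a hypothetical helper name.

```python
import io


def read_unicode_line(binary_stream, encoding="utf-8"):
    """Read one line of bytes and decode it to str (hypothetical helper)."""
    raw = binary_stream.readline()
    # Drop the trailing newline so the result resembles what input() returns.
    return raw.decode(encoding).rstrip("\r\n")


# Simulated terminal input; in real use one would pass sys.stdin.buffer
# and the terminal's encoding instead of a BytesIO object.
fake_stdin = io.BytesIO("日本語 Ελληνικά\n".encode("utf-8"))
uni = read_unicode_line(fake_stdin)
print(uni)
```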

I'm not really sure which format you mean by "Unicode format"; there are several. UTF-8? UTF-16? In any case, you should be able to read a normal string with raw_input and then decode it using the string's decode method:

raw = raw_input("Please input some funny characters: ")
decoded = raw.decode("utf-8")

If you have a different input encoding, just use "utf-16" or whatever instead of "utf-8". Also see the codecs module's docs for the different kinds of encodings.

Comparing then should work just fine with ==. If you have string literals containing special characters you should prefix them with "u" to mark them as unicode:

if decoded == u"äöü":
  print "Do you speak German?"

And if you want to output these strings again, you probably want to encode them again in the desired encoding:

print decoded.encode("utf-8")
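To illustrate why the choice of codec matters for comparisons, here is a small sketch in Python 3 syntax (an assumption, since the code above is Python 2). It decodes the same bytes with the right and the wrong encoding; the variable names are illustrative only.

```python
text = "äöü"

# Encode to bytes, as a UTF-8 terminal would deliver them.
data = text.encode("utf-8")
print(data)  # b'\xc3\xa4\xc3\xb6\xc3\xbc'

# Decoding with the matching codec round-trips, so == behaves as expected.
right = data.decode("utf-8")
print(right == text)  # True

# Decoding with a mismatched codec yields different characters,
# so the comparison fails even though the bytes are unchanged.
wrong = data.decode("latin-1")
print(wrong == text)  # False
```

A failing == after reading input is therefore often a sign that the bytes were decoded with an encoding other than the one the terminal used.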

Krasnodar answered 25/1, 2009 at 2:42 Comment(0)

In the general case, it's probably not possible to compare unicode strings with a plain codepoint-by-codepoint comparison. The problem is that there are several ways to compose the same characters. A simple example is accented Latin characters. Although there are codepoints for basically all of the commonly used accented characters, it is also correct to compose them from unaccented base letters and a non-spacing accent. This issue is more significant in many non-Latin alphabets.

Brunhilde answered 25/1, 2009 at 3:20 Comment(1)

Before comparing, then, one can normalize or denormalize the input strings. That's what the unicodedata module is there for. – Intelligibility 25/1, 2009 at 10:13
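The two spellings described in this answer can be inspected with the unicodedata module. The following is a minimal sketch in Python 3 syntax, using NFD to decompose and NFC to recompose a single accented letter.

```python
import unicodedata

composed = "\xea"  # single codepoint U+00EA
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))  # 1 2

# The decomposed form is a base letter plus a non-spacing accent.
names = [unicodedata.name(ch) for ch in decomposed]
print(names)  # ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']

# Recomposing with NFC gives back the single-codepoint form.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Either form works for comparison as long as both strings are normalized to the same one.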
