Convert a Unicode string to a string in Python (containing extra symbols)
T

12

546

How do you convert a Unicode string (containing extra characters like £ $, etc.) into a Python string?

Traver answered 30/7, 2009 at 15:41 Comment(14)
What do you mean by "a python string"? Do you want to encode the unicode string?Therewith
I'm getting unicode sent from a form on an HTML window, with symbols that I want to be able to save to a file, but it's not working.Traver
We need to know what Python version you are using, and what it is that you are calling a Unicode string. Do the following on a short unicode_string that includes the currency symbols that are causing the bother: Python 2.x : print type(unicode_string), repr(unicode_string) Python 3.x : print(type(unicode_string), ascii(unicode_string)) Then edit your question and copy/paste the results of the above print statement. DON'T retype the results. Also look up near the top of your HTML and see if you can find something like this: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859Thomas
I doubt that you get unicode from a web request. You probably get UTF-8 encoded Unicode.Santoyo
The charset is currently at charset=utf-8Traver
@lutz: how exactly is "UTF-8 encoded Unicode" not unicode?Spectrophotometer
You should really clarify what you mean by unicode string and python string (giving concrete examples would be the best I guess) as it's clear from comments there are different interpretations of your question. I wonder why you haven't done this although it's over 3,5 years since you asked this question.Presumably
@jalf: If it is encoded; it is no longer Unicode e.g., unicode_string = u"I'm unicode string"; bytestring = unicode_string.encode('utf-8'); unicode_again = bytestring.decode('utf-8')Maxima
@J.F.Sebastian: You mean "it is not of the Python Unicode string datatype" (which goes without saying, because what you receive over a network socket from a HTTP request is a stream of bytes, and not a Python value), but UTF-8 text most certainly is Unicode. That is kind of the entire point in the UTF-8 encoding.Spectrophotometer
@jalf: utf-8 is a character encoding. You can use it to interpret a sequence of bytes as text (sequence of Unicode codepoints -- that you may call Unicode text (it has nothing to do with Python)). Sequence of bytes itself is not a Unicode string.Maxima
@J.F.Sebastian But we are not talking about "a sequence of bytes itself". We are talking about a string encoded as UTF-8. There is no possible way in which a string encoded as UTF-8 is not a Unicode string, because UTF-8 is a Unicode encoding. It does not encode cars, sunsets, emotions or waffles. It encodes Unicode text. A text encoded as UTF-8 is a Unicode text. I am simply reacting to your incorrect statement that "a string which is encoded is no longer Unicode".Spectrophotometer
@wnys (plus encoding rot-13): Let's check whether an encoded string is the same as original. fyi, wnys is jalf encoded using rot-13 encoding.Maxima
Hopefully future passers-by come to understand that when you say something is "encoded" you are saying "it's not what it actually is, it's a representation of another thing in a form that we can handle with specific restrictions." E.g. using UTF-8 so that C string handling utilities "work," despite C not knowing anything of Unicode or UTF.Kalamazoo
Retagged this as a 2.x question because it is incoherent in 3.x: "a unicode string" is "a Python string" in every possible meaningful sense in 3.x. (In 2.x, str means a bytes type that is not a real string type, but really Unicode strings are still "Python strings"...).Mu
D
627

See unicodedata.normalize

title = u"Klüft skräms inför på fédéral électoral große"
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
'Kluft skrams infor pa federal electoral groe'
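
For comparison, here is a minimal Python 3 sketch that puts the call above next to a plain UTF-8 encode. It assumes Python 3, where encode() returns bytes; the names lossy and lossless are only for illustration. The first form discards the accents and the ß, while the second keeps every character and can be decoded back.

```python
import unicodedata

title = "Klüft skräms inför på fédéral électoral große"

# NFKD splits accented letters into base letter + combining mark;
# encoding to ASCII with 'ignore' then drops the marks and anything
# else that has no ASCII form (here the 'ß').
lossy = unicodedata.normalize("NFKD", title).encode("ascii", "ignore")

# Encoding to UTF-8 loses nothing and can be reversed.
lossless = title.encode("utf-8")

print(lossy)                              # b'Kluft skrams infor pa federal electoral groe'
print(lossless.decode("utf-8") == title)  # True
```

Which of the two is appropriate depends on whether the goal is an ASCII approximation or a faithful byte representation of the text.
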
Douceur answered 30/7, 2009 at 15:44 Comment(23)
+1 answers the question as worded, @williamtroup's problem of not being able to save unicode to a file sounds like an entirely different issue worthy of a separate questionRugg
@John - that answer predates the OP's clarification.Dominican
@Mark Roddy: His question as written is how to convert a "Unicode string" (whatever he means by that) containing some currency symbols to a "Python string" (whatever ...) and you think that a remove-some-diacritics delete-other-non-ascii characters kludge answers his question???Thomas
@Dominic: I'm very sorry; I'll rephrase that: The OP's unclarified question said he wanted to CONVERT it TO A PYTHON STRING, not mangle it.Thomas
Note that normalize() does not handle Unicode punctuation (e.g., smart quotes, apostrophes, dashes), probably because punctuation characters are not composite characters. There's a good discussion and alternate solution here: https://mcmap.net/q/74668/-where-is-python-39-s-quot-best-ascii-for-this-unicode-quot-database-closedStrophe
This answer, as it is, without any reservations is plainly WRONG as hinted by @JohnMachin. Please consider VOTING IT DOWN.Presumably
@JohnMachin This answers the question word for word: The only way to convert a unicode string to a str is to either drop or convert the characters that cannot be represented in ASCII. So +1 from me.Vomiturition
@PiotrDobrogost See my prior comment as well.Vomiturition
@lzkata: no, it is not. type(title) == unicode and type(title.encode('utf-8')) == str. No need to corrupt the input, to get a bytestring that can be saved to a file.Maxima
Why does this have so many upvotes? And why is it the accepted answer? It's a good way to strip diacritics from Latin text, which has its uses (implementing a semi-naïve search feature, for example), but it is NOT what the OP was asking.Oblivious
this is an utter embarrassment. please do not arbitrarily destroy parts of characters in foreign languages. (this will completely remove any CJK text, for example.) fix whatever broken system is choking on them in the first place.Overblouse
@J.F.Sebastian You can save a not-encoded string to a file as well, presumably, but the problem then becomes one of retrieving it from that file. Without a standardized mechanism to interpret that file (e.g. XML with a designated character encoding) then all bets are off. These days you can assume UTF-8 but isn't assuming things what gave us 8-bit chars in the first place?Kalamazoo
What is NFKD (passed as the first argument to unicodedata.normalize)?Modena
@Modena docs.python.org/2/library/…Skelp
Well, I "missearched" and ended up here, with the exact code I want to do, so ...Erund
Please do not use this code! Completely deleting characters like a German "ß" is in no way converting. This code "converts" Fuß to Fu or groß to gro where neither Fu nor gro have any meaning in German. The same holds true for other languages where Rødgrød becomes Rdgrd.Cognate
This answer indeed doesn't answer the question.Chestonchest
@Maxima , when I title.encode('utf-8') this: u"Klüft skräms inför på fédéral électoral große", becomes this: "Kl├╝ft skr├ñms inf├╢r p├Ñ f├⌐d├⌐ral ├⌐lectoral gro├ƒe" ; is that what you meant by "bytestring"? It seems "mangled" to me; did I do something wrong? Is this the expected result?Sazerac
@TheRedPea you did wrong. The result is mojibake. To write Unicode text to a file in Python https://mcmap.net/q/74669/-writing-unicode-text-to-a-text-fileMaxima
@Maxima thank you. But I'm not interested in writing to file. OP doesn't mention writing to file. Your linked answer shows how to write a unicode object directly to file. But this question is about strings, not files; I'm referring to your earlier comment type(title.encode('utf-8')) == str can you print the result for me when you run this in Python 2.# ? u"Klüft skräms inför på fédéral électoral große".encode('utf-8') The type(..) will be str as you say, but what's the result of encode? You said "No need to corrupt the input"; how can I avoid mojibake/ corrupting input?Sazerac
I guess doing encode, i.e. u"Klüft".encode('utf-8') as @Maxima suggested will replace the ü ; the result is this string: 'Kl\xc3\xbcft'. Image here. Printing the latter string, will appear as mojibake. Decoding the latter string will go back to a unicode object: u"Klüft". Same is shown in answers below. I think this is what @Vomiturition said: "convert the characters that cannot be represented in ASCII." i.e. u'ü' -> '\xc3\xbc'Sazerac
@TheRedPea the mojibake in your comment indicates that you did write the text (unicode type on Python 2) encoded to bytes (str type on Python 2) using one character encoding to a file and then read the file using a different character encoding. If you want to print text, use unicode, don't encode to bytes prematurely (the point of the linked answer). If you want to use bytes to represent text (you shouldn't), use the same encoding for writing&reading (e.g., see sys.stdout.encoding if it is set — yes, sys.stdout is a file and you use it when you print).Maxima
"(e.g., see sys.stdout.encoding if it is set — yes, sys.stdout is a file and you use it when you print)." great, thanksSazerac
O
341

You can use encode to ASCII if you don't need to translate the non-ASCII characters:

>>> a=u"aaaàçççñññ"
>>> type(a)
<type 'unicode'>
>>> a.encode('ascii','ignore')
'aaa'
>>> a.encode('ascii','replace')
'aaa???????'
>>>
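
As a rough Python 3 counterpart of the session above (an assumption about the reader's environment, since the original is Python 2), the same error handlers can be compared side by side. In Python 3 the results are bytes objects, and the default handler is 'strict'.

```python
a = "aaaàçççñññ"

print(a.encode("ascii", "ignore"))             # b'aaa'
print(a.encode("ascii", "replace"))            # b'aaa???????'
print(a.encode("ascii", "xmlcharrefreplace"))  # b'aaa&#224;&#231;&#231;&#231;&#241;&#241;&#241;'
print(a.encode("ascii", "backslashreplace"))   # b'aaa\\xe0\\xe7\\xe7\\xe7\\xf1\\xf1\\xf1'

# The default handler is 'strict', which raises instead of losing data.
try:
    a.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)               # strict: ordinal not in range(128)
```
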
Overseas answered 31/7, 2009 at 7:13 Comment(4)
Awesome answer. Exactly what I needed. Also, great presentation to show the effect of ignore vs replaceLexicologist
or a.encode('ascii', 'xmlcharrefreplace') gives 'aaa&#224;&#231;&#231;&#231;&#241;&#241;&#241;'.Particularity
type(a) is str in Python 3.6.8 and doesn't have any encode() method.Omen
python statement: a.encode('ascii','ignore') result: b'aaa'Ceratodus
S
158
>>> text=u'abcd'
>>> str(text)
'abcd'

This works only if the string contains nothing but ASCII characters.
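
In Python 3, str(text) on a string is a no-op, so the closest equivalent is an explicit ASCII encode. A small sketch, assuming Python 3; to_ascii_bytes is a hypothetical helper name, not something from the answer above:

```python
def to_ascii_bytes(text):
    """Return text as ASCII bytes, or None if it has non-ASCII characters."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None


print(to_ascii_bytes("abcd"))   # b'abcd'
print(to_ascii_bytes("£10"))    # None
```

Catching the error explicitly makes the ASCII-only assumption visible instead of letting it fail somewhere later.
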

Sergu answered 25/10, 2012 at 16:27 Comment(3)
This would only work on windows. And will break if there are non-ascii symbols.Microphone
This breaks if the content of the string is actually unicode, not just ascii characters in a unicode string. Don't do this, you'll get random UnicodeEncodeError exceptions all over the place.Fallal
This answer helped me. If you know that your string is ascii and you need to cast it back to a non-unicode string, this is very useful.Chenault
C
122

If you have a Unicode string, and you want to write this to a file, or other serialised form, you must first encode it into a particular representation that can be stored. There are several common Unicode encodings, such as UTF-16 (uses two bytes for most Unicode characters) or UTF-8 (1-4 bytes / codepoint depending on the character), etc. To convert that string into a particular encoding, you can use:

>>> s= u'£10'
>>> s.encode('utf8')
'\xc2\xa310'
>>> s.encode('utf16')
'\xff\xfe\xa3\x001\x000\x00'

This raw string of bytes can be written to a file. However, note that when reading it back, you must know what encoding it is in and decode it using that same encoding.
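
To illustrate the point about decoding with the same encoding, here is a short Python 3 sketch (Python 3 is an assumption; the variable names are illustrative). It round-trips the string through UTF-8 and then shows what happens when the bytes are decoded with a different encoding.

```python
s = "£10"

utf8_bytes = s.encode("utf-8")
utf16_bytes = s.encode("utf-16-le")   # explicit byte order, so no BOM is added

print(utf8_bytes)    # b'\xc2\xa310'
print(utf16_bytes)   # b'\xa3\x001\x000\x00'

# Decoding with the encoding that was used for writing restores the text.
print(utf8_bytes.decode("utf-8") == s)   # True

# Decoding with a different encoding does not fail here; it silently
# produces the wrong characters (mojibake).
print(utf8_bytes.decode("latin-1"))      # Â£10
```
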

When writing to files, you can get rid of this manual encode/decode process by using the codecs module. So, to open a file that encodes all Unicode strings into UTF-8, use:

import codecs
f = codecs.open('path/to/file.txt','w','utf8')
f.write(my_unicode_string)  # Stored on disk as UTF-8

Do note that anything else that is using these files must understand what encoding the file is in if they want to read them. If you are the only one doing the reading/writing this isn't a problem, otherwise make sure that you write in a form understandable by whatever else uses the files.

In Python 3, this form of file access is the default, and the built-in open function will take an encoding parameter and always translate to/from Unicode strings (the default string object in Python 3) for files opened in text mode.
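
A minimal sketch of that Python 3 behaviour, using a temporary directory so it does not touch real files. The file name prices.txt is hypothetical.

```python
import os
import tempfile

text = "£10 and €20"

# A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prices.txt")   # hypothetical file name

    # Text mode plus an explicit encoding: write() accepts str directly.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read back with the same encoding to get the same text.
    with open(path, encoding="utf-8") as f:
        restored = f.read()

    # Binary mode shows the bytes that actually went to disk.
    with open(path, "rb") as f:
        raw = f.read()

print(restored == text)   # True
print(raw)                # b'\xc2\xa310 and \xe2\x82\xac20'
```

Passing encoding explicitly avoids depending on the platform default, which can differ between systems.
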

Carousel answered 30/7, 2009 at 16:44 Comment(0)
A
60

Here is an example:

>>> u = u'€€€'
>>> s = u.encode('utf8')
>>> s
'\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'
Abukir answered 30/7, 2009 at 15:46 Comment(1)
Can anyone explain why, when I encode the Euro symbol to utf8 as shown here, the result is only question marks? Here is an image of my Python, version 2.7.13. (I can encode other unicode objects like u"Klüft", but not the Euros?)Sazerac
V
15

In my case, the file contained unicode-escaped strings:

line = "\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0437\\u0430\\u0446\\u0438\\u044f .....\","

My solution was:

f  = open("file-json.log", encoding="utf-8")
qq = f.readline() 

print(qq)                          
# {"log":\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0440\\u0438\\u0437\\u0430\\u0446\\u0438\\u044f \\u043f\\u043e\\u043b\\u044c\\u0437\\u043e\\u0432\\u0430\\u0442\\u0435\\u043b\\u044f\"}

print(qq.encode().decode("unicode-escape").encode().decode("unicode-escape")) 
# '{"log":"message": "Авторизация пользователя"}\n'
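
One caveat with the encode/decode route: the unicode-escape codec reads bytes as Latin-1, so any non-ASCII character already present in the text gets damaged. The sketch below (Python 3 assumed) compares it with a regex-based alternative; decode_u_escapes is a hypothetical helper, and it does not combine surrogate pairs such as \ud83d\ude00.

```python
import re


def decode_u_escapes(text):
    """Replace literal \\uXXXX sequences in text with the characters they name."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


escaped = r"\u0410\u0432\u0442\u043e"   # backslashes are literal here

print(decode_u_escapes(escaped))                          # Авто
print(escaped.encode("ascii").decode("unicode-escape"))   # Авто

# With non-ASCII input the encode/decode route damages existing characters,
# because unicode-escape reads the bytes as Latin-1.
mixed = "é " + escaped
print(mixed.encode("utf-8").decode("unicode-escape"))     # Ã© Авто
print(decode_u_escapes(mixed))                            # é Авто
```
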
Vardhamana answered 28/11, 2019 at 13:9 Comment(1)
It worked even when I used only: result.encode().decode('unicode-escape')Whitehall
T
6

Well, if you're willing/ready to switch to Python 3 (which you may not be due to the backwards incompatibility with some Python 2 code), you don't have to do any converting; all text in Python 3 is represented with Unicode strings, which also means that there's no more usage of the u'<text>' syntax. You also have what are, in effect, strings of bytes, which are used to represent data (which may be an encoded string).

http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

(Of course, if you're currently using Python 3, then the problem is likely something to do with how you're attempting to save the text to a file.)
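
A small sketch of one common way this goes wrong in Python 3: writing a str to a binary stream. Here io.BytesIO stands in for a file opened with mode 'wb'; that substitution is an assumption made to keep the example self-contained.

```python
import io

text = "£10"
buffer = io.BytesIO()          # stands in for a file opened with mode 'wb'

try:
    buffer.write(text)         # str is text, not bytes
except TypeError as exc:
    print("TypeError:", exc)

buffer.write(text.encode("utf-8"))   # encode explicitly for a binary stream
print(buffer.getvalue())             # b'\xc2\xa310'
```
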

Telephony answered 30/7, 2009 at 16:9 Comment(4)
In Python 3 strings are Unicode strings. They are never encoded. I found the following text useful: joelonsoftware.com/articles/Unicode.htmlSantoyo
He wants to save it to a file; how does your answer help with that?Thomas
@lutz: Right, I'd forgotten that Unicode is a character map rather than an encoding. @John: There isn't enough information at the moment to know what the problem with saving it is. Is he getting an error? Is he not getting any errors, but when opening the file externally he gets mojibake? Without that information, there are far too many possible solutions that could be provided.Telephony
@Cat: There isn't any information at the moment to know what he's got, let alone what his saving problem is. I've asked him to provide some facts -- see my answer.Thomas
A
6

There is a library called ftfy that can help with Unicode issues. It has made my life easier.

Example 1

import ftfy
print(ftfy.fix_text('ünicode'))

output -->
ünicode

Example 2 - UTF-8

import ftfy
print(ftfy.fix_text('\xe2\x80\xa2'))

output -->
•

Example 3 - Unicode code point

import ftfy
print(ftfy.fix_text(u'\u2026'))

output -->
…

https://ftfy.readthedocs.io/en/latest/

pip install ftfy

https://pypi.org/project/ftfy/

Adlib answered 16/11, 2020 at 14:10 Comment(0)
C
3

Here is some example code:

import unicodedata    
raw_text = u"here $%6757 dfgdfg"
convert_text = unicodedata.normalize('NFKD', raw_text).encode('ascii','ignore')
Chapple answered 19/12, 2016 at 7:59 Comment(1)
How is this answer different from the accepted answer?Agape
J
2

None of the answers worked for my case, where I had a string variable containing unicode chars, and none of the encode/decode approaches explained here did the job.

If I do in a Terminal

echo "no me llama mucho la atenci\u00f3n"

or

python3
>>> print("no me llama mucho la atenci\u00f3n")

The output is correct:

output: no me llama mucho la atención

But working with scripts loading this string variable didn't work.

This is what worked in my case, in case it helps anybody:

import json

string_to_convert = "no me llama mucho la atenci\u00f3n"
print(json.dumps(json.loads(r'"%s"' % string_to_convert), ensure_ascii=False))
output: "no me llama mucho la atención"
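
The JSON route matters when the backslash sequence is stored as literal text, for example when it was read from a file. A sketch of that situation, assuming Python 3; the raw literal below simulates such input. The approach fails if the text contains a double quote or other characters that are not valid inside a JSON string.

```python
import json

# A raw literal, so the backslash and 'u00f3' are six separate characters,
# as they would be when read from a file that stores the escape as text.
raw = r"no me llama mucho la atenci\u00f3n"

decoded = json.loads('"' + raw + '"')   # wrap in quotes to form a JSON string

print(len(raw), len(decoded))   # 34 29
print(decoded)                  # no me llama mucho la atención
```
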
Junoesque answered 5/11, 2019 at 20:40 Comment(1)
you need to import jsonJunoesque
R
1

This is my function

import unicodedata
def unicode_to_ascii(note):
    str_map = {'Š' : 'S', 'š' : 's', 'Đ' : 'D', 'đ' : 'd', 'Ž' : 'Z', 'ž' : 'z', 'Č' : 'C', 'č' : 'c', 'Ć' : 'C', 'ć' : 'c', 'À' : 'A', 'Á' : 'A', 'Â' : 'A', 'Ã' : 'A', 'Ä' : 'A', 'Å' : 'A', 'Æ' : 'A', 'Ç' : 'C', 'È' : 'E', 'É' : 'E', 'Ê' : 'E', 'Ë' : 'E', 'Ì' : 'I', 'Í' : 'I', 'Î' : 'I', 'Ï' : 'I', 'Ñ' : 'N', 'Ò' : 'O', 'Ó' : 'O', 'Ô' : 'O', 'Õ' : 'O', 'Ö' : 'O', 'Ø' : 'O', 'Ù' : 'U', 'Ú' : 'U', 'Û' : 'U', 'Ü' : 'U', 'Ý' : 'Y', 'Þ' : 'B', 'ß' : 'Ss', 'à' : 'a', 'á' : 'a', 'â' : 'a', 'ã' : 'a', 'ä' : 'a', 'å' : 'a', 'æ' : 'a', 'ç' : 'c', 'è' : 'e', 'é' : 'e', 'ê' : 'e', 'ë' : 'e', 'ì' : 'i', 'í' : 'i', 'î' : 'i', 'ï' : 'i', 'ð' : 'o', 'ñ' : 'n', 'ò' : 'o', 'ó' : 'o', 'ô' : 'o', 'õ' : 'o', 'ö' : 'o', 'ø' : 'o', 'ù' : 'u', 'ú' : 'u', 'û' : 'u', 'ý' : 'y', 'þ' : 'b', 'ÿ' : 'y', 'Ŕ' : 'R', 'ŕ' : 'r'}
    for key, value in str_map.items():
        note = note.replace(key, value)
    asciidata = unicodedata.normalize('NFKD', note).encode('ascii', 'ignore')
    return asciidata.decode('UTF-8')
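
A more compact variant of the same idea, offered as a sketch rather than a drop-in replacement: let NFKD handle every letter that decomposes into a base letter plus a combining mark, and keep a table only for the letters that do not. The EXTRA table and the to_ascii name are hypothetical, and the replacements chosen (for example 'ss' for ß) are assumptions that depend on the language.

```python
import unicodedata

# Hypothetical replacement table for letters that NFKD cannot decompose.
# It is deliberately short; extend it for the languages you need.
EXTRA = str.maketrans(
    {"ß": "ss", "ø": "o", "Ø": "O", "đ": "d", "Đ": "D", "æ": "ae", "Æ": "AE"}
)


def to_ascii(text):
    """Transliterate text to ASCII, dropping whatever has no mapping."""
    text = text.translate(EXTRA)
    decomposed = unicodedata.normalize("NFKD", text)
    # Remove combining marks (category Mn) left behind by the decomposition.
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return stripped.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Fuß, Rødgrød, Žižek"))   # Fuss, Rodgrod, Zizek
print(repr(to_ascii("日本語 ok")))        # ' ok'
```

The second call shows the limit of any such table: text in scripts without a mapping is removed entirely.
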
Rooker answered 8/6, 2022 at 10:12 Comment(0)
L
1

I have made the following function which lets you control what to keep according to the General_Category_Values in Unicode (https://www.unicode.org/reports/tr44/#General_Category_Values)

def FormatToNameList(name_str):
    import unicodedata
    clean_str = ''
    for c in name_str:
        if unicodedata.category(c) in ['Lu','Ll']:
            clean_str += c.lower()
            print('normal letter: ',c)
        elif unicodedata.category(c) in ['Lt','Lm','Lo']:
            clean_str += c
            print('special letter: ',c)
        elif unicodedata.category(c) in ['Nd']:
            clean_str += c
            print('normal number: ',c)
        elif unicodedata.category(c) in ['Nl','No']:
            clean_str += c
            print('special number: ',c)
        elif unicodedata.category(c) in ['Cc','Sm','Zs','Zl','Zp','Pc','Pd','Ps','Pe','Pi','Pf','Po']:
            clean_str += ' '
            print('space or symbol: ',c)
        else:
            print('other: ',' : ',c,' unicodedata.category: ',unicodedata.category(c))    
    name_list = clean_str.split(' ')
    return clean_str, name_list
if __name__ == '__main__':
     u = 'some3^?"Weirdstr '+ chr(231) + chr(0x0af4)
     [clean_str, name_list] = FormatToNameList(u)
     print(clean_str)
     print(name_list)
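
To see which branch a given character would take, it can help to print its general category directly. A short sketch with a few sample characters (the sample list is arbitrary):

```python
import unicodedata

samples = ["A", "a", "3", " ", "?", "^", "£", "ç"]
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
# 'A' Lu
# 'a' Ll
# '3' Nd
# ' ' Zs
# '?' Po
# '^' Sk
# '£' Sc
# 'ç' Ll
```

Note that Sk and Sc are not in any of the lists used by the function above, so characters such as '^' and '£' end up in the final else branch and are dropped rather than replaced by a space.
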

See also https://docs.python.org/3/howto/unicode.html

Lemus answered 30/6, 2022 at 12:38 Comment(0)
