Convert a Unicode string to a string in Python (containing extra symbols)

Asked 30/7, 2009 at 15:41 Answered 30/6, 2022 at 12:38

Solved string unicode type-conversion python-2.x

546

How do you convert a Unicode string (containing extra characters like £, $, etc.) into a Python string?

Traver answered 30/7, 2009 at 15:41 Comment(14)

What do you mean by "a python string"? Do you want to encode the unicode string? – Therewith 30/7, 2009 at 15:48

I'm getting Unicode sent from a form on an HTML page, with symbols I want to be able to save to a file, but it's not working – Traver 30/7, 2009 at 15:57

We need to know what Python version you are using, and what it is that you are calling a Unicode string. Do the following on a short unicode_string that includes the currency symbols that are causing the bother: Python 2.x : print type(unicode_string), repr(unicode_string) Python 3.x : print(type(unicode_string), ascii(unicode_string)) Then edit your question and copy/paste the results of the above print statement. DON'T retype the results. Also look up near the top of your HTML and see if you can find something like this: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859 – Thomas 30/7, 2009 at 16:13

I doubt that you get unicode from a web request. You probably get UTF-8 encoded Unicode. – Santoyo 30/7, 2009 at 16:15

The charset is currently at charset=utf-8 – Traver 31/7, 2009 at 7:3

@lutz: how exactly is "UTF-8 encoded Unicode" not unicode? – Spectrophotometer 3/6, 2011 at 10:9

You should really clarify what you mean by unicode string and python string (giving concrete examples would be best, I guess), as it's clear from the comments that there are different interpretations of your question. I wonder why you haven't done this, although it's been over 3.5 years since you asked this question. – Presumably 21/1, 2013 at 12:45

@jalf: If it is encoded; it is no longer Unicode e.g.,

unicode_string = u"I'm unicode string"; bytestring = unicode_string.encode('utf-8'); unicode_again = bytestring.decode('utf-8')

– Maxima 21/12, 2013 at 1:47

@J.F.Sebastian: You mean "it is not of the Python Unicode string datatype" (which goes without saying, because what you receive over a network socket from an HTTP request is a stream of bytes, and not a Python value), but UTF-8 text most certainly is Unicode. That is kind of the entire point of the UTF-8 encoding. – Spectrophotometer 21/12, 2013 at 10:38

@jalf: utf-8 is a character encoding. You can use it to interpret a sequence of bytes as text (sequence of Unicode codepoints -- that you may call Unicode text (it has nothing to do with Python)). Sequence of bytes itself is not a Unicode string. – Maxima 21/12, 2013 at 10:57

@J.F.Sebastian But we are not talking about "a sequence of bytes itself". We are talking about a string encoded as UTF-8. There is no possible way in which "a string encoded as UTF-8" is not a Unicode string, because UTF-8 is a Unicode encoding. It does not encode cars, sunsets, emotions or waffles. It encodes Unicode text. A text encoded as UTF-8 is a Unicode text. I am simply reacting to your incorrect statement that "a string which is encoded is no longer Unicode". – Spectrophotometer 21/12, 2013 at 14:1

@wnys (plus encoding rot-13): Let's check whether an encoded string is the same as original. fyi, wnys is jalf encoded using rot-13 encoding. – Maxima 21/12, 2013 at 16:26

Hopefully future passers-by come to understand that when you say something is "encoded" you are saying "it's not what it actually is, it's a representation of another thing in a form that we can handle with specific restrictions." E.g. using UTF-8 so that C string handling utilities "work," despite C not knowing anything of Unicode or UTF. – Kalamazoo 17/9, 2015 at 23:20

Retagged this as a 2.x question because it is incoherent in 3.x: "a unicode string" is "a Python string" in every possible meaningful sense in 3.x. (In 2.x, str means a bytes type that is not a real string type, but really Unicode strings are still "Python strings"...). – Mu 24/5, 2023 at 5:45

627

See unicodedata.normalize

title = u"Klüft skräms inför på fédéral électoral große"
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
'Kluft skrams infor pa federal electoral groe'
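
For readers on Python 3, here is a minimal sketch of the same call. The variable names are only illustrative, and the text is written with escapes so the snippet does not depend on the source file's encoding. It also makes the lossy behaviour raised in the comments visible: a character with no decomposition, such as ß, is silently dropped.

```python
import unicodedata

# Same sentence as above, spelled with escapes.
title = "Kl\u00fcft skr\u00e4ms inf\u00f6r p\u00e5 f\u00e9d\u00e9ral \u00e9lectoral gro\u00dfe"

# NFKD splits each accented letter into a base letter plus combining marks.
decomposed = unicodedata.normalize("NFKD", title)

# Encoding to ASCII with errors="ignore" then drops the combining marks
# and every other non-ASCII character (including the sharp s).
ascii_bytes = decomposed.encode("ascii", "ignore")
ascii_text = ascii_bytes.decode("ascii")
print(ascii_text)
```

In Python 3 the result of encode() is a bytes object, so the final decode("ascii") is what gives you a str again.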

Douceur answered 30/7, 2009 at 15:44 Comment(23)

+1 answers the question as worded, @williamtroup's problem of not being able to save unicode to a file sounds like an entirely different issue worthy of a separate question – Rugg 30/7, 2009 at 16:3

@John - that answer predates the OP's clarification. – Dominican 30/7, 2009 at 16:16

@Mark Roddy: His question as written is how to convert a "Unicode string" (whatever he means by that) containing some currency symbols to a "Python string" (whatever ...) and you think that a remove-some-diacritics delete-other-non-ascii characters kludge answers his question??? – Thomas 30/7, 2009 at 16:25

@Dominic: I'm very sorry; I'll rephrase that: The OP's unclarified question said he wanted to CONVERT it TO A PYTHON STRING, not mangle it. – Thomas 30/7, 2009 at 17:19

Note that normalize() does not handle Unicode punctuation (e.g., smart quotes, apostrophes, dashes), probably because punctuation characters are not composite characters. There's a good discussion and alternate solution here: https://mcmap.net/q/74668/-where-is-python-39-s-quot-best-ascii-for-this-unicode-quot-database-closed – Strophe 17/7, 2012 at 16:33

This answer, as it is, without any reservations is plainly WRONG as hinted by @JohnMachin. Please consider VOTING IT DOWN. – Presumably 21/1, 2013 at 12:51

@JohnMachin This answers the question word for word: The only way to convert a unicode string to a str is to either drop or convert the characters that cannot be represented in ASCII. So +1 from me. – Vomiturition 14/10, 2013 at 21:45

@PiotrDobrogost See my prior comment as well. – Vomiturition 14/10, 2013 at 21:45

@lzkata: no, it is not. type(title) == unicode and type(title.encode('utf-8')) == str. No need to corrupt the input, to get a bytestring that can be saved to a file. – Maxima 21/12, 2013 at 1:53

Why does this have so many upvotes? And why is it the accepted answer? It's a good way to strip diacritics from Latin text, which has its uses (implementing a semi-naïve search feature, for example), but it is NOT what the OP was asking. – Oblivious 27/3, 2014 at 4:4

this is an utter embarrassment. please do not arbitrarily destroy parts of characters in foreign languages. (this will completely remove any CJK text, for example.) fix whatever broken system is choking on them in the first place. – Overblouse 27/8, 2015 at 7:58

@J.F.Sebastian You can save a not-encoded string to a file as well, presumably, but the problem then becomes one of retrieving it from that file. Without a standardized mechanism to interpret that file (e.g. XML with a designated character encoding) then all bets are off. These days you can assume UTF-8 but isn't assuming things what gave us 8-bit chars in the first place? – Kalamazoo 17/9, 2015 at 23:24

What is NFKD (passed as the first argument to unicodedata.normalize)? – Modena 25/5, 2016 at 22:16

@Modena docs.python.org/2/library/… – Skelp 30/5, 2016 at 22:17

Well, I "missearched" and ended up here, with the exact code I want to do, so ... – Erund 21/7, 2016 at 23:11

Please do not use this code! Completely deleting characters like a German "ß" is in no way converting. This code "converts" Fuß to Fu or groß to gro where neither Fu nor gro have any meaning in German. The same holds true for other languages where Rødgrød becomes Rdgrd. – Cognate 11/5, 2018 at 9:58

This answer indeed doesn't answer the question. – Chestonchest 30/8, 2018 at 12:50

@Maxima , when I title.encode('utf-8') this: u"Klüft skräms inför på fédéral électoral große", becomes this: "Kl├╝ft skr├ñms inf├╢r p├Ñ f├⌐d├⌐ral ├⌐lectoral gro├ƒe" ; is that what you meant by "bytestring"? It seems "mangled" to me; did I do something wrong? Is this the expected result? – Sazerac 3/4, 2019 at 22:4

@TheRedPea you did wrong. The result is mojibake. To write Unicode text to a file in Python https://mcmap.net/q/74669/-writing-unicode-text-to-a-text-file – Maxima 4/4, 2019 at 1:18

@Maxima thank you. But I'm not interested in writing to file. OP doesn't mention writing to file. Your linked answer shows how to write a unicode object directly to file. But this question is about strings, not files; I'm referring to your earlier comment type(title.encode('utf-8')) == str can you print the result for me when you run this in Python 2.# ? u"Klüft skräms inför på fédéral électoral große".encode('utf-8') The type(..) will be str as you say, but what's the result of encode? You said "No need to corrupt the input"; how can I avoid mojibake/ corrupting input? – Sazerac 4/4, 2019 at 16:0

I guess doing encode, i.e. u"Klüft".encode('utf-8') as @Maxima suggested will replace the ü ; the result is this string: 'Kl\xc3\xbcft'. Image here. Printing the latter string, will appear as mojibake. Decoding the latter string will go back to a unicode object: u"Klüft". Same is shown in answers below. I think this is what @Vomiturition said: "convert the characters that cannot be represented in ASCII." i.e. u'ü' -> '\xc3\xbc' – Sazerac 4/4, 2019 at 16:16

@TheRedPea the mojibake in your comment indicates that you did write the text (unicode type on Python 2) encoded to bytes (str type on Python 2) using one character encoding to a file and then read the file using a different character encoding. If you want to print text, use unicode, don't encode to bytes prematurely (the point of the linked answer). If you want to use bytes to represent text (you shouldn't), use the same encoding for writing&reading (e.g., see sys.stdout.encoding if it is set — yes, sys.stdout is a file and you use it when you print). – Maxima 4/4, 2019 at 17:14

"(e.g., see sys.stdout.encoding if it is set — yes, sys.stdout is a file and you use it when you print)." great, thanks – Sazerac 4/4, 2019 at 17:24

341

You can use encode to ASCII if you don't need to translate the non-ASCII characters:

>>> a=u"aaaàçççñññ"
>>> type(a)
<type 'unicode'>
>>> a.encode('ascii','ignore')
'aaa'
>>> a.encode('ascii','replace')
'aaa???????'
>>>
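
A rough Python 3 sketch of the same idea, comparing several of the standard error handlers side by side. The names text, handlers and results are only for illustration.

```python
# "aaa" followed by a-grave, three c-cedillas and three n-tildes.
text = "aaa\u00e0\u00e7\u00e7\u00e7\u00f1\u00f1\u00f1"

# Error handlers that str.encode() accepts for characters the codec cannot represent.
handlers = ["ignore", "replace", "xmlcharrefreplace", "backslashreplace"]
results = {name: text.encode("ascii", name) for name in handlers}

for name, value in results.items():
    print(name, value)

# The default handler, "strict", raises instead of silently losing data.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print("strict raised:", strict_failed)
```

Only xmlcharrefreplace and backslashreplace keep enough information to recover the original characters later; ignore and replace are one-way.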

Overseas answered 31/7, 2009 at 7:13 Comment(4)

Awesome answer. Exactly what I needed. Also, great presentation to show the effect of ignore vs replace – Lexicologist 11/4, 2017 at 12:19

or a.encode('ascii', 'xmlcharrefreplace') gives 'aaa&#224;&#231;&#231;&#231;&#241;&#241;&#241;'. – Particularity 10/4, 2019 at 17:22

type(a) is str in Python 3.6.8 and doesn't have any encode() method. – Omen 24/8, 2019 at 10:16

python statement: a.encode('ascii','ignore') result: b'aaa' – Ceratodus 21/5, 2021 at 3:38

158

>>> text=u'abcd'
>>> str(text)
'abcd'

This works only if the string contains nothing but ASCII characters.
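
If you are on Python 3 and want the same "ASCII only, otherwise fail loudly" behaviour, something along these lines might do. Note that to_ascii_bytes is a hypothetical helper name made up for this sketch, not a standard function.

```python
def to_ascii_bytes(text):
    """Return text as ASCII bytes, or raise ValueError if that would lose data.

    Hypothetical helper, shown only to illustrate the strict behaviour.
    """
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that could not be encoded.
        raise ValueError("non-ASCII character at index %d" % exc.start) from exc


print(to_ascii_bytes("abcd"))

try:
    to_ascii_bytes("\u00a310")  # pound sign followed by "10"
except ValueError as err:
    print("refused:", err)
```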

Sergu answered 25/10, 2012 at 16:27 Comment(3)

This would only work on windows. And will break if there are non-ascii symbols. – Microphone 30/7, 2013 at 10:50

This breaks if the content of the string is actually unicode, not just ascii characters in a unicode string. Don't do this, you'll get random UnicodeEncodeError exceptions all over the place. – Fallal 9/10, 2013 at 7:31

This answer helped me. If you know that your string is ascii and you need to cast it back to a non-unicode string, this is very useful. – Chenault 16/10, 2014 at 16:4

122

If you have a Unicode string, and you want to write this to a file, or other serialised form, you must first encode it into a particular representation that can be stored. There are several common Unicode encodings, such as UTF-16 (uses two bytes for most Unicode characters) or UTF-8 (1-4 bytes / codepoint depending on the character), etc. To convert that string into a particular encoding, you can use:

>>> s = u'£10'
>>> s.encode('utf8')
'\xc2\xa310'
>>> s.encode('utf16')
'\xff\xfe\xa3\x001\x000\x00'

This raw string of bytes can be written to a file. However, note that when reading it back, you must know what encoding it is in and decode it using that same encoding.
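
To illustrate the point about decoding with the same encoding, here is a small Python 3 sketch. The variable names are illustrative, and utf-16-le is used instead of plain utf-16 only so that the bytes do not depend on the machine's byte order.

```python
s = "\u00a310"  # pound sign followed by "10"

utf8_bytes = s.encode("utf-8")
utf16_bytes = s.encode("utf-16-le")  # explicit byte order, no byte order mark

# Decoding with the encoding that was used for writing restores the text.
round_trip_utf8 = utf8_bytes.decode("utf-8")
round_trip_utf16 = utf16_bytes.decode("utf-16-le")

# Decoding with a different encoding does not fail here; it just yields mojibake.
wrong = utf8_bytes.decode("latin-1")

print(utf8_bytes, utf16_bytes)
print(round_trip_utf8 == s, round_trip_utf16 == s)
print(repr(wrong))
```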

When writing to files, you can get rid of this manual encode/decode process by using the codecs module. So, to open a file that encodes all Unicode strings into UTF-8, use:

import codecs
f = codecs.open('path/to/file.txt','w','utf8')
f.write(my_unicode_string)  # Stored on disk as UTF-8

Do note that anything else that is using these files must understand what encoding the file is in if they want to read them. If you are the only one doing the reading/writing this isn't a problem, otherwise make sure that you write in a form understandable by whatever else uses the files.

In Python 3, this form of file access is the default, and the built-in open function will take an encoding parameter and always translate to/from Unicode strings (the default string object in Python 3) for files opened in text mode.
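
A minimal Python 3 sketch of that last paragraph, writing to a throwaway temporary file (the file name prices.txt is just an example):

```python
import os
import tempfile

text = "\u00a310 and \u20ac20"  # pound and euro signs

# A throwaway path; any writable location would do.
path = os.path.join(tempfile.mkdtemp(), "prices.txt")

# Text mode with an explicit encoding: str goes in, UTF-8 bytes land on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading with the same encoding gives the original str back.
with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

# Binary mode shows the bytes that were actually stored.
with open(path, "rb") as f:
    raw = f.read()

print(read_back == text, raw)
```

Passing encoding explicitly avoids depending on the platform default, which is not UTF-8 everywhere.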

Carousel answered 30/7, 2009 at 16:44 Comment(0)

Here is an example:

>>> u = u'€€€'
>>> s = u.encode('utf8')
>>> s
'\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'

Abukir answered 30/7, 2009 at 15:46 Comment(1)

Can anyone explain why, when I encode the Euro symbol to utf8 as shown here, the result is only question marks? Here is an image of my Python, version 2.7.13. (I can encode other unicode objects like u"Klüft", but not the Euros?) – Sazerac 4/4, 2019 at 16:20

In my case, the file contained unicode-escaped strings:

line = "\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0437\\u0430\\u0446\\u0438\\u044f .....\","

My solution was:

f  = open("file-json.log", encoding="utf-8")
qq = f.readline() 

print(qq)                          
# {"log":\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0440\\u0438\\u0437\\u0430\\u0446\\u0438\\u044f \\u043f\\u043e\\u043b\\u044c\\u0437\\u043e\\u0432\\u0430\\u0442\\u0435\\u043b\\u044f\"}

print(qq.encode().decode("unicode-escape").encode().decode("unicode-escape")) 
# '{"log":"message": "Авторизация пользователя"}\n'
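
A stripped-down sketch of the same technique on a short string, plus one caveat I am fairly sure of: the unicode_escape codec treats any byte that is not part of an escape as Latin-1, so it is only safe when the escaped text is pure ASCII. The variable names are illustrative.

```python
# Literal backslash-u sequences, as they might appear in a log file.
escaped = r"\u0410\u0432\u0442\u043e"

# encode("ascii") fails loudly if the input is not pure ASCII,
# which is safer than letting the codec mangle it.
decoded = escaped.encode("ascii").decode("unicode_escape")
print(decoded)

# Caveat: real non-ASCII characters do not survive a UTF-8 round trip
# through unicode_escape; each UTF-8 byte is read as a Latin-1 character.
mangled = "\u00e9".encode("utf-8").decode("unicode_escape")
print(repr(mangled))
```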

Vardhamana answered 28/11, 2019 at 13:9 Comment(1)

it worked even if i only use: result.encode().decode('unicode-escape') – Whitehall 15/1, 2020 at 2:33

Well, if you're willing/ready to switch to Python 3 (which you may not be due to the backwards incompatibility with some Python 2 code), you don't have to do any converting; all text in Python 3 is represented with Unicode strings, which also means that there's no more usage of the u'<text>' syntax. You also have what are, in effect, strings of bytes, which are used to represent data (which may be an encoded string).

http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

(Of course, if you're currently using Python 3, then the problem is likely something to do with how you're attempting to save the text to a file.)
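
A short Python 3 sketch of the text versus data split described above (names are illustrative):

```python
text = "\u00a310"             # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: one possible encoded form of that text

print(type(text).__name__, len(text))  # counts characters
print(type(data).__name__, len(data))  # counts bytes

# Going back from data to text requires naming the encoding again.
restored = data.decode("utf-8")
print(restored == text)
```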

Telephony answered 30/7, 2009 at 16:9 Comment(4)

In Python 3 strings are Unicode strings. They are never encoded. I found the following text useful: joelonsoftware.com/articles/Unicode.html – Santoyo 30/7, 2009 at 16:14

He wants to save it to a file; how does your answer help with that? – Thomas 30/7, 2009 at 16:15

@lutz: Right, I'd forgotten that Unicode is a character map rather than an encoding. @John: There isn't enough information at the moment to know what the problem with saving it is. Is he getting an error? Is he not getting any errors, but when opening the file externally he gets mojibake? Without that information, there are far too many possible solutions that could be provided. – Telephony 30/7, 2009 at 16:24

@Cat: There isn't any information at the moment to know what he's got, let alone what his saving problem is. I've asked him to provide some facts -- see my answer. – Thomas 30/7, 2009 at 16:35

There is a library called ftfy that can help with Unicode issues. It has made my life easier.

Example 1

import ftfy
print(ftfy.fix_text('uÌˆnicode'))

output -->
ünicode

Example 2 - UTF-8

import ftfy
print(ftfy.fix_text('\xe2\x80\xa2'))

output -->
•

Example 3 - Unicode code point

import ftfy
print(ftfy.fix_text(u'\u2026'))

output -->
…

https://ftfy.readthedocs.io/en/latest/

pip install ftfy

https://pypi.org/project/ftfy/
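
For comparison, the kind of repair shown in Example 2 can be done by hand with the standard library when you already know which wrong decoding happened. This sketch does not use ftfy and is not a claim about how ftfy works internally; it assumes the text was UTF-8 bytes mistakenly decoded as Latin-1.

```python
# The same three characters as '\xe2\x80\xa2' in Example 2.
garbled = "\u00e2\u0080\u00a2"

# Undo the wrong decoding (Latin-1), then apply the right one (UTF-8).
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

The manual version only works when the guess about the wrong encoding is right, which is the part a library can try to detect for you.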

Adlib answered 16/11, 2020 at 14:10 Comment(0)

Here is some example code:

import unicodedata    
raw_text = u"here $%6757 dfgdfg"
convert_text = unicodedata.normalize('NFKD', raw_text).encode('ascii','ignore')

Chapple answered 19/12, 2016 at 7:59 Comment(1)

How is this answer different from the accepted answer? – Agape 30/6, 2018 at 9:51

No answer worked for my case, where I had a string variable containing Unicode escapes, and none of the encode/decode approaches explained here did the job.

If I do in a Terminal

echo "no me llama mucho la atenci\u00f3n"

python3
>>> print("no me llama mucho la atenci\u00f3n")

The output is correct:

output: no me llama mucho la atención

But working with scripts loading this string variable didn't work.

This is what worked in my case, in case it helps anybody:

import json
string_to_convert = "no me llama mucho la atenci\u00f3n"
print(json.dumps(json.loads(r'"%s"' % string_to_convert), ensure_ascii=False))
output: "no me llama mucho la atención"
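
A sketch of the situation where this trick actually matters, as far as I can tell: the variable holds a literal backslash followed by u00f3 (for example because it was read from a file), rather than an already-interpreted escape. The raw-string literal below simulates that; the names are illustrative.

```python
import json

# Raw string: the backslash is a real character here, not an escape.
raw = r"no me llama mucho la atenci\u00f3n"

# Wrapping the text in double quotes makes it a JSON string literal,
# and json.loads then interprets the \uXXXX escape.
decoded = json.loads('"%s"' % raw)
print(decoded)

# json.dumps goes the other way and always adds the surrounding quotes.
print(json.dumps(decoded, ensure_ascii=False))
print(json.dumps(decoded))
```

This breaks if the text itself contains an unescaped double quote or other characters that are not valid inside a JSON string.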

Junoesque answered 5/11, 2019 at 20:40 Comment(1)

you need to import json – Junoesque 5/11, 2019 at 20:41

This is my function

import unicodedata
def unicode_to_ascii(note):
    str_map = {'Š' : 'S', 'š' : 's', 'Đ' : 'D', 'đ' : 'd', 'Ž' : 'Z', 'ž' : 'z', 'Č' : 'C', 'č' : 'c', 'Ć' : 'C', 'ć' : 'c', 'À' : 'A', 'Á' : 'A', 'Â' : 'A', 'Ã' : 'A', 'Ä' : 'A', 'Å' : 'A', 'Æ' : 'A', 'Ç' : 'C', 'È' : 'E', 'É' : 'E', 'Ê' : 'E', 'Ë' : 'E', 'Ì' : 'I', 'Í' : 'I', 'Î' : 'I', 'Ï' : 'I', 'Ñ' : 'N', 'Ò' : 'O', 'Ó' : 'O', 'Ô' : 'O', 'Õ' : 'O', 'Ö' : 'O', 'Ø' : 'O', 'Ù' : 'U', 'Ú' : 'U', 'Û' : 'U', 'Ü' : 'U', 'Ý' : 'Y', 'Þ' : 'B', 'ß' : 'Ss', 'à' : 'a', 'á' : 'a', 'â' : 'a', 'ã' : 'a', 'ä' : 'a', 'å' : 'a', 'æ' : 'a', 'ç' : 'c', 'è' : 'e', 'é' : 'e', 'ê' : 'e', 'ë' : 'e', 'ì' : 'i', 'í' : 'i', 'î' : 'i', 'ï' : 'i', 'ð' : 'o', 'ñ' : 'n', 'ò' : 'o', 'ó' : 'o', 'ô' : 'o', 'õ' : 'o', 'ö' : 'o', 'ø' : 'o', 'ù' : 'u', 'ú' : 'u', 'û' : 'u', 'ý' : 'y', 'ý' : 'y', 'þ' : 'b', 'ÿ' : 'y', 'Ŕ' : 'R', 'ŕ' : 'r'}
    for key, value in str_map.items():
        note = note.replace(key, value)
    asciidata = unicodedata.normalize('NFKD', note).encode('ascii', 'ignore')
    return asciidata.decode('UTF-8')

Rooker answered 8/6, 2022 at 10:12 Comment(0)

I have made the following function which lets you control what to keep according to the General_Category_Values in Unicode (https://www.unicode.org/reports/tr44/#General_Category_Values)

def FormatToNameList(name_str):
    import unicodedata
    clean_str = ''
    for c in name_str:
        if unicodedata.category(c) in ['Lu','Ll']:
            clean_str += c.lower()
            print('normal letter: ',c)
        elif unicodedata.category(c) in ['Lt','Lm','Lo']:
            clean_str += c
            print('special letter: ',c)
        elif unicodedata.category(c) in ['Nd']:
            clean_str += c
            print('normal number: ',c)
        elif unicodedata.category(c) in ['Nl','No']:
            clean_str += c
            print('special number: ',c)
        elif unicodedata.category(c) in ['Cc','Sm','Zs','Zl','Zp','Pc','Pd','Ps','Pe','Pi','Pf','Po']:
            clean_str += ' '
            print('space or symbol: ',c)
        else:
            print('other: ',' : ',c,' unicodedata.category: ',unicodedata.category(c))    
    name_list = clean_str.split(' ')
    return clean_str, name_list

if __name__ == '__main__':
    u = 'some3^?"Weirdstr ' + chr(231) + chr(0x0af4)
    [clean_str, name_list] = FormatToNameList(u)
    print(clean_str)
    print(name_list)

Lemus answered 30/6, 2022 at 12:38 Comment(0)
