How do you convert a Unicode string (containing extra characters like £ $, etc.) into a Python string?
title = u"Klüft skräms inför på fédéral électoral große"
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
'Kluft skrams infor pa federal electoral groe'
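For readers on Python 3, the same idea might be sketched as below. This is only an illustration under stated assumptions: the helper name strip_to_ascii is made up here and is not part of the original answer. Note that ß has no NFKD decomposition, so it is dropped rather than transliterated.

```python
import unicodedata


def strip_to_ascii(text):
    """Decompose accented characters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    # encode() yields bytes in Python 3; decode again to get a str back.
    return decomposed.encode("ascii", "ignore").decode("ascii")


title = "Klüft skräms inför på fédéral électoral große"
print(strip_to_ascii(title))
```

Because characters without a decomposition simply vanish, this approach is lossy and is best kept for cases such as slugs or search keys.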
unicode string to a str is to either drop or convert the characters that cannot be represented in ASCII. So +1 from me. – Vomiturition
type(title) == unicode and type(title.encode('utf-8')) == str. No need to corrupt the input to get a bytestring that can be saved to a file. – Maxima
NFKD (passed as the first argument to unicodedata.normalize)? – Modena
Fuß to Fu or groß to gro, where neither Fu nor gro has any meaning in German. The same holds true for other languages, where Rødgrød becomes Rdgrd. – Cognate
title.encode('utf-8') this: u"Klüft skräms inför på fédéral électoral große", becomes this: "Klüft skräms inför på fédéral électoral große"; is that what you meant by "bytestring"? It seems "mangled" to me; did I do something wrong? Is this the expected result? – Sazerac
type(title.encode('utf-8')) == str can you print the result for me when you run this in Python 2.# ? u"Klüft skräms inför på fédéral électoral große".encode('utf-8') The type(..) will be str as you say, but what's the result of encode? You said "No need to corrupt the input"; how can I avoid mojibake/corrupting input? – Sazerac
encode, i.e. u"Klüft".encode('utf-8') as @Maxima suggested will replace the ü; the result is this string: 'Kl\xc3\xbcft'. Printing the latter string will appear as mojibake. Decoding the latter string will go back to a unicode object: u"Klüft". The same is shown in the answers below. I think this is what @Vomiturition said: "convert the characters that cannot be represented in ASCII", i.e. u'ü' -> '\xc3\xbc' – Sazerac
sys.stdout.encoding if it is set; yes, sys.stdout is a file and you use it when you print). – Maxima
You can use encode to ASCII if you don't need to translate the non-ASCII characters:
>>> a=u"aaaàçççñññ"
>>> type(a)
<type 'unicode'>
>>> a.encode('ascii','ignore')
'aaa'
>>> a.encode('ascii','replace')
'aaa???????'
>>>
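As a rough Python 3 counterpart to the session above, the sketch below runs the same string through several of the standard error handlers. The dictionary name results is chosen here for illustration; in Python 3, encode() returns bytes, so the values carry a b prefix.

```python
# Same text as u"aaaàçççñññ", written with escapes so the source stays ASCII.
a = "aaa\xe0\xe7\xe7\xe7\xf1\xf1\xf1"

handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")

# Map each error handler name to the bytes it produces for ASCII encoding.
results = {handler: a.encode("ascii", handler) for handler in handlers}

for handler, encoded in results.items():
    print(handler, encoded)
```

The last two handlers keep enough information to recover the original characters later, which ignore and replace do not.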
ignore vs replace – Lexicologist
a.encode('ascii', 'xmlcharrefreplace') gives 'aaa&#224;&#231;&#231;&#231;&#241;&#241;&#241;'. – Particularity
type(a) is str in Python 3.6.8 and doesn't have any encode() method. – Omen
>>> text=u'abcd'
>>> str(text)
'abcd'
If the string only contains ascii characters.
If you have a Unicode string, and you want to write this to a file, or other serialised form, you must first encode it into a particular representation that can be stored. There are several common Unicode encodings, such as UTF-16 (uses two bytes for most Unicode characters) or UTF-8 (1-4 bytes / codepoint depending on the character), etc. To convert that string into a particular encoding, you can use:
>>> s= u'£10'
>>> s.encode('utf8')
'\xc2\xa310'
>>> s.encode('utf16')
'\xff\xfe\xa3\x001\x000\x00'
This raw string of bytes can be written to a file. However, note that when reading it back, you must know what encoding it is in and decode it using that same encoding.
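To make the "same encoding on both sides" point concrete, here is a small Python 3 sketch. The variable names are illustrative only, and utf-16-le is used instead of utf16 purely so that no byte order mark is added.

```python
text = "\xa310"  # "£10"; U+00A3 is the pound sign

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # explicit byte order, no BOM

# Decoding with the matching codec restores the original text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong codec does not fail here; it silently yields mojibake.
mojibake = utf8_bytes.decode("latin-1")

# ascii() keeps the output printable on any console encoding.
print(utf8_bytes, utf16_bytes, ascii(round_trip), ascii(mojibake))
```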
When writing to files, you can get rid of this manual encode/decode process by using the codecs module. So, to open a file that encodes all Unicode strings into UTF-8, use:
import codecs
f = codecs.open('path/to/file.txt','w','utf8')
f.write(my_unicode_string) # Stored on disk as UTF-8
Do note that anything else that is using these files must understand what encoding the file is in if they want to read them. If you are the only one doing the reading/writing this isn't a problem, otherwise make sure that you write in a form understandable by whatever else uses the files.
In Python 3, this form of file access is the default, and the built-in open function will take an encoding parameter and always translate to/from Unicode strings (the default string object in Python 3) for files opened in text mode.
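A minimal Python 3 sketch of that behaviour follows. The temporary path stands in for 'path/to/file.txt' and the variable names are assumptions made for the example.

```python
import os
import tempfile

text = "\xa310 and \u20ac20"  # "£10 and €20"

# A temporary location is used so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write(text)  # encoded to UTF-8 on the way to disk

with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()  # decoded back to str on the way in

with open(path, "rb") as f:
    raw = f.read()  # binary mode returns the stored bytes untouched

print(read_back == text, raw)
```

Passing encoding explicitly is the safer habit, since the default otherwise depends on the platform and locale.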
Here is an example:
>>> u = u'€€€'
>>> s = u.encode('utf8')
>>> s
'\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'
utf8 as shown here, the result is only question marks? Here is an image of my Python, version 2.7.13. (I can encode other unicode objects like u"Klüft", but not the Euros?) – Sazerac
In my case, the file contained unicode-escaped strings:
line = "\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0437\\u0430\\u0446\\u0438\\u044f .....\","
My solution was:
f = open("file-json.log", encoding="utf-8")
qq = f.readline()
print(qq)
# {"log":\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0440\\u0438\\u0437\\u0430\\u0446\\u0438\\u044f \\u043f\\u043e\\u043b\\u044c\\u0437\\u043e\\u0432\\u0430\\u0442\\u0435\\u043b\\u044f\"}
print(qq.encode().decode("unicode-escape").encode().decode("unicode-escape"))
# '{"log":"message": "Авторизация пользователя"}\n'
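When the escaped text is valid JSON, letting the json module resolve the \uXXXX escapes may be a more robust option. The sketch below uses a hypothetical, shortened log line rather than the one from this answer, and also shows why unicode-escape needs care: it treats non-escape bytes as Latin-1, so text that already contains non-ASCII characters can come out mangled.

```python
import json

# Hypothetical log line in the same style: JSON containing \uXXXX escapes.
line = '{"message": "\\u0410\\u0432\\u0442\\u043e"}'

record = json.loads(line)  # the JSON parser resolves the escapes
message = record["message"]

# Caveat: unicode-escape decodes raw bytes as Latin-1, so UTF-8 input
# holding a real "é" turns into two wrong characters.
mangled = "\xe9".encode("utf-8").decode("unicode-escape")

print(ascii(message), ascii(mangled))
```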
result.encode().decode('unicode-escape') – Whitehall
Well, if you're willing/ready to switch to Python 3 (which you may not be due to the backwards incompatibility with some Python 2 code), you don't have to do any converting; all text in Python 3 is represented with Unicode strings, which also means that there's no more need for the u'<text>' syntax. You also have what are, in effect, strings of bytes, which are used to represent data (which may be an encoded string).
http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
(Of course, if you're currently using Python 3, then the problem is likely something to do with how you're attempting to save the text to a file.)
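The text versus data split can be seen in a few lines. This is a small sketch with illustrative variable names, assuming Python 3.3 or later (where the u prefix is accepted again but changes nothing).

```python
text = "\xa3 $"              # "£ $": a str holding three code points
data = text.encode("utf-8")  # bytes: the pound sign takes two bytes in UTF-8

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# The u prefix is tolerated in Python 3.3+ and produces the same str.
same = (u"abc" == "abc")

# Mixing the two types raises an error instead of converting implicitly.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(same, mixed_ok)
```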
There is a library called ftfy that can help with Unicode issues. It has made my life easier.
Example 1
import ftfy
print(ftfy.fix_text('ünicode'))
output -->
ünicode
Example 2 - UTF-8
import ftfy
print(ftfy.fix_text('\xe2\x80\xa2'))
output -->
•
Example 3 - Unicode code point
import ftfy
print(ftfy.fix_text(u'\u2026'))
output -->
…
pip install ftfy
Here is some example code:
import unicodedata
raw_text = u"here $%6757 dfgdfg"
convert_text = unicodedata.normalize('NFKD', raw_text).encode('ascii','ignore')
None of the answers worked for my case, where I had a string variable containing Unicode characters, and none of the encode/decode approaches explained here did the job.
If I do in a Terminal
echo "no me llama mucho la atenci\u00f3n"
or
python3
>>> print("no me llama mucho la atenci\u00f3n")
The output is correct:
output: no me llama mucho la atención
But working with scripts loading this string variable didn't work.
This is what worked in my case, in case it helps anybody:
import json
string_to_convert = "no me llama mucho la atenci\u00f3n"
print(json.dumps(json.loads(r'"%s"' % string_to_convert), ensure_ascii=False))
output: no me llama mucho la atención
This is my function
import unicodedata
def unicode_to_ascii(note):
    str_map = {'Š' : 'S', 'š' : 's', 'Đ' : 'D', 'đ' : 'd', 'Ž' : 'Z', 'ž' : 'z', 'Č' : 'C', 'č' : 'c', 'Ć' : 'C', 'ć' : 'c', 'À' : 'A', 'Á' : 'A', 'Â' : 'A', 'Ã' : 'A', 'Ä' : 'A', 'Å' : 'A', 'Æ' : 'A', 'Ç' : 'C', 'È' : 'E', 'É' : 'E', 'Ê' : 'E', 'Ë' : 'E', 'Ì' : 'I', 'Í' : 'I', 'Î' : 'I', 'Ï' : 'I', 'Ñ' : 'N', 'Ò' : 'O', 'Ó' : 'O', 'Ô' : 'O', 'Õ' : 'O', 'Ö' : 'O', 'Ø' : 'O', 'Ù' : 'U', 'Ú' : 'U', 'Û' : 'U', 'Ü' : 'U', 'Ý' : 'Y', 'Þ' : 'B', 'ß' : 'Ss', 'à' : 'a', 'á' : 'a', 'â' : 'a', 'ã' : 'a', 'ä' : 'a', 'å' : 'a', 'æ' : 'a', 'ç' : 'c', 'è' : 'e', 'é' : 'e', 'ê' : 'e', 'ë' : 'e', 'ì' : 'i', 'í' : 'i', 'î' : 'i', 'ï' : 'i', 'ð' : 'o', 'ñ' : 'n', 'ò' : 'o', 'ó' : 'o', 'ô' : 'o', 'õ' : 'o', 'ö' : 'o', 'ø' : 'o', 'ù' : 'u', 'ú' : 'u', 'û' : 'u', 'ý' : 'y', 'ý' : 'y', 'þ' : 'b', 'ÿ' : 'y', 'Ŕ' : 'R', 'ŕ' : 'r'}
    for key, value in str_map.items():
        note = note.replace(key, value)
    asciidata = unicodedata.normalize('NFKD', note).encode('ascii', 'ignore')
    return asciidata.decode('UTF-8')
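A more compact variant of the same idea might use str.translate for the characters that NFKD cannot decompose and leave the accented letters to the normalization step. This is a sketch only: the FALLBACKS table and the to_ascii name are made up here, and the table is deliberately tiny.

```python
import unicodedata

# Hypothetical, minimal table: only characters with no NFKD decomposition
# need an entry; extend it for the languages you actually handle.
FALLBACKS = str.maketrans({"ß": "ss", "Ø": "O", "ø": "o", "Đ": "D", "đ": "d"})


def to_ascii(text):
    """Transliterate known characters, then strip the remaining accents."""
    text = text.translate(FALLBACKS)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Fuß, große Rødgrød"))
```

A single translate call avoids one replace pass per table entry, and handling ß and ø explicitly addresses the Fuß and Rødgrød cases raised in the comments.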
I have made the following function which lets you control what to keep according to the General_Category_Values in Unicode (https://www.unicode.org/reports/tr44/#General_Category_Values)
def FormatToNameList(name_str):
    import unicodedata
    clean_str = ''
    for c in name_str:
        if unicodedata.category(c) in ['Lu','Ll']:
            clean_str += c.lower()
            print('normal letter: ',c)
        elif unicodedata.category(c) in ['Lt','Lm','Lo']:
            clean_str += c
            print('special letter: ',c)
        elif unicodedata.category(c) in ['Nd']:
            clean_str += c
            print('normal number: ',c)
        elif unicodedata.category(c) in ['Nl','No']:
            clean_str += c
            print('special number: ',c)
        elif unicodedata.category(c) in ['Cc','Sm','Zs','Zl','Zp','Pc','Pd','Ps','Pe','Pi','Pf','Po']:
            clean_str += ' '
            print('space or symbol: ',c)
        else:
            print('other: ',' : ',c,' unicodedata.category: ',unicodedata.category(c))
    name_list = clean_str.split(' ')
    return clean_str, name_list
if __name__ == '__main__':
    u = 'some3^?"Weirdstr '+ chr(231) + chr(0x0af4)
    [clean_str, name_list] = FormatToNameList(u)
    print(clean_str)
    print(name_list)
Python 2.x: print type(unicode_string), repr(unicode_string)
Python 3.x: print(type(unicode_string), ascii(unicode_string))
Then edit your question and copy/paste the results of the above print statement. DON'T retype the results. Also look up near the top of your HTML and see if you can find something like this: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859 – Thomas
unicode_string = u"I'm unicode string"; bytestring = unicode_string.encode('utf-8'); unicode_again = bytestring.decode('utf-8') – Maxima
wnys is jalf encoded using rot-13 encoding. – Maxima
str means a bytes type that is not a real string type, but really Unicode strings are still "Python strings"...). – Mu