Convert UTF-8 to string literals in Python

I have a string in UTF-8 format but I'm not sure how to convert it to its corresponding character literal. For example, I have the string:

My string is: 'Entre\xc3\xa9'

Example one:

This code:

u'Entre\xc3\xa9'.encode('latin-1').decode('utf-8')

returns the result: u'Entre\xe9'

If I then continue by printing this:

print u'Entre\xe9'

I get the result: Entreé

This is great and close to what I need. The problem is that I can't store 'Entre\xc3\xa9' in a variable and pass it through the same steps, because that breaks. Any tips for getting this working?

Example:

a = 'Entre\xc3\xa9'
b = 'u'+ a.encode('latin-1').decode('utf-8')
c= 'u'+ b

I would like the result of "c" to be:

Entreé
Samul answered 4/7, 2014 at 10:5 Comment(0)

The u'' syntax only works for string literals, e.g. defining values in source code. Using the syntax results in a unicode object being created, but that's not the only way to create such an object.

You cannot make a unicode value from a byte string by adding u in front of it. But if you call str.decode() with the right encoding, you get a unicode value. Conversely, you can encode unicode objects to byte strings with unicode.encode().
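As a rough sketch of that round trip, here is the same idea written for Python 3, which is an assumption on my part since the question uses Python 2. In Python 3 the byte string type is called bytes and the unicode type is called str; the variable names are only illustrative.

```python
# Python 3 sketch (assumption: the question itself targets Python 2).
# bytes plays the role of Python 2's str, and str that of Python 2's unicode.
raw = b'Entre\xc3\xa9'        # seven bytes of UTF-8 encoded data
text = raw.decode('utf-8')    # bytes -> text: six characters
back = text.encode('utf-8')   # text -> bytes again

print(len(raw), len(text))    # 7 6
print(back == raw)            # True
```

The two bytes \xc3\xa9 collapse into the single character U+00E9, which is why the lengths differ.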

Note that when displaying a unicode object, Python represents it by using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back into a Python interpreter and get an object with the same value.
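To illustrate the difference between a value and its representation, here is a small Python 3 sketch. It assumes Python 3, where repr() no longer escapes printable non-ASCII characters, so the built-in ascii() is used to get an escaped form comparable to what the Python 2 console shows. ast.literal_eval stands in for pasting the representation back into the interpreter.

```python
import ast

text = 'Entre\xe9'                  # one value, six characters

shown = ascii(text)                 # escaped representation, for debugging
print(shown)                        # 'Entre\xe9'

rebuilt = ast.literal_eval(shown)   # "paste" the representation back in
print(rebuilt == text)              # True: same value, nothing was converted
```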

Your a value is defined using a byte string literal, so you only need to decode:

a = 'Entre\xc3\xa9'
b = a.decode('utf8')

Your first example created a Mojibake, a Unicode string containing Latin-1 codepoints that actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake), then decode from UTF-8.
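The following Python 3 sketch (Python 3 is an assumption here, and the variable names are illustrative) shows one way such a Mojibake can arise and how the encode-then-decode step reverses it:

```python
# Python 3 sketch of how a Mojibake arises and how to undo it.
raw = b'Entre\xc3\xa9'              # correct UTF-8 bytes

# Decoding with the wrong codec maps every byte to one code point,
# so the two UTF-8 bytes become two separate Latin-1 characters.
mojibake = raw.decode('latin-1')
print(len(mojibake))                # 7

# Encoding to Latin-1 recovers the original bytes; decoding those
# as UTF-8 then yields the intended text.
fixed = mojibake.encode('latin-1').decode('utf-8')
print(len(fixed))                   # 6
print(fixed == 'Entre\xe9')         # True
```

This works because Latin-1 maps code points 0 to 255 one-to-one onto byte values, so no information is lost on the way back.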

You may want to read up on Python and Unicode in the Unicode HOWTO.

Johannesburg answered 4/7, 2014 at 10:9 Comment(2)
Many thanks! So now if I enter b into the Python interpreter, I get u'Entre\xe9'. If I enter print b, I get Entreé. Is it possible to have a string variable that will automatically return Entreé without using the print statement? - Samul
@user3804963: I think you are confusing the representation (u'Entre\xe9') with the value. print shows you the value (as encoded for your terminal), while your Python console shows you the representation (for debugging). No value change has taken place. Python is showing you a value that can be copied and pasted into your source code without having to declare a source code encoding beyond the default ASCII, so an escape sequence (\xe9) is shown for the U+00E9 Unicode codepoint. This is normal. - Johannesburg
>>> chr(0x24E1)
'ⓡ'
>>> chr(0x24E9)
'ⓩ'
>>> chr(0x24E7)
'ⓧ'

doc: https://docs.python.org/3/howto/unicode.html

Dynamometer answered 7/8 at 22:20 Comment(0)
