Python 3 - String with \xHH Hex Values to Unicode

I am trying to convert a string with characters that require multiple hex values like this:

'Mahou Shoujo Madoka\xe2\x98\x85Magica'

to its unicode representation:

'Mahou Shoujo Madoka★Magica'

When I print the string, it tries to evaluate each hex value separately, so by default I get this:

x = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'
print(x)

Mahou Shoujo MadokaâMagica

So I tried the approaches from some other Stack Overflow answers, such as "Best way to convert string to bytes in Python 3?":

x = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'
z = x.encode('utf-8')
print('z:', z)
y = z.decode('utf-8')
print('y:', y)

z: b'Mahou Shoujo Madoka\xc3\xa2\xc2\x98\xc2\x85Magica'
y: Mahou Shoujo MadokaâMagica

Python: Convert Unicode-Hex-String to Unicode:

import binascii

z = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'
x = binascii.unhexlify(binascii.hexlify(z.encode('utf-8'))).decode('utf-8')
print('x:', x)

x: Mahou Shoujo MadokaâMagica

And some others, but none of them worked. Most of the results I found were people who had a double backslash problem, but none of them had my exact problem.

What I notice is that when I call str.encode, it seems to add extra bytes to the result (compare the bytes in z with the escapes in x in the first attempt), and I'm not sure why.

So I tried typing the characters of the string into a bytes literal by hand:

x = b'Mahou Shoujo Madoka\xe2\x98\x85Magica'
x.decode('utf-8')

'Mahou Shoujo Madoka★Magica'

and it worked. But I couldn't find a way to convert a string to bytes literally, other than typing it out. Where am I going wrong?

Habile answered 14/3, 2017 at 5:27 Comment(0)

In Python 3 your original string is a Unicode string, but its code points look like UTF-8 bytes that were decoded incorrectly: each \xHH escape became a separate character instead of one byte of a multi-byte sequence. To fix it:

>>> s = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'
>>> type(s)
<class 'str'>
>>> s.encode('latin1')
b'Mahou Shoujo Madoka\xe2\x98\x85Magica'
>>> s.encode('latin1').decode('utf8')
'Mahou Shoujo Madoka★Magica'

The latin1 encoding happens to map 1:1 to the first 256 code points in Unicode, so .encode('latin1') translates the code points directly back to bytes. Then you can .decode('utf8') the bytes properly.
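
As a minimal sketch, the round trip can be wrapped in a small helper. The name fix_mojibake and the choice to return the input unchanged on failure are assumptions made for illustration, not part of the original answer. The last two lines show why .encode('utf-8') seemed to add extra bytes in the question: each code point above U+007F becomes two or more UTF-8 bytes, whereas latin-1 emits exactly one byte per code point.

```python
def fix_mojibake(text: str) -> str:
    """Reinterpret code points U+0000..U+00FF as raw bytes, then decode them as UTF-8.

    Hypothetical helper: returns the input unchanged if the round trip fails.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the string contains code points above U+00FF (not encodable
        # as Latin-1), or the resulting bytes are not valid UTF-8.
        return text


s = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'
print(fix_mojibake(s))           # Mahou Shoujo Madoka★Magica

# The same code point, encoded two ways:
print('\xe2'.encode('utf-8'))    # b'\xc3\xa2'  (two bytes)
print('\xe2'.encode('latin-1'))  # b'\xe2'      (one byte)
```

A string that is already correct, such as '★', cannot be encoded as Latin-1, so this sketch leaves it untouched rather than raising.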

Cuckooflower answered 14/3, 2017 at 5:33 Comment(2)
To assign a string to a variable, you can shorten the above to s = b"\xe2\x98\x85".decode("utf8") as well. – Clothier
@Clothier you could shorten it to s='★' but that wasn't the point of the question. – Cuckooflower
