In Python 3, suppose I have
>>> thai_string = 'สี'
Using encode gives
>>> thai_string.encode('utf-8')
b'\xe0\xb8\xaa\xe0\xb8\xb5'
My question: how can I get encode() to return a bytes sequence using \u instead of \x? And how can I decode them back to a Python 3 str type?
I tried using the ascii builtin, which gives
>>> ascii(thai_string)
"'\\u0e2a\\u0e35'"
But this doesn't seem quite right, as I can't decode it back to obtain thai_string.
Python documentation tells me that \xhh escapes the character with the hex value hh, while \uxxxx escapes the character with the 16-bit hex value xxxx.
The documentation says that \u is only used in string literals, but I'm not sure what that means. Is this a hint that my question has a flawed premise?
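To make the "only used in string literals" point concrete, here is a minimal sketch (not taken from the documentation). It assumes the two code points in question are U+0E2A and U+0E35, as the ascii() output above suggests. It shows that \u and \x are source-code notation: the \x seen in the encode() output is just how a bytes object is displayed, not something stored in the data.

```python
# \uXXXX in a str literal names one code point; it is resolved when the
# source is parsed, so the resulting string holds 2 characters, not 12.
thai_string = '\u0e2a\u0e35'
assert len(thai_string) == 2
assert thai_string == chr(0x0E2A) + chr(0x0E35)

# \xhh means different things depending on the literal type:
assert '\xe0' == chr(0xE0)        # str literal: the code point U+00E0
assert b'\xe0' == bytes([0xE0])   # bytes literal: the single byte 0xE0

# UTF-8 needs three bytes for each of these two code points.
utf8 = thai_string.encode('utf-8')
assert len(utf8) == 6
assert utf8 == bytes([0xE0, 0xB8, 0xAA, 0xE0, 0xB8, 0xB5])

# The \x escapes appear only in the repr; the bytes are plain integers.
assert list(utf8) == [224, 184, 170, 224, 184, 181]
```

So encode('utf-8') cannot "use \u": the result is six raw bytes, and \x is only the display convention for bytes that are not printable ASCII.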
.decode('utf-8')? Aren't strings in Python unicode anyway? – Paramedic
Neither thai_string nor ascii(thai_string) has a decode method, and thai_string.encode('utf-8').decode('utf-8') brings me back to where I started, thai_string, which is not the desired output. – Lashondalashonde
\u: docs.python.org/3/reference/lexical_analysis.html and docs.python.org/3/library/codecs.html#encodings-and-unicode – Sabinasabine
ascii(sku).replace(r"\x", r"\u00") and works better – Polka
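One possible way to get what the question asks for, sketched on the assumption that the goal is an ASCII-only bytes sequence containing literal backslash-u escapes: the standard unicode_escape codec. The variable names escaped, restored and via_ascii are illustrative only.

```python
import ast

thai_string = '\u0e2a\u0e35'

# Encode to ASCII bytes that spell out the escapes: backslash, 'u', 4 hex digits.
escaped = thai_string.encode('unicode_escape')
assert escaped == b'\\u0e2a\\u0e35'
assert len(escaped) == 12

# Decoding with the same codec restores the original str.
restored = escaped.decode('unicode_escape')
assert restored == thai_string

# ascii() returns a quoted literal, so it round-trips by parsing it as one.
via_ascii = ast.literal_eval(ascii(thai_string))
assert via_ascii == thai_string

# Caveat: code points in the range U+0080..U+00FF are written as \xhh,
# not \u00hh, by both unicode_escape and ascii().
assert 'é'.encode('unicode_escape') == b'\\xe9'
assert ascii('é') == "'\\xe9'"
```

The caveat at the end is presumably what the replace(r"\x", r"\u00") suggestion in the last comment is working around: it rewrites those two-digit escapes into the four-digit \u form.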