How to encode a Python 3 string using \u escape codes?

Asked 28/8, 2015 at 22:39 Answered 28/8, 2015 at 22:46

Solved python python-3.x unicode unicode-escapes

In Python 3, suppose I have

>>> thai_string = 'สีเ'

Using encode gives

>>> thai_string.encode('utf-8')
b'\xe0\xb8\xaa\xe0\xb8\xb5\xe0\xb9\x80'

My question: how can I get encode() to return a bytes sequence using \u instead of \x? And how can I decode them back to a Python 3 str type?

I tried using the ascii builtin, which gives

>>> ascii(thai_string)
"'\\u0e2a\\u0e35\\u0e40'"

But this doesn't seem quite right, as I can't decode it back to obtain thai_string.

Python documentation tells me that

\xhh escapes the character with the hex value hh while
\uxxxx escapes the character with the 16-bit hex value xxxx

The documentation says that \u is only used in string literals, but I'm not sure what that means. Is this a hint that my question has a flawed premise?

Lashondalashonde asked 28/8, 2015 at 22:39 Comment(7)

What about .decode('utf-8')? Aren't strings in Python unicode anyway? – Paramedic 28/8, 2015 at 22:45

@Zizouz212, neither thai_string nor ascii(thai_string) has a decode method, and thai_string.encode('utf-8').decode('utf-8') brings me back to where I started, thai_string, which is not the desired output. – Lashondalashonde 28/8, 2015 at 23:0

Python documentation relevant to the escape sequence \u: docs.python.org/3/reference/lexical_analysis.html and docs.python.org/3/library/codecs.html#encodings-and-unicode – Sabinasabine 8/4, 2021 at 2:56

Relevant: https://mcmap.net/q/74841/-quot-unicode-error-39-unicodeescape-39-codec-can-39-t-decode-bytes-quot-when-writing-windows-file-paths-duplicate/1959808 – Sabinasabine 8/4, 2021 at 3:1

Does this answer your question? How to work with surrogate pairs in Python? – Lynea 26/7, 2021 at 15:45

I also use ascii(sku).replace(r"\x", r"\u00") and it works better – Polka 27/7, 2021 at 22:48

@FelipeBuccioni That code corrupts strings that contain a backslash followed by a literal x. – Foremast 4/11, 2021 at 16:53
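
The corruption described in the comment above can be reproduced with a small sketch. The helper name escape_with_replace and the sample inputs are made up for illustration; only ascii() and str.replace() come from the comment being discussed.

```python
def escape_with_replace(text):
    # Hypothetical helper wrapping the ascii(...).replace(...) trick from the comment.
    return ascii(text).replace(r"\x", r"\u00")

# Works as intended for a character that ascii() renders with \x:
print(escape_with_replace("é"))            # '\u00e9'

# A string that really contains a backslash followed by the letter x.
# ascii() doubles the backslash, and the replace then eats the "x".
path_like = "C:\\xfiles"
print(escape_with_replace(path_like))      # 'C:\\u00files'
```

In the second case the escaped text no longer describes the original string, because a real "\x" in the data is indistinguishable from a \x escape produced by ascii().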

You can use unicode_escape:

>>> thai_string.encode('unicode_escape')
b'\\u0e2a\\u0e35\\u0e40'

Note that encode() always returns a bytes object, and that the unicode_escape encoding is intended to:

Produce a string that is suitable as Unicode literal in Python source code
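
The question also asks how to get back to a str. A minimal sketch of the round trip, assuming the bytes were produced by the same unicode_escape codec (the variable names escaped_bytes and restored are only for illustration):

```python
# Same three code points as the string in the question, written with escapes.
thai_string = "\u0e2a\u0e35\u0e40"

# str -> bytes containing only ASCII, with \uXXXX escapes spelled out.
escaped_bytes = thai_string.encode("unicode_escape")
print(escaped_bytes)               # b'\\u0e2a\\u0e35\\u0e40'

# bytes -> str, interpreting the escapes again.
restored = escaped_bytes.decode("unicode_escape")
print(restored == thai_string)     # True
```

Decoding with unicode_escape treats any byte that is not part of an escape as Latin-1, so it is best applied only to data that was encoded this way in the first place.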

Hadik answered 28/8, 2015 at 22:46 Comment(3)

Perfect. But why does this string have two slashes before the "u" while the "x" only has one? – Lashondalashonde 31/8, 2015 at 3:26

This is simply how Python displays a literal backslash inside a quoted string. Compare '\\n' (literal backslash, literal n) to '\n' (newline character). – Ealing 4/11, 2021 at 15:27

If you want the result as a string, you can tack on .decode('ascii') – Ealing 4/11, 2021 at 15:28
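
The two comments above can be checked with a short sketch: the doubled backslash only appears in the repr, and .decode('ascii') turns the escaped bytes into an ordinary str. The variable names are illustrative.

```python
escaped = "\u0e2a\u0e35\u0e40".encode("unicode_escape")

# repr() (what the REPL shows) doubles each backslash...
print(escaped)                     # b'\\u0e2a\\u0e35\\u0e40'
# ...but the data holds a single backslash per escape: 3 escapes of 6 bytes each.
print(len(escaped))                # 18

# Turn the ASCII-only bytes into a str that contains the escape text itself.
escaped_text = escaped.decode("ascii")
print(escaped_text)                # \u0e2a\u0e35\u0e40
print(escaped_text.count("\\"))    # 3
```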
