How can I convert surrogate pairs to normal string in Python?
Asked Answered
A

2

55

This is a follow-up to How can I convert JSON-encoded data that contains Unicode surrogate pairs to string?. In that question, the OP had a json.dumps()-encoded file with an emoji represented as a surrogate pair - \ud83d\ude4f. They were having problems reading the file and translating the emoji correctly, and the correct answer was to json.loads() each line from the file, and the json module would handle the conversion from surrogate pair back to (I'm assuming UTF8-encoded) emoji.

So here is my situation: say I have just a regular Python 3 unicode string with a surrogate pair in it:

emoji = "This is \ud83d\ude4f, an emoji."

How do I process this string to get a representation of the emoji out of it? I'm looking to get something like this:

"This is πŸ™, an emoji."
# or
"This is \U0001f64f, an emoji."

I've tried:

print(emoji)
print(emoji.encode("utf-8")) # also tried "ascii", "utf-16", and "utf-16-le"
json.loads(emoji) # and `.encode()` with various codecs

Generally I get an error similar to UnicodeEncodeError: XXX codec can't encode character '\ud83d' in position 8: surrogates no allowed.

I'm running Python 3.5.1 on Linux, with $LANG set to en_US.UTF-8. I've run these samples both in the Python interpreter on the command line, and within IPython running in Sublime Text - there don't appear to be any differences.

Allergist answered 1/7, 2016 at 13:55 Comment(4)
tweepy (and generally Twitter I guess) seems to be doing this. Mentioning it here in the hope that more Google searches for this problem will find this answer. – Duprey
In the reverse direction (single character to surrogate pair): Python: Find equivalent surrogate pair from non-BMP unicode char - Stack Overflow – Petrolatum
Looking around python - Current idiom for removing 'surrogateescape' characters fron a decoded string - Stack Overflow mentions ftfy.fixes.fix_surrogates(text) (third-party library) – Petrolatum
To follow up on my earlier comment, actually anything which uses JSON to store Unicode strings is forced to use surrogate pairs, because the JSON string notation does not support non-BMP code points natively. – Duprey
T
67

You've mixed a literal string \ud83d in a json file on disk (six characters: \ u d 8 3 d) and a single character u'\ud83d' (specified using a string literal in Python source code) in memory. It is the difference between len(r'\ud83d') == 6 and len('\ud83d') == 1 on Python 3.

If you see '\ud83d\ude4f' Python string (2 characters) then there is a bug upstream. Normally, you shouldn't get such string. If you get one and you can't fix upstream that generates it; you could fix it using surrogatepass error handler:

>>> "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')
'πŸ™'

Python 2 was more permissive.

Note: even if your json file contains literal \ud83d\ude4f (12 characters); you shouldn't get the surrogate pair:

>>> print(ascii(json.loads(r'"\ud83d\ude4f"')))
'\U0001f64f'

Notice: the result is 1 character ( '\U0001f64f'), not the surrogate pair ('\ud83d\ude4f').

Type answered 1/7, 2016 at 14:28 Comment(0)
D
21

Because this is a recurring question and the error message is slightly obscure, here is a more detailed explanation.

Surrogates are a way to express Unicode code points bigger than U+FFFF.

Recall that Unicode was originally specified to contain 65,536 characters, but that it was soon found that this was not enough to accommodate all the glyphs of the world.

As an extension mechanism for the (otherwise fixed-width) UTF-16 encoding, a reserved area was set up to contain a mechanism for expressing code points outside the Basic Multilingual Plane: Any code point in this special area would have to be followed by another character code from the same area, and together, they would express a code point with a number larger than the old limit.

(Strictly speaking, the surrogates area is divided into two halves; the first surrogate in a pair needs to come from the High Surrogates half, and the second, from the Low Surrogates. Confusingly, the High Surrogates U+D800-U+DBFF have lower code point numbers than the Low Surrogates U+DC00-U+DFFF.)

This is a legacy mechanism to support the UTF-16 encoding specifically, and should not be used in other encodings; they do not need it, and the applicable standards specifically say that this is disallowed.

In other words, while U+12345 can be expressed with the surrogate pair U+D808 U+DF45, you should simply express it directly instead unless you are specifically using UTF-16.

In some more detail, here is how this would be expressed in UTF-8 as a single character:

0xF0 0x92 0x8D 0x85

And here is the corresponding surrogate sequence:

0xED 0xA0 0x88
0xED 0xBD 0x85

As already suggested in the accepted answer, you can round-trip with something like

>>> "\ud808\udf45".encode('utf-16', 'surrogatepass').decode('utf-16').encode('utf-8')
b'\xf0\x92\x8d\x85'

Perhaps see also http://www.russellcottrell.com/greek/utilities/surrogatepaircalculator.htm

Duprey answered 6/2, 2019 at 8:16 Comment(9)
Related:#33642839 – Duprey
this is misleading. "\U0001f64f" is a single character (1 Unicode code point here) in Python (You don't need surrogate pairs to represent it). Now, various character encodings may require several code units to represent it (utf-8 uses 8 bit (1 byte) code units, utf-16 uses 16-bit (2 bytes) code units. Both are variable width encodings: a single unicode code point may require several code units. JSON allows UTF-16 surrogate pairs for \u escapes. But again, it also doesn't require them. In principle, JSON allows to keep the surrogate pair as two code point but Python doesn't support it – Type
@Type Which part do you think is misleading? I should perhaps clarify that JSON has no other mechanism than surrogates to express code points outside the BMP. – Duprey
this is wrong: "JSON has no other mechanism than surrogates to express code points outside the BMP" You can put any Unicode character as is inside a json string (except for " quote, and \` slash, or control char ) e.g., "πŸ™"` is a valid json string according to the standard ecma-international.org/wp-content/uploads/… (you can express it in json via surrogate escapes but you don't have to) – Type
Right; the correct articulation would be "JSON has no mechanism for specifying a non-BMP code point using a single escape sequence". Some use cases require or at least recommend that you keep your JSON as 7-bit ASCII. – Duprey
even if there is surrogate pair (12 chars) in the json source, you should see just one Unicode code point in your Python code when you load it. See the explicit example in my answer. – Type
The question is about why Python refuses to load a surrogate pair and what you can do about it. Of course it is a single code point. I fail to see how you could construe this to mean anything else. json.loads handles it fine but if you are reading e.g. a CSV file it will fail. My answer explains why and suggests some possible solutions. – Duprey
no, the question is why emoji.encode("utf-8") produces UnicodeEncodeError error and the likely answer is that OP copy-pasted from a json file into a Python string literal, and therefore it is a bug (the preferred solution is to fix upstream), that is why my answer talks about the difference between r'\ud83d' and '\ud83d'. You lost me around mentioning CSV. csv works with non-BMP Unicode characters just fine – Type
The reason I posted a second answer is that a lot of people are trying to import e.g. Twitter CSV data which uses the JSON encoding with surrogate pairs, and it wasn't obvious why this was a duplicate for their question. – Duprey

© 2022 - 2024 β€” McMap. All rights reserved.