This is a follow-up to "How can I convert JSON-encoded data that contains Unicode surrogate pairs to string?". In that question, the OP had a json.dumps()-encoded file with an emoji represented as a surrogate pair, \ud83d\ude4f. They were having problems reading the file and translating the emoji correctly, and the correct answer was to json.loads() each line from the file, and the json module would handle the conversion from surrogate pair back to (I'm assuming UTF-8-encoded) emoji.
So here is my situation: say I have just a regular Python 3 unicode string with a surrogate pair in it:
emoji = "This is \ud83d\ude4f, an emoji."
How do I process this string to get a representation of the emoji out of it? I'm looking to get something like this:
"This is 🙏, an emoji."
# or
"This is \U0001f64f, an emoji."
I've tried:
print(emoji)
print(emoji.encode("utf-8")) # also tried "ascii", "utf-16", and "utf-16-le"
json.loads(emoji) # and `.encode()` with various codecs
Generally I get an error similar to UnicodeEncodeError: XXX codec can't encode character '\ud83d' in position 8: surrogates not allowed.
I'm running Python 3.5.1 on Linux, with $LANG set to en_US.UTF-8. I've run these samples both in the Python interpreter on the command line and within IPython running in Sublime Text; there don't appear to be any differences.
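For reference, one commonly suggested approach for a string like this is a round trip through UTF-16 using the surrogatepass error handler, which lets the encoder write the two surrogate code points as UTF-16 code units so that decoding reads them back as a single character. The following is only a minimal sketch under the assumption that every surrogate in the string is part of a well-formed pair; the helper name merge_surrogates is hypothetical and not part of any library.

```python
def merge_surrogates(text: str) -> str:
    """Combine UTF-16 surrogate pairs inside a str into single code points."""
    # "surrogatepass" allows the surrogate code points to be encoded as raw
    # UTF-16 code units instead of raising UnicodeEncodeError.
    # Decoding as UTF-16 then interprets each high/low pair as one character.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


emoji = "This is \ud83d\ude4f, an emoji."
fixed = merge_surrogates(emoji)

# ascii() shows the escaped form, so the result is visible on any terminal.
print(ascii(fixed))  # 'This is \U0001f64f, an emoji.'
```

An unpaired surrogate would still make the decode step fail, so input that may contain lone surrogates needs separate handling.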
tweepy (and generally Twitter I guess) seems to be doing this. Mentioning it here in the hope that more Google searches for this problem will find this answer. – Duprey
ftfy.fixes.fix_surrogates(text) (third-party library) – Petrolatum