The answer presented here: How to work with surrogate pairs in Python? tells you how to convert a surrogate pair, such as '\ud83d\ude4f', into a single non-BMP Unicode character (the answer being "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')). I would like to know how to do this in reverse. How can I, using Python, find the equivalent surrogate pair from a non-BMP character, converting '\U0001f64f' (🙏) back to '\ud83d\ude4f'? I couldn't find a clear answer to that.
Python: Find equivalent surrogate pair from non-BMP unicode char
You'll have to manually replace each non-BMP point with the surrogate pair. You could do this with a regular expression:
import re

# Matches any single code point outside the Basic Multilingual Plane.
_nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')

def _surrogatepair(match):
    char = match.group()
    assert ord(char) > 0xffff
    # UTF-16 encodes a non-BMP code point as two 16-bit code units.
    encoded = char.encode('utf-16-le')
    return (
        chr(int.from_bytes(encoded[:2], 'little')) +
        chr(int.from_bytes(encoded[2:], 'little')))

def with_surrogates(text):
    return _nonbmp.sub(_surrogatepair, text)
Demo:
>>> with_surrogates('\U0001f64f')
'\ud83d\ude4f'
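The same pair can also be computed arithmetically, without going through an encoder. The following is only a sketch of the standard UTF-16 surrogate formula; the function name to_surrogate_pair is made up for this illustration and is not part of the answer above.

```python
def to_surrogate_pair(char):
    """Return the UTF-16 surrogate pair for one non-BMP character."""
    code_point = ord(char)
    if code_point <= 0xFFFF:
        raise ValueError("character is inside the BMP; it has no surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits select the low surrogate
    return chr(high) + chr(low)


print(ascii(to_surrogate_pair('\U0001f64f')))  # '\ud83d\ude4f'
```

For U+1F64F the offset is 0xF64F, which splits into 0x3D and 0x24F, giving 0xD83D and 0xDE4F.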
If you already know you have a code point outside of the BMP, then of course the regex part is not necessary. Just x = char.encode('utf-16-le'); return [chr(int.from_bytes(y, 'little')) for y in (x[0:2], x[2:4])] – Formally
It's a little complex, but here's a one-liner to convert a single character:
>>> import struct
>>> emoji = '\U0001f64f'
>>> ''.join(chr(x) for x in struct.unpack('>2H', emoji.encode('utf-16be')))
'\ud83d\ude4f'
To convert a mix of characters requires surrounding that expression with another:
>>> emoji_str = 'Here is a non-BMP character: \U0001f64f'
>>> ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack('>2H', c.encode('utf-16be'))) for c in emoji_str)
'Here is a non-BMP character: \ud83d\ude4f'
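One way to sanity-check a conversion like this is to round-trip it through the surrogatepass encode/decode quoted in the question. This is a minimal sketch, assuming the regex-based approach from the first answer; without_surrogates is a hypothetical helper name introduced here.

```python
import re

_nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')


def _pair(match):
    # Split the UTF-16 encoding of one non-BMP character into its two code units.
    data = match.group().encode('utf-16-le')
    return (chr(int.from_bytes(data[:2], 'little')) +
            chr(int.from_bytes(data[2:], 'little')))


def with_surrogates(text):
    return _nonbmp.sub(_pair, text)


def without_surrogates(text):
    # Reverse direction, as quoted in the question (hypothetical helper name).
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


original = 'Here is a non-BMP character: \U0001f64f'
converted = with_surrogates(original)

assert len(converted) == len(original) + 1      # one character became two
assert without_surrogates(converted) == original

# A string holding surrogates cannot be encoded strictly, e.g. to UTF-8.
try:
    converted.encode('utf-8')
except UnicodeEncodeError:
    print('strict UTF-8 encoding rejects surrogates')
```

The last step is worth keeping in mind: the converted string is only safe to write out with an error handler such as 'surrogatepass', or after converting it back.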
I stayed away from str.join() for just two values; I found using two chr() calls to be more readable. I didn't test this for speed, however. Using your one-liner to process each character one by one in a for loop is going to be very slow compared to a re.sub() approach (which can scan text in a C loop). – Mu
Remark: using struct.unpack this way makes it work for exactly one emoji character. For a string it's possible to use x = array.array("H"); x.frombytes(<byte array in UTF-16 LE>); –
Ey
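The array.array("H") suggestion from the comment above could look roughly like this. It is a sketch only: the name with_surrogates_array is invented for illustration, and because type code "H" uses the machine's native byte order, the codec is picked from sys.byteorder rather than hard-coding UTF-16 LE.

```python
import array
import sys


def with_surrogates_array(text):
    # 'H' is an unsigned short in native byte order, so match the codec to it.
    codec = 'utf-16-le' if sys.byteorder == 'little' else 'utf-16-be'
    units = array.array('H')
    assert units.itemsize == 2, "this sketch assumes 16-bit unsigned shorts"
    # 'surrogatepass' lets text that already contains surrogates through unchanged.
    units.frombytes(text.encode(codec, 'surrogatepass'))
    # Each 16-bit code unit becomes one character; non-BMP characters yield two.
    return ''.join(map(chr, units))


print(ascii(with_surrogates_array('Here is a non-BMP character: \U0001f64f')))
```

Unlike the struct.unpack('>2H', ...) one-liner, this handles a string of any length in a single pass.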
'\ud83d\ude4f' string, or would the UTF-16 encoding do? – Mu
hex(ord(b"\x90".decode('u8', "surrogateescape"))) (→ 0xDC90) -------- Instead, use the UTF-16 encoded bytes object, or just a list of int UTF-16 code points. – Ey