Python: Find equivalent surrogate pair from non-BMP unicode char
The answer presented here: How to work with surrogate pairs in Python? tells you how to convert a surrogate pair, such as '\ud83d\ude4f', into a single non-BMP unicode character (the answer being "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')). I would like to know how to do this in reverse. How can I, using Python, find the equivalent surrogate pair from a non-BMP character, converting '\U0001f64f' (🙏) back to '\ud83d\ude4f'? I couldn't find a clear answer to that.

Bombsight answered 24/10, 2016 at 16:13 Comment(5)
Do you absolutely need the (technically invalid) '\ud83d\ude4f' string, or would the UTF-16 encoding do? – Mu
I'm not sure, but I think so. Typing print('\U0001f64f') on the IDLE shell will raise an error message "Non-BMP character not supported in Tk", but typing print('\ud83d\ude4f') (on IDLE) will in fact print the non-BMP emoji character to the IDLE shell, which is supposed to be impossible. – Bombsight
Printing non-BMP characters onto the IDLE screen is supposedly impossible, but using surrogate pairs at least some of them are printable. That's why I need the "technically invalid" string '\ud83d\ude4f'. If you know another way to print the character to IDLE (using UTF-16 encoding perhaps), that's fine, but finding the surrogate pair will do. – Bombsight
Note that you normally don't want to have raw surrogate characters in a normal Python string. Sometimes Python uses them for other purposes (see PEP 0383, and try running hex(ord(b"\x90".decode('u8', "surrogateescape"))) (→ 0xDC90)). Instead, use the UTF-16 encoded bytes object, or just a list of int UTF-16 code units. – Ey
In fact, in new Python versions this is no longer really needed as IDLE now somewhat supports non-BMP characters. Not perfectly, editing lines with non-BMP characters results in weird behavior, but at least they can be printed and pasted without errors or crashing. I'm currently using Python 3.9.1 on Windows 10 (and emojis can be pasted and printed without any need for surrogate pairs), but anyone using, say, Python 3.6, may still find this page useful. – Bombsight
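To make the "technically invalid" point from the comments concrete, here is a minimal sketch (variable names such as `pair` and `escaped` are only illustrative). It assumes CPython 3.4 or later, where the 'surrogatepass' handler is available for the UTF-16 codecs. It shows that a string of raw surrogates cannot be encoded as strict UTF-8, that the round trip from the linked answer restores the real character, and that 'surrogateescape' produces lone surrogates for undecodable bytes.

```python
pair = '\ud83d\ude4f'  # two surrogate code points, not one character

# Strict UTF-8 encoding rejects surrogate code points.
try:
    pair.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Round trip from the linked answer: the pair becomes one non-BMP character.
restored = pair.encode('utf-16', 'surrogatepass').decode('utf-16')

# PEP 383: an undecodable byte 0x90 is smuggled in as the lone surrogate U+DC90.
escaped = b'\x90'.decode('utf-8', 'surrogateescape')

print(strict_ok)                 # False
print(ascii(restored))           # '\U0001f64f'
print(hex(ord(escaped)))         # 0xdc90
```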
M
5

You'll have to manually replace each non-BMP point with the surrogate pair. You could do this with a regular expression:

import re

_nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')

def _surrogatepair(match):
    char = match.group()
    assert ord(char) > 0xffff
    encoded = char.encode('utf-16-le')
    return (
        chr(int.from_bytes(encoded[:2], 'little')) + 
        chr(int.from_bytes(encoded[2:], 'little')))

def with_surrogates(text):
    return _nonbmp.sub(_surrogatepair, text)

Demo:

>>> with_surrogates('\U0001f64f')
'\ud83d\ude4f'
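For comparison, the same pair can be computed arithmetically from the code point, without going through an encoder. This is only a sketch of the standard UTF-16 formula, not part of the approach above; the helper name `surrogate_pair` is made up for illustration.

```python
def surrogate_pair(char):
    """Return the UTF-16 surrogate pair for one non-BMP character as a 2-char str."""
    code = ord(char)
    if code <= 0xFFFF:
        raise ValueError("character is in the BMP and has no surrogate pair")
    offset = code - 0x10000            # 20-bit value left after removing the BMP range
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low (trail) surrogate
    return chr(high) + chr(low)

print(ascii(surrogate_pair('\U0001f64f')))  # '\ud83d\ude4f'
```

It could be dropped into the regex version as the replacement callback, e.g. `_nonbmp.sub(lambda m: surrogate_pair(m.group()), text)`.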
Mu answered 24/10, 2016 at 16:28 Comment(1)
If you already know you have a code point outside of the BMP, then of course the regex part is not necessary. Just x = char.encode('utf-16-le'); return [chr(int.from_bytes(y, 'little')) for y in (x[0:2], x[2:4])] – Formally
W
3

It's a little complex, but here's a one-liner to convert a single character:

>>> import struct
>>> emoji = '\U0001f64f'
>>> ''.join(chr(x) for x in struct.unpack('>2H', emoji.encode('utf-16be')))
'\ud83d\ude4f'

Converting a string that mixes BMP and non-BMP characters requires wrapping that expression in another:

>>> emoji_str = 'Here is a non-BMP character: \U0001f64f'
>>> ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack('>2H', c.encode('utf-16be'))) for c in emoji_str)
'Here is a non-BMP character: \ud83d\ude4f'
Wisner answered 24/10, 2016 at 17:23 Comment(2)
I stayed away from str.join() for just two values; I found using two chr() calls to be more readable; I didn't test this on speed however. Using your one-liner to process each character one by one in a for loop is going to be very slow compared to a re.sub() approach (which can scan text in a C loop). – Mu
Remark: using struct.unpack this way works for exactly one non-BMP character. For a whole string it's possible to use x = array.array("H"); x.frombytes(<bytes in UTF-16-LE>) – Ey
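The array.array suggestion in the last comment could look roughly like the sketch below. The function names are hypothetical, and it assumes the 'H' type code is 2 bytes wide on the platform (checked with an assert). The bytes are encoded in the machine's native byte order, because array.array reads them that way.

```python
import array
import sys

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    units = array.array('H')   # unsigned short; assumed to be 2 bytes wide
    assert units.itemsize == 2
    # array.array interprets bytes in native order, so encode to match it.
    encoding = 'utf-16-le' if sys.byteorder == 'little' else 'utf-16-be'
    units.frombytes(text.encode(encoding))
    return units.tolist()

def with_surrogates_array(text):
    """Rebuild text with every non-BMP character replaced by its surrogate pair."""
    return ''.join(map(chr, utf16_code_units(text)))

print([hex(u) for u in utf16_code_units('A\U0001f64f')])  # ['0x41', '0xd83d', '0xde4f']
print(ascii(with_surrogates_array('Here: \U0001f64f')))   # 'Here: \ud83d\ude4f'
```

The list of ints is the form the earlier comment recommends keeping around; the str version is only needed where raw surrogates are actually required, as in the old IDLE workaround.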
