A Python regex that matches the regional indicator character class

I am using Python 2.7.10 on a Mac. Flags in emoji are represented by a pair of Regional Indicator Symbols. I would like to write a Python regex that inserts a space between the flags in a string of emoji flags.

  • For example, this string is two Brazilian flags:

    • u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7"

    • which will render like this: 🇧🇷🇧🇷

I'd like to insert a space after each pair of regional indicator symbols. Something like this:

re.sub(re.compile(u"([\U0001F1E6-\U0001F1FF][\U0001F1E6-\U0001F1FF])"),
       r"\1 ", 
       u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7")

...which would result in:

u"\U0001F1E7\U0001F1F7 \U0001F1E7\U0001F1F7 "

...but that code gives me an error:

sre_constants.error: bad character range

A hint (I think) at what's going wrong is the following, which shows that \U0001F1E7 is turning into two "characters" in the regex:

re.search(re.compile(u"([\U0001F1E7])"),
          u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7").group(0)

This results in:

u'\ud83c'

Sadly, my understanding of Unicode is too weak for me to make further progress.

Girovard answered 23/8, 2016 at 18:26 Comment(3)
It is giving me this: '🇧🇷 🇧🇷 ' in Python 3.5.1 for your first try. – Short
This code works in python2.7 on arch linux. – Unstained
This is no longer an issue on most Python 3.x builds. Your code should assert sys.maxunicode is >= 1114111 (wide builds), not 65535 (narrow builds). See Unicode in Python - just UTF-16? – Skied

I believe you're using Python 2.7 on Windows or Mac, where the default builds are narrow (16-bit code units, i.e. UTF-16). Linux/glibc builds usually have wide, 32-bit full Unicode, and Python 3.3 and later use full Unicode on all platforms.

What you see is the one code point being split into a surrogate pair. Unfortunately, it also means that you cannot easily use a single character class for this task. However, it is still possible. The UTF-16 representation of U+1F1E6 (🇦) is \uD83C\uDDE6, and that of U+1F1FF (🇿) is \uD83C\uDDFF.
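
The surrogate values quoted above can be checked with a little arithmetic. The sketch below uses a hypothetical helper, to_surrogate_pair (not part of any library), that applies the standard UTF-16 split to a code point above U+FFFF:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    offset = code_point - 0x10000          # 20-bit offset into the supplementary planes
    high = 0xD800 + (offset >> 10)         # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go into the low surrogate
    return high, low


# First and last Regional Indicator Symbols
for cp in (0x1F1E6, 0x1F1FF):
    high, low = to_surrogate_pair(cp)
    print("U+%X -> \\u%04X \\u%04X" % (cp, high, low))
# U+1F1E6 -> \uD83C \uDDE6
# U+1F1FF -> \uD83C \uDDFF
```

Both ends of the range share the high surrogate \uD83C, which is why only the low surrogate needs a character class.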

I don't have access to such a Python build, but you could try

\uD83C[\uDDE6-\uDDFF]

as a replacement for a single [\U0001F1E6-\U0001F1FF]; your whole regex would then be

(\uD83C[\uDDE6-\uDDFF]\uD83C[\uDDE6-\uDDFF])

The reason the original character class doesn't work is that a narrow build sees [\U0001F1E6-\U0001F1FF] as [\uD83C\uDDE6-\uD83C\uDDFF]. The regex engine then tries to make a range from the second half of the first surrogate pair (\uDDE6) to the first half of the second surrogate pair (\uD83C). This fails because the start of the range is numerically greater than the end.
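
The failure can be reproduced without a narrow build by spelling out the surrogates by hand, which is roughly what a narrow build does to the pattern behind the scenes. This is only a sketch: compiles is a hypothetical helper, and the example assumes Python 3, where each \uXXXX surrogate escape in a string literal stays a separate code point.

```python
import re


def compiles(pattern):
    """Return True if the pattern compiles, False if the regex engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# What a narrow build effectively sees for [\U0001F1E6-\U0001F1FF]:
# the class contains \ud83c, the range \udde6-\ud83c (start > end), and \uddff.
as_seen_by_narrow_build = u"[\ud83c\udde6-\ud83c\uddff]"

# The workaround: fixed high surrogate, then a class over the low surrogate only.
surrogate_workaround = u"\ud83c[\udde6-\uddff]"

print(compiles(as_seen_by_narrow_build))  # False (bad character range)
print(compiles(surrogate_workaround))     # True
```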

However, this surrogate-based regular expression won't work on Linux; use the original there, since Linux builds use wide Unicode by default.
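
Since the right pattern depends on the build, one option is to pick it at import time from sys.maxunicode. The following is a minimal, untested-on-narrow-builds sketch; the names RI, FLAG and space_flags are made up for illustration.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build (and every Python 3.3+): one code point per regional indicator.
    RI = u"[\U0001F1E6-\U0001F1FF]"
else:
    # Narrow build: each regional indicator is a surrogate pair with a fixed
    # high surrogate, so only the low surrogate goes into the character class.
    RI = u"\ud83c[\udde6-\uddff]"

# A flag is two consecutive regional indicators.
FLAG = re.compile(u"(" + RI + RI + u")")


def space_flags(text):
    """Insert a space after each pair of regional indicator symbols."""
    return FLAG.sub(r"\1 ", text)


two_brazilian_flags = u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7"
spaced = space_flags(two_brazilian_flags)
print(spaced == u"\U0001F1E7\U0001F1F7 \U0001F1E7\U0001F1F7 ")
```

As in the question's expected output, this leaves a trailing space after the last flag; calling .rstrip() on the result would remove it if that is unwanted.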


Alternatively, upgrade your Python to 3.5 or later.

Relationship answered 23/8, 2016 at 18:32 Comment(0)
