Why do Unicode emoji property escapes match numbers?

Asked 16/10, 2020 at 12:35 Answered 19/5, 2023 at 13:0

I found this awesome way to detect emojis using a regex that doesn't use "huge magic ranges" by using a Unicode property escape:

console.log(/\p{Emoji}/u.test('flowers 🌼🌺🌸')) // true
console.log(/\p{Emoji}/u.test('flowers')) // false

But when I shared this knowledge in this answer, @Bronzdragon noticed that \p{Emoji} also matches numbers! Why is that? Numbers are not emojis?

console.log(/\p{Emoji}/u.test('flowers 123')) // unexpectedly true

// regex-only workaround by @Bronzdragon
const regex = /(?=\p{Emoji})(?!\p{Number})/u;
console.log(
  regex.test('flowers'), // false, as expected
  regex.test('flowers 123'), // false, as expected
  regex.test('flowers 123 🌼🌺🌸'), // true, as expected
  regex.test('flowers 🌼🌺🌸'), // true, as expected
)

// more readable workaround
const hasEmoji = str => {
  const nbEmojiOrNumber = (str.match(/\p{Emoji}/gu) || []).length;
  const nbNumber = (str.match(/\p{Number}/gu) || []).length;
  return nbEmojiOrNumber > nbNumber;
}
console.log(
  hasEmoji('flowers'), // false, as expected
  hasEmoji('flowers 123'), // false, as expected
  hasEmoji('flowers 123 🌼🌺🌸'), // true, as expected
  hasEmoji('flowers 🌼🌺🌸'), // true, as expected
)

Atmospherics answered 16/10, 2020 at 12:35 Comment(6)

Note that the workaround also fails for '123 flowers 🌼🌺🌸' for example - that should return true, as it definitely has emoji. – Churchless 16/10, 2020 at 12:38

why not just remove all numbers then do the check? – Tierratiersten 16/10, 2020 at 12:40

The question is not how to fix it (here is a fix), the question is why. Else, let's close it. – Etti 16/10, 2020 at 12:43

@WiktorStribiżew indeed, I am asking why, also I don't want to use one of these range-based regex because they're extremely long, unreadable, magic, and not resilient to the adding of new emojis – Atmospherics 16/10, 2020 at 13:18

I think the answer is here and all thread after that post. This is not a bug. # and 0-9 are Emoji characters with a text representation by default, per the Unicode Standard. – Etti 16/10, 2020 at 13:25

This post goes into more detail and you probably can use the /\p{Extended_Pictographic}/u regex to match emojis except for some keycap base characters that are still emojis. – Etti 16/10, 2020 at 13:35
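As a rough illustration of that suggestion, here is a minimal sketch. The helper name `hasPictographic` is made up for this example; it only checks for at least one Extended_Pictographic code point, so digits are ignored but keycap sequences are missed too:

```javascript
// Hypothetical helper: true if the string contains at least one
// Extended_Pictographic code point.
const hasPictographic = str => /\p{Extended_Pictographic}/u.test(str);

console.log(hasPictographic('flowers'));          // false
console.log(hasPictographic('flowers 123'));      // false: digits are not Extended_Pictographic
console.log(hasPictographic('123 flowers 🌼🌺🌸')); // true
console.log(hasPictographic('1\uFE0F\u20E3'));    // false: the keycap sequence 1️⃣ is missed
```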

NOTE: To match any emoji in contemporary JavaScript code, you can use the RGI_Emoji property of strings together with the v flag:

// EXTRACT:
console.log( 'flowers 🌼🌺🌸'.match(/\p{RGI_Emoji}/vg) ); // => ['🌼', '🌺', '🌸']
// TEST IF PRESENT:
console.log( /\p{RGI_Emoji}/v.test('flowers 🌼🌺🌸') ); // => true
// COUNT:
console.log( 'flowers 🌼🌺🌸'.match(/\p{RGI_Emoji}/vg).length ); // => 3

The answer to the current question

According to this post, digits, #, * and some other chars have the Emoji property set to Yes, which means digits are considered valid emoji chars. They are also listed as Emoji_Component, together with helper chars such as ZWJ:

0023          ; Emoji_Component      #  1.1  [1] (#️)       number sign
002A          ; Emoji_Component      #  1.1  [1] (*️)       asterisk
0030..0039    ; Emoji_Component      #  1.1 [10] (0️..9️)    digit zero..digit nine
200D          ; Emoji_Component      #  1.1  [1] (‍)        zero width joiner
20E3          ; Emoji_Component      #  3.0  [1] (⃣)       combining enclosing keycap
FE0F          ; Emoji_Component      #  3.2  [1] ()        VARIATION SELECTOR-16
1F1E6..1F1FF  ; Emoji_Component      #  6.0 [26] (🇦..🇿)    regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF  ; Emoji_Component      #  8.0  [5] (🏻..🏿)    light skin tone..dark skin tone
1F9B0..1F9B3  ; Emoji_Component      # 11.0  [4] (🦰..🦳)    red-haired..white-haired
E0020..E007F  ; Emoji_Component      #  3.1 [96] (󠀠..󠁿)      tag space..cancel tag

For example, 1 is a digit, but it becomes an emoji when combined with U+FE0F and U+20E3 chars: 1️⃣:

console.log("1\uFE0F\u20E3 2\uFE0F\u20E3 3\uFE0F\u20E3 4\uFE0F\u20E3 5\uFE0F\u20E3 6\uFE0F\u20E3 7\uFE0F\u20E3 8\uFE0F\u20E3 9\uFE0F\u20E3 0\uFE0F\u20E3")
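To check which emoji-related properties a given character carries, something along these lines may help. It is only a sketch: the `candidateProps` list and the `emojiPropsOf` helper are names invented for this example, and it expects a single code point as input.

```javascript
// Hypothetical helper: list which of a few Unicode properties
// a single code point has, by testing each property escape in turn.
const candidateProps = [
  'Emoji',
  'Emoji_Presentation',
  'Emoji_Component',
  'Extended_Pictographic',
  'Number',
];
const emojiPropsOf = ch =>
  candidateProps.filter(p => new RegExp(`^\\p{${p}}$`, 'u').test(ch));

console.log(emojiPropsOf('1'));  // [ 'Emoji', 'Emoji_Component', 'Number' ]
console.log(emojiPropsOf('🌼')); // [ 'Emoji', 'Emoji_Presentation', 'Extended_Pictographic' ]
```

So a digit has the Emoji property but no default emoji presentation, which is why it only renders as an emoji inside a keycap sequence.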

Atherton answered 16/10, 2020 at 21:32 Comment(6)

Thanks for the effort on the answer, if you could tell me why the Unicode Consortium considers 0123456789#* as emojis that'd be perfect! – Atmospherics 17/10, 2020 at 15:27

@NinoFiliu I added a demo showing how 1 turns into an emoji. – Etti 17/10, 2020 at 17:15

Note that if you use this regex to remove emojis from strings (e.g.

'❌🙅‍♀️🙅‍♂️🙅😤😠😡'.replace(/[\p{Extended_Pictographic}\u{1F3FB}-\u{1F3FF}\u{1F9B0}-\u{1F9B3}]/gu, '')

), there will be some leftover characters in the string (above resulting string has length 4). For this use case, I ended up using the emoji-regex npm package to match them. – Cold 7/3, 2023 at 6:51

@Cold I have an all-embracing regex for Emojis V14.0, but I need to update it for the current 15.1. This answer is more about the \p{Emoji} construct. People just freak out when they see long regex patterns, so I tried to come up with something based on the Unicode category classes that is short and good enough. – Etti 7/3, 2023 at 8:18

@WiktorStribiżew I agree, I think your solution is short and good enough for checking for the presence of emojis in a string. Furthermore, your answer is relevant to the question (you answered why) while mine isn’t (I suggested an npm package for a particular use case). However, I added the comment above here because this StackOverflow post came up first on Google when I try to debug the problem where .replace(/\p{Emoji}/gu, '') deleted the numbers. – Cold 8/3, 2023 at 9:13

one way to think about this is that \p{Emoji} means "can this ever be part of an emoji" not "is this always an emoji". so it would be useful for eg checking whether a string is entirely composed of emoji – Chomp 30/8, 2023 at 13:59

One of the problems with using \p{Emoji} is that Unicode defines Emoji as a character property, meaning it only matches individual code points. As a result, \p{Emoji} can appear to solve your problem as long as you only test it against single-code-point emoji such as 🫱 (U+1FAF1).

However, the vast majority of emoji defined by Unicode consist of multiple code points, and thus cannot be matched by \p{Emoji}. For example: 🫱🏿‍🫲🏻 (U+1FAF1 U+1F3FF U+200D U+1FAF2 U+1F3FB).

const reEmojiCharacter = /^\p{Emoji}$/u;
reEmojiCharacter.test('🫱'); // → true
reEmojiCharacter.test('🫱🏿‍🫲🏻'); // → false
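To make the "multiple code points" part concrete, the handshake above can be split into its code points. This is just an illustrative sketch; the string is written with escapes so the sequence is unambiguous, and the `handshake` and `codePoints` names are made up here.

```javascript
// The handshake emoji from above, spelled out as escapes:
// hand + skin tone + ZWJ + hand + skin tone.
const handshake = '\u{1FAF1}\u{1F3FF}\u200D\u{1FAF2}\u{1F3FB}';

// Spreading a string iterates by code point, not by UTF-16 code unit.
const codePoints = [...handshake].map(c => c.codePointAt(0).toString(16));

console.log(codePoints);        // [ '1faf1', '1f3ff', '200d', '1faf2', '1f3fb' ]
console.log(codePoints.length); // 5
```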

Luckily, Unicode defines several properties of strings, which — you guessed it — are not restricted to just 1 code point at a time. The property of strings called RGI_Emoji includes all emoji that are officially recommended for general interchange, and is likely what you really want instead of Emoji.

In JavaScript regular expressions, you can use properties of strings when enabling the v flag.

const reEmoji = /^\p{RGI_Emoji}$/v;
reEmoji.test('🫱'); // → true
reEmoji.test('🫱🏿‍🫲🏻'); // → true
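For the "remove emoji but keep the digits" use case, a sketch along these lines might work. The names `buildEmojiRegex` and `stripEmoji` are hypothetical, and the fallback to \p{Extended_Pictographic} for engines without the v flag is an assumption of this example; that fallback can leave modifiers and joiners behind for multi-code-point emoji.

```javascript
// Hypothetical helper: use \p{RGI_Emoji} when the engine supports the `v` flag.
// The RegExp constructor throws a SyntaxError on engines that do not.
const buildEmojiRegex = () => {
  try {
    return new RegExp('\\p{RGI_Emoji}', 'gv');
  } catch {
    // Assumed fallback: single pictographic code points only.
    return /\p{Extended_Pictographic}/gu;
  }
};

// Hypothetical helper: remove emoji without touching digits, # or *.
const stripEmoji = str => str.replace(buildEmojiRegex(), '');

console.log(JSON.stringify(stripEmoji('room 123 🌼🌺🌸'))); // "room 123 "
```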

Resistor answered 19/5, 2023 at 13:0 Comment(1)

Nice catch! I added your answer as a "see also" link of this answer – Atmospherics 22/5, 2023 at 8:42
