Why do Unicode emoji property escapes match numbers?

I found this awesome way to detect emojis using a regex that doesn't use "huge magic ranges" by using a Unicode property escape:

console.log(/\p{Emoji}/u.test('flowers 🌼🌺🌸')) // true
console.log(/\p{Emoji}/u.test('flowers')) // false

But when I shared this knowledge in this answer, @Bronzdragon noticed that \p{Emoji} also matches numbers! Why is that? Numbers are not emojis?

console.log(/\p{Emoji}/u.test('flowers 123')) // unexpectedly true

// regex-only workaround by @Bronzdragon
const regex = /(?=\p{Emoji})(?!\p{Number})/u;
console.log(
  regex.test('flowers'), // false, as expected
  regex.test('flowers 123'), // false, as expected
  regex.test('flowers 123 🌼🌺🌸'), // true, as expected
  regex.test('flowers 🌼🌺🌸'), // true, as expected
)

// more readable workaround
const hasEmoji = str => {
  const nbEmojiOrNumber = (str.match(/\p{Emoji}/gu) || []).length;
  const nbNumber = (str.match(/\p{Number}/gu) || []).length;
  return nbEmojiOrNumber > nbNumber;
}
console.log(
  hasEmoji('flowers'), // false, as expected
  hasEmoji('flowers 123'), // false, as expected
  hasEmoji('flowers 123 🌼🌺🌸'), // true, as expected
  hasEmoji('flowers 🌼🌺🌸'), // true, as expected
)
Atmospherics answered 16/10, 2020 at 12:35 Comment(6)
Note that the workaround also fails for '123 flowers 🌼🌺🌸' for example - that should return true, as it definitely has emoji. – Churchless
Why not just remove all numbers and then do the check? – Tierratiersten
The question is not how to fix it (here is a fix), the question is why. Else, let's close it. – Etti
@WiktorStribiżew indeed, I am asking why. Also, I don't want to use one of these range-based regexes because they're extremely long, unreadable, magic, and not resilient to the addition of new emojis – Atmospherics
I think the answer is here and in the thread following that post. This is not a bug. # and 0-9 are Emoji characters with a text representation by default, per the Unicode Standard. – Etti
This post goes into more detail, and you can probably use the /\p{Extended_Pictographic}/u regex to match emojis, except for some keycap base characters that are still emojis. – Etti
A
19

NOTE: To match any emoji in contemporary JavaScript code, you may use \p{RGI_Emoji} with the v flag:

// EXTRACT:
console.log( 'flowers 🌼🌺🌸'.match(/\p{RGI_Emoji}/vg) ); // => ['🌼', '🌺', '🌸']
// TEST IF PRESENT:
console.log( /\p{RGI_Emoji}/v.test('flowers 🌼🌺🌸') ); // => true
// COUNT:
console.log( 'flowers 🌼🌺🌸'.match(/\p{RGI_Emoji}/vg).length ); // => 3
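
The v flag is fairly recent, and an engine that does not support it rejects such a regex with a SyntaxError. Below is a minimal sketch of feature detection, assuming you are willing to fall back to \p{Extended_Pictographic} (the property suggested in the comments on the question). The helper name makeEmojiRegex is hypothetical, and the fallback is only an approximation.

```javascript
// Hypothetical helper: prefer \p{RGI_Emoji} (needs the v flag), else fall back.
function makeEmojiRegex() {
  try {
    // The RegExp constructor throws a SyntaxError at runtime when the v flag
    // is unsupported, whereas a /.../v literal would fail at parse time.
    return new RegExp('\\p{RGI_Emoji}', 'vg');
  } catch {
    // Approximation: misses keycap emoji and splits multi-code-point emoji.
    return new RegExp('\\p{Extended_Pictographic}', 'gu');
  }
}

const emojiRegex = makeEmojiRegex();
const found = 'flowers 123 🌼🌺🌸'.match(emojiRegex) || [];
console.log(found.length); // 3 with either branch; the digits are not counted
```

Both branches agree on this input because each flower is a single code point; they can differ on sequences such as flags or skin-tone variants.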

The answer to the current question

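According to this post, digits, #, * and some other chars have the Emoji property set to Yes, which is why \p{Emoji} treats digits as valid emoji chars. They are also listed as Emoji_Component, together with ZWJ and the other chars used to build emoji sequences: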

0023          ; Emoji_Component      #  1.1  [1] (#️)       number sign
002A          ; Emoji_Component      #  1.1  [1] (*️)       asterisk
0030..0039    ; Emoji_Component      #  1.1 [10] (0️..9️)    digit zero..digit nine
200D          ; Emoji_Component      #  1.1  [1] (‍)        zero width joiner
20E3          ; Emoji_Component      #  3.0  [1] (⃣)       combining enclosing keycap
FE0F          ; Emoji_Component      #  3.2  [1] (️)        VARIATION SELECTOR-16
1F1E6..1F1FF  ; Emoji_Component      #  6.0 [26] (🇦..🇿)    regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF  ; Emoji_Component      #  8.0  [5] (🏻..🏿)    light skin tone..dark skin tone
1F9B0..1F9B3  ; Emoji_Component      # 11.0  [4] (🦰..🦳)    red-haired..white-haired
E0020..E007F  ; Emoji_Component      #  3.1 [96] (󠀠..󠁿)      tag space..cancel tag

For example, 1 is a digit, but it becomes an emoji when combined with the U+FE0F and U+20E3 chars (1️⃣):

console.log("1\uFE0F\u20E3 2\uFE0F\u20E3 3\uFE0F\u20E3 4\uFE0F\u20E3 5\uFE0F\u20E3 6\uFE0F\u20E3 7\uFE0F\u20E3 8\uFE0F\u20E3 9\uFE0F\u20E3 0\uFE0F\u20E3")
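
To see how a bare digit differs from its keycap form, the following sketch queries a few of the emoji-related binary properties that JavaScript exposes under the u flag. The variable names are only illustrative.

```javascript
// A bare digit has Emoji and Emoji_Component set, but text presentation by default.
const digit = '1';
const checks = {
  emoji: /\p{Emoji}/u.test(digit),
  component: /\p{Emoji_Component}/u.test(digit),
  presentation: /\p{Emoji_Presentation}/u.test(digit),
  pictographic: /\p{Extended_Pictographic}/u.test(digit),
};
console.log(checks.emoji);        // true
console.log(checks.component);    // true
console.log(checks.presentation); // false
console.log(checks.pictographic); // false

// The keycap emoji is a sequence of three code points: digit, U+FE0F, U+20E3.
const keycap = '1\uFE0F\u20E3';
const codePoints = [...keycap];
console.log(codePoints.length); // 3

// Only the digit itself has the Emoji property; the other two are components only.
const perCodePoint = codePoints.map(c => /\p{Emoji}/u.test(c));
console.log(perCodePoint); // [ true, false, false ]
```

This is why \p{Extended_Pictographic} skips plain digits, and also why it cannot recognise keycap emoji on its own.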
Atherton answered 16/10, 2020 at 21:32 Comment(6)
Thanks for the effort on the answer, if you could tell me why the Unicode Consortium considers 0123456789#* as emojis that'd be perfect! – Atmospherics
@NinoFiliu I added a demo showing how 1 turns into an emoji. – Etti
Note that if you use this regex to remove emojis from strings (e.g. '❌🙅‍♀️🙅‍♂️🙅😤😠😡'.replace(/[\p{Extended_Pictographic}\u{1F3FB}-\u{1F3FF}\u{1F9B0}-\u{1F9B3}]/gu, '')), there will be some leftover characters in the string (the resulting string above has length 4). For this use case, I ended up using the emoji-regex npm package to match them. – Cold
@Cold I have an all-embracing regex for Emojis V14.0, but I need to update it for the current 15.1. This answer is more about the \p{Emoji} construct. People just freak out when they see long regex patterns, so I tried to come up with something based on the Unicode category classes that is short and good enough. – Etti
@WiktorStribiżew I agree, I think your solution is short and good enough for checking for the presence of emojis in a string. Furthermore, your answer is relevant to the question (you answered why) while mine isn't (I suggested an npm package for a particular use case). However, I added the comment above because this StackOverflow post came up first on Google when I tried to debug the problem where .replace(/\p{Emoji}/gu, '') deleted the numbers. – Cold
One way to think about this is that \p{Emoji} means "can this ever be part of an emoji", not "is this always an emoji". So it would be useful for, e.g., checking whether a string is entirely composed of emoji – Chomp
R
5

One of the problems with using \p{Emoji} is that Unicode defines Emoji as a character property, meaning it only matches individual code points. As a result, \p{Emoji} might seem to solve your problem as long as you only test it against single-code-point emoji such as 🫱 (U+1FAF1), but that's misleading.

The vast majority of emoji defined by Unicode consist of multiple code points, and thus cannot be matched as a whole by \p{Emoji}. For example: 🫱🏿‍🫲🏻 (U+1FAF1 U+1F3FF U+200D U+1FAF2 U+1F3FB).

const reEmojiCharacter = /^\p{Emoji}$/u;
reEmojiCharacter.test('🫱'); // → true
reEmojiCharacter.test('🫱🏿‍🫲🏻'); // → false

Luckily, Unicode defines several properties of strings, which (you guessed it) are not restricted to just one code point at a time. The property of strings called RGI_Emoji includes all emoji that are officially recommended for general interchange, and is likely what you really want instead of Emoji.

In JavaScript regular expressions, you can use properties of strings by enabling the v flag.

const reEmoji = /^\p{RGI_Emoji}$/v;
reEmoji.test('🫱'); // → true
reEmoji.test('🫱🏿‍🫲🏻'); // → true
Resistor answered 19/5, 2023 at 13:0 Comment(1)
Nice catch! I added your answer as a "see also" link in this answer – Atmospherics
