Node.js Emoji Parsing

I'm trying to parse an incoming string to determine whether it contains any non-emojis.

I've gone through this great article by Mathias and am leveraging both native punycode for the encoding / decoding and regenerate for the regex generation. I'm also using EmojiData to get my dictionary of emojis.

With that all said, certain emojis continue to be pesky little buggers and refuse to match. For certain emoji, I continue to get a pair of code points.

// Example of a single code point:
console.log(punycode.ucs2.decode('💩'));
>> [ 128169 ]

// Example of a paired code point:
console.log(punycode.ucs2.decode('⌛️'));
>> [ 8987, 65039 ]

Mathias touches on this in his article (and gives an example of punycode working around this) but even using his example I get an incorrect response:

function countSymbols(string) {
  return punycode.ucs2.decode(string).length;
}
console.log(countSymbols('💩'));
>> 1
console.log(countSymbols('⌛️'));
>> 2

What is the best way to detect whether a string consists entirely of emoji? This is for a proof of concept, so the solution can be as brute force as need be.

--- UPDATE ---

A little more context on my pesky emoji above.

These are visually identical but are in fact different Unicode sequences (the second one is from the example above):

⌛ // \u231b

⌛️ // \u231b\ufe0f

The first one works great, the second does not. Unfortunately, the second version is what iOS seems to use (if you copy and paste from iMessage you get the second one, and when receiving a text from Twilio, same thing).
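
To make the difference between the two strings visible, the code points can be printed in hex. This is only a minimal sketch using built-in string iteration instead of punycode; the helper name toHexCodePoints is made up for illustration.

```javascript
// Hypothetical helper: return the code points of a string as hex strings.
// Array.from iterates a string by code point, so surrogate pairs stay together.
function toHexCodePoints(string) {
  return Array.from(string, function (symbol) {
    return symbol.codePointAt(0).toString(16);
  });
}

console.log(toHexCodePoints('\u231b'));       // [ '231b' ]
console.log(toHexCodePoints('\u231b\ufe0f')); // [ '231b', 'fe0f' ]
console.log(toHexCodePoints('\ud83d\udca9')); // [ '1f4a9' ]
```

The trailing fe0f in the second result is the extra code point that the iOS version of the hourglass carries.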

Accusatory answered 24/9, 2015 at 21:25 Comment(2)
So it would appear that combining marks (that extra bit of unicode on the second example) are what's tripping things up here. I'm looking into how to best get rid of these elements from my string. – Accusatory
If anyone ever runs into a similar use case, I packaged this all up into an npm module: github.com/scottlabs/emojiExists – Accusatory

U+FE0F is not a combining mark; it is a variation selector. Together with the character before it, it forms a variation sequence that controls how the glyph is rendered (see this answer). Removing the selector, or swapping it for another one, may change the appearance of the character, for example: U+231B+U+FE0E (⌛︎).
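
If the goal is only matching, not display, one possible workaround is to drop the variation selectors before comparing. The sketch below assumes that only U+FE0E and U+FE0F matter for this use case; stripVariationSelectors is a hypothetical helper name, not part of any library mentioned here.

```javascript
// Matches the two emoji-related variation selectors:
// U+FE0E (text presentation) and U+FE0F (emoji presentation).
var VARIATION_SELECTORS = /[\uFE0E\uFE0F]/g;

// Hypothetical helper: remove the selectors so that both spellings
// of a character compare equal to its bare code point.
function stripVariationSelectors(string) {
  return string.replace(VARIATION_SELECTORS, '');
}

console.log(stripVariationSelectors('\u231b\ufe0f') === '\u231b'); // true
console.log(stripVariationSelectors('\u231b\ufe0e') === '\u231b'); // true
console.log(Array.from(stripVariationSelectors('\u231b\ufe0f')).length); // 1
```

The stripped string loses the requested presentation, so it is better used for lookups than sent back to the user.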

Also, emoji sequences can be made from multiple code points. For example, U+0032 (2) is not an emoji by itself, but U+0032+U+20E3 (2⃣) or U+0032+U+20E3+U+FE0F (2⃣️) is, whereas U+0041+U+20E3 (A⃣) isn't. A complete list of emoji sequences is maintained in the emoji-data.txt file by the Unicode Consortium (the emoji-data-js library appears to have this information).

To check if a string contains emoji characters, you will need to test if any single character is in emoji-data.txt, or starts a substring for a sequence in it.
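
A rough sketch of that check, for the stricter case where the whole string must be emoji, could look like the following. The EMOJI_SEQUENCES table is a tiny hand-written stand-in built from the examples in this thread (a real one would be generated from the Unicode data), and isAllEmoji is a hypothetical name. The scan is greedy, trying longer sequences first, and does not backtrack.

```javascript
// Hypothetical stand-in for the real table; each entry is one full sequence.
var EMOJI_SEQUENCES = [
  '\u231b',         // hourglass
  '\u231b\ufe0f',   // hourglass + emoji presentation selector
  '\ud83d\udca9',   // U+1F4A9
  '2\u20e3',        // digit two + combining enclosing keycap
  '2\u20e3\ufe0f'   // same, followed by U+FE0F
];

// Longest entries first, so a sequence wins over its own prefix.
var BY_LENGTH = EMOJI_SEQUENCES.slice().sort(function (a, b) {
  return b.length - a.length;
});

// Returns true only if the whole (non-empty) string can be consumed
// by entries from the table, scanning left to right.
function isAllEmoji(string) {
  var index = 0;
  while (index < string.length) {
    var match = null;
    for (var i = 0; i < BY_LENGTH.length; i++) {
      if (string.startsWith(BY_LENGTH[i], index)) {
        match = BY_LENGTH[i];
        break;
      }
    }
    if (match === null) {
      return false; // the text at this position is not in the table
    }
    index += match.length;
  }
  return string.length > 0;
}

console.log(isAllEmoji('\u231b\ufe0f\ud83d\udca9')); // true
console.log(isAllEmoji('2\u20e3'));                  // true
console.log(isAllEmoji('A\u20e3'));                  // false
console.log(isAllEmoji('2'));                        // false
```

Because multi-code-point sequences are tried before single code points, a bare "2" is rejected while the keycap sequence is accepted.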

Desmoid answered 24/9, 2015 at 23:26 Comment(1)
Thanks for your help. I'm now looking for pairs of code points first, followed by individual code points, and that's working for my use case. – Accusatory

If, hypothetically, you know which non-emoji characters you expect to run into, you can use a little lodash magic via its toArray or split functions, which are emoji aware. For example, if you want to see whether a string contains alphanumeric characters, you could write a function like so:

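var _ = require('lodash');

// Returns true if any symbol in the string contains an ASCII letter or digit.
// _.toArray splits the string into symbols rather than UTF-16 code units.
function containsAlphaNumeric(string) {
  return _.toArray(string).some(function (char) {
    return /[a-zA-Z0-9]/.test(char);
  });
}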
Flocculus answered 6/9, 2017 at 22:44 Comment(0)
