How do I detect an emoji in a Snowflake VARCHAR?
Asked Answered
G

2

7

I have a 1B row table of chat messages with a max-sized VARCHAR column for the raw message text. How do I -

  1. Select just the rows that contain 1 or more emoji characters
  2. Efficiently filter out rows without emoji (if needed to improve performance)
Galatians answered 11/11, 2021 at 16:45 Comment(1)
Welcome to Stack Overflow. Please take the 2-minute tour. Moreover, open Help center and read at least How to Ask. Then, edit your question to provide a minimal reproducible example. Questions that only ask for code are too broad and are likely to be put on hold or closed.Organza
E
6

Combining the knowledge shared by Lukasz and Greg, plus discovering an additional hidden character, I can present this solution:

  • 1 JS UDF that detects the presence of emojis.
  • 1 JS UDF that detects if the string is only emojis.
create or replace function has_emoji(X text)
returns boolean
language javascript
as $$
return /\p{Extended_Pictographic}/u.test(X);
$$;

create or replace function only_emoji(X text)
returns boolean
language javascript
as $$
return /^[\p{Extended_Pictographic}\p{Emoji_Component}]+$/u.test(X);
$$;

Sample use:

with data as (
  select $1 t
  from values('❄️'),('❄️ is the data ☁️'),('no emoji')
)

select *, has_emoji(t), only_emoji(t)
from data
;

enter image description here

--

Some research:

  • '❄️'.replace(/\p{Extended_Pictographic}/ug, '') returns ''
  • but '❄️'.replace(/\p{Extended_Pictographic}/ug, '') is not equal to ''
  • Turns out there are hidden characters that still need to be replaced, and matched by Emoji_Component

enter image description here

Encrimson answered 12/11, 2021 at 0:18 Comment(0)
B
0

Using Snowflake's JavaScript User Defined Function:

CREATE OR REPLACE FUNCTION EMOJI_TEST (STR VARCHAR)
  RETURNS VARCHAR
  LANGUAGE JAVASCRIPT
  AS $$
  return (/\p{Emoji}/u.test(STR));
  $$
  ;

Query:

WITH cte(col) AS (
    SELECT '👌'   UNION ALL
    SELECT 'text' UNION ALL
    SELECT 'text 🐒'
)
SELECT col, EMOJI_TEST(col)
FROM cte
-- WHERE EMOJI_TEST(col)::BOOLEAN

Output:

enter image description here

Related: How to detect emoji using javascript

Brilliantine answered 11/11, 2021 at 17:18 Comment(7)
That works, but I'm having trouble detecting only emoji. If /\p{Emoji}/u.test('❄️') testing for emoji works, why not /^\p{Emoji}$/u.test('❄️') for only emoji?Encrimson
@FelipeHoffa It is strange ,the beginning of string works /^\p{Emoji}/u.test('❄️') so it is somehow connected with end of string and $(though that works for exact match /^❄️$/u.test('❄️')Brilliantine
Indeed... I had an answer similar to this one, but then got stuck with the "only emoji" requirementEncrimson
@FelipeHoffa Maybe you could rewrite it as Java UDF :)Brilliantine
JS makes it so easy with \p{Emoji} tho..Encrimson
Although the OP is asking about "emoji" the specification has a lot of things in it that most people don't think of as emojis: unicode.org/Public/emoji/11.0/emoji-data.txt. For example using {Emoji} it will return true for #, the expression "Flowers 123" etc. It would be more accurate to change the RegEx code to {Extended_Pictographic}Ingratiate
Extended_Pictographic + Emoji_Component finally solved the mystery... https://mcmap.net/q/1750969/-how-do-i-detect-an-emoji-in-a-snowflake-varcharEncrimson

© 2022 - 2024 — McMap. All rights reserved.