I have a 1B row table of chat messages with a max-sized VARCHAR column for the raw message text. How do I -
- Select just the rows that contain 1 or more emoji characters
- Efficiently filter out rows without emoji (if needed to improve performance)
I have a 1B row table of chat messages with a max-sized VARCHAR column for the raw message text. How do I -
Combining the knowledge shared by Lukasz and Greg, plus discovering an additional hidden character, I can present this solution:
create or replace function has_emoji(X text)
returns boolean
language javascript
as $$
return /\p{Extended_Pictographic}/u.test(X);
$$;
create or replace function only_emoji(X text)
returns boolean
language javascript
as $$
return /^[\p{Extended_Pictographic}\p{Emoji_Component}]+$/u.test(X);
$$;
Sample use:
with data as (
select $1 t
from values('❄️'),('❄️ is the data ☁️'),('no emoji')
)
select *, has_emoji(t), only_emoji(t)
from data
;
--
Some research:
'❄️'.replace(/\p{Extended_Pictographic}/ug, '')
returns ''
'❄️'.replace(/\p{Extended_Pictographic}/ug, '')
is not equal to ''
Emoji_Component
Using Snowflake's JavaScript User Defined Function:
CREATE OR REPLACE FUNCTION EMOJI_TEST (STR VARCHAR)
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
AS $$
return (/\p{Emoji}/u.test(STR));
$$
;
Query:
WITH cte(col) AS (
SELECT '👌' UNION ALL
SELECT 'text' UNION ALL
SELECT 'text 🐒'
)
SELECT col, EMOJI_TEST(col)
FROM cte
-- WHERE EMOJI_TEST(col)::BOOLEAN
Output:
Related: How to detect emoji using javascript
/\p{Emoji}/u.test('❄️')
testing for emoji works, why not /^\p{Emoji}$/u.test('❄️')
for only emoji? –
Encrimson /^\p{Emoji}/u.test('❄️')
so it is somehow connected with end of string and $
(though that works for exact match /^❄️$/u.test('❄️')
–
Brilliantine \p{Emoji}
tho.. –
Encrimson {Emoji}
it will return true for #, the expression "Flowers 123" etc. It would be more accurate to change the RegEx code to {Extended_Pictographic}
–
Ingratiate © 2022 - 2024 — McMap. All rights reserved.