How to split emoji from each other in Python?
S

4

8

I need to split emoji from each other. For example:

EM = 'Hey 😷😷😷'
EM.split()

If we split it we will have

['Hey' ,'😷😷😷']

I want to have

['Hey' , '😷' , '😷' , '😷']

and I want it to be applied to all emojis.

Shoran answered 19/4, 2018 at 12:56 Comment(3)
Emojis are hard! I don't know if there is some readily available library to deal with this in particular, but take a look at this answer about flags for a bit of reference. – Dorsey
list(EM) and then deal with that maybe ? – Reglet
list(chain.from_iterable([x] if x.isalpha() else list(x) for x in EM.split())) – Subtraction
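The comment suggestions above amount to splitting by code point. A minimal, standard-library-only sketch of that idea follows; the helper name split_by_code_point is made up for illustration. Note that each alphabetic word has to be wrapped in a list, otherwise chain.from_iterable would break it into letters as well. This is only a sketch: it works for single-code-point emoji such as 😷, but it breaks composite emoji (flags, families, skin tones) apart.

```python
from itertools import chain

def split_by_code_point(text):
    """Split on whitespace, then break every non-alphabetic token into code points."""
    return list(chain.from_iterable(
        [token] if token.isalpha() else list(token)  # keep words whole
        for token in text.split()
    ))

print(split_by_code_point('Hey 😷😷😷'))

# A flag is two regional-indicator code points, so it is split in two
print(len(split_by_code_point('\U0001F1EC\U0001F1E7')))
```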
J
18

You should be able to use get_emoji_regexp from the emoji package (https://pypi.org/project/emoji/; the function exists in releases before 2.0), together with the usual split function. So something like:

import functools
import operator
import re

import emoji

em = 'Hey 😷😷😷'
em_split_emoji = emoji.get_emoji_regexp().split(em)
em_split_whitespace = [substr.split() for substr in em_split_emoji]
em_split = functools.reduce(operator.concat, em_split_whitespace)

print(em_split)

outputs:

['Hey', '😷', '😷', '😷']

A more complex case, with family, skin tone modifiers, and a flag:

em = 'Hey 👨‍👩‍👧‍👧👨🏿😷😷🇬🇧'
em_split_emoji = emoji.get_emoji_regexp().split(em)
em_split_whitespace = [substr.split() for substr in em_split_emoji]
em_split = functools.reduce(operator.concat, em_split_whitespace)

for separated in em_split:
    print(separated)

outputs:

Hey
👨‍👩‍👧‍👧
👨🏿
😷
😷
🇬🇧

(I think something's up with using print on a list that contains the family emoji, hence printing each item of the list separately. See "Printing family emoji, with U+200D zero-width joiner, directly, vs via list".)
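One likely explanation for the list-printing oddity, offered as an assumption rather than something the answer confirms: printing a list shows the repr of each item, and in Python 3 the repr of a string escapes non-printable characters such as the U+200D zero-width joiner. A small sketch, with the family emoji written as escapes so the joiners are visible in the source:

```python
# Family emoji: man, woman, girl, girl, glued together with zero-width joiners
family = '\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F467'

print(family)        # the string itself, joiners intact
print([family])      # list printing uses repr() on each item
print(repr(family))  # the joiners appear as \u200d escapes

# Seven code points: four people plus three joiners
print(len(family))
```

Printing each item separately, as done above, sidesteps repr entirely.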

Jodiejodo answered 19/4, 2018 at 21:44 Comment(3)
Nice! Does this deal with complex composite emojis? (like emojis with skin color, families, national flags, ...) – Dorsey
@jdehesa Yes, I think so. I have added details to the answer. – Jodiejodo
@MichalCharemza very helpful. Can you please tell me if I can simply add a space between emojis instead of creating a new list? – Sympetalous
C
0
from tokenize import tokenize, NUMBER, STRING, NAME
from io import BytesIO

sample_text = 'Hey 12  👨‍👩‍👧‍👧👨🏿😷😷🇬🇧'
# tokenize() reads bytes line by line, so wrap the encoded text in a buffer
buffer = BytesIO(sample_text.encode("UTF-8"))
result = []
tokens = tokenize(buffer.readline)
for token_identity, token_value, *extra in tokens:
    # Keep only name, number and string tokens; everything else is skipped
    if token_identity in (NAME, NUMBER, STRING):
        result.append(token_value)
# Join the kept tokens back into one string
sample_text_de_emoji = " ".join(result)
Castello answered 10/1, 2021 at 6:19 Comment(0)
S
-2

If the emoji is 4 bytes long in UTF-8, the first byte is hex Fx. Regexp: f[0-7]
If the emoji is 3 bytes long, the first byte is hex Ex. Regexp: e[0-9a-f]

Here 'x' stands for some other hex digit.

Examples:

😁 is hex f0 9f 98 81
☺ is hex e2 98 ba

After the first byte, each of the remaining bytes matches this regexp: [89ab][0-9a-f]
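The byte patterns above can be checked with a short sketch. The helper names hex_bytes and matches_answer_patterns are made up for illustration, and the euro sign is added only as an assumption-free counterexample: these patterns describe any 3- or 4-byte UTF-8 character, so they also match characters that are not emoji.

```python
import re

# Patterns from the answer, applied to two-digit lowercase hex strings
LEAD_4_BYTE = re.compile(r'f[0-7]')
LEAD_3_BYTE = re.compile(r'e[0-9a-f]')
CONTINUATION = re.compile(r'[89ab][0-9a-f]')

def hex_bytes(char):
    """Return the UTF-8 encoding of char as a list of two-digit hex strings."""
    return [format(byte, '02x') for byte in char.encode('utf-8')]

def matches_answer_patterns(char):
    """Check the lead byte and every following byte against the patterns."""
    first, *rest = hex_bytes(char)
    lead = LEAD_4_BYTE if len(rest) == 3 else LEAD_3_BYTE
    return bool(lead.fullmatch(first)) and all(
        CONTINUATION.fullmatch(b) for b in rest
    )

# The euro sign is not an emoji, yet it fits the 3-byte pattern too
for char in ('😁', '☺', '€'):
    print(char, ' '.join(hex_bytes(char)), matches_answer_patterns(char))
```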

Sinegold answered 4/5, 2018 at 6:9 Comment(0)
F
-4

It seems like emojis are 4 bytes long in UTF-8, so you can simply cut the byte string every 4 bytes. Here's some code for you (Python 2, where a str is a byte string):

text = 'Hey \xf0\x9f\x98\xb7\xf0\x9f\x98\xb7\xf0\x9f\x98\xb7'

print text
print 'text.split()=%s' % text.split()

emojis_str = text.split()[1]
emojis_list = [emojis_str[i:i+4] for i in range(0, len(emojis_str), 4)]

print 'emojis_list=%s' % emojis_list

for em in emojis_list:
    print 'emoji: %s' % em

will output

$ python em.py
Hey 😷😷😷
text.split()=['Hey', '\xf0\x9f\x98\xb7\xf0\x9f\x98\xb7\xf0\x9f\x98\xb7']
emojis_list=['\xf0\x9f\x98\xb7', '\xf0\x9f\x98\xb7', '\xf0\x9f\x98\xb7']
emoji: 😷
emoji: 😷
emoji: 😷
$
Fizz answered 19/4, 2018 at 13:16 Comment(2)
Not a dependable method at all. – Mahican
I think emojis are not all 4 bytes long: they can vary tremendously. – Jodiejodo
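To illustrate the point made in the comments, here is a minimal Python 3 sketch that measures a few emoji. The helper name utf8_len and the choice of samples are assumptions for illustration; the emoji are written as escapes so the code points are explicit.

```python
def utf8_len(text):
    """Number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode('utf-8'))

samples = {
    'mask': '\U0001F637',                # one code point
    'smile': '\u263A',                   # one code point in the BMP
    'flag': '\U0001F1EC\U0001F1E7',      # two regional indicators
    'family': '\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F467',
}

for name, emoji_text in samples.items():
    # Columns: name, code points, UTF-8 bytes
    print(name, len(emoji_text), utf8_len(emoji_text))
```

Cutting every 4 bytes would therefore slice the smile, the flag and the family in the wrong places.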
