I need to split emoji from each other. For example:
EM = 'Hey 😷😷😷'
EM.split()
Splitting it gives
['Hey', '😷😷😷']
I want to have
['Hey', '😷', '😷', '😷']
and I want this to work for all emoji.
You should be able to use get_emoji_regexp
from the emoji package (https://pypi.org/project/emoji/), together with the usual split
function. So something like:
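import functools
import operator
import emoji
em = 'Hey 😷😷😷'
# Split on emoji; the emoji themselves are kept in the result.
em_split_emoji = emoji.get_emoji_regexp().split(em)
# Split the remaining text pieces on whitespace (this also drops empty strings).
em_split_whitespace = [substr.split() for substr in em_split_emoji]
# Flatten the list of lists into a single list.
em_split = functools.reduce(operator.concat, em_split_whitespace)
print(em_split)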
outputs:
['Hey', '😷', '😷', '😷']
A more complex case, with family, skin tone modifiers, and a flag:
em = 'Hey 👨‍👩‍👧‍👧👨🏿😷😷🇬🇧'
em_split_emoji = emoji.get_emoji_regexp().split(em)
em_split_whitespace = [substr.split() for substr in em_split_emoji]
em_split = functools.reduce(operator.concat, em_split_whitespace)
for separated in em_split:
    print(separated)
outputs:
Hey
👨‍👩‍👧‍👧
👨🏿
😷
😷
🇬🇧
(Printing the list itself shows the repr of each string, and Python's repr escapes the zero-width joiner in the family emoji as \u200d, hence printing each item of the list separately. See "Printing family emoji, with U+200D zero-width joiner, directly, vs via list".)
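As far as I know, get_emoji_regexp was removed from the emoji package in version 2.0, so the snippets above may fail on a current install. Below is a rough standard-library sketch of the same idea. The code point ranges and the names EMOJI_RE and split_emoji are my own assumptions for illustration. They only approximate the real emoji list, so treat this as a starting point rather than a complete solution.

```python
import re

# Approximate ranges only (an assumption for this sketch, not the full Unicode emoji list).
_BASE = r'[\U0001F300-\U0001FAFF\u2600-\u27BF]'   # most pictographs and symbols
_FLAG = r'[\U0001F1E6-\U0001F1FF]{2}'             # a pair of regional indicators
_MOD = r'[\U0001F3FB-\U0001F3FF\uFE0F]'           # skin tones, variation selector-16
_UNIT = _BASE + _MOD + r'*'

# A flag, or one emoji optionally joined to further emoji by U+200D (ZWJ).
# The outer capturing group makes re.split keep the matched emoji.
EMOJI_RE = re.compile('(' + _FLAG + '|' + _UNIT + r'(?:\u200D' + _UNIT + r')*)')


def split_emoji(text):
    """Split text on whitespace and return each emoji sequence as its own item."""
    parts = []
    for chunk in EMOJI_RE.split(text):
        if EMOJI_RE.fullmatch(chunk):
            parts.append(chunk)          # a whole emoji sequence
        else:
            parts.extend(chunk.split())  # plain text; empty chunks vanish here
    return parts


for item in split_emoji('Hey 👨‍👩‍👧‍👧👨🏿😷😷🇬🇧'):
    print(item)
```

The pattern treats a ZWJ sequence, a base emoji plus skin tone, and a flag pair each as one unit, which is what keeps the family emoji and the flag in one piece.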
from tokenize import tokenize, NUMBER, STRING, NAME
from io import BytesIO
sample_text = 'Hey 12 👨‍👩‍👧‍👧👨🏿😷😷🇬🇧'
buffer = BytesIO(sample_text.encode("UTF-8"))
result = []
tokens = tokenize(buffer.readline)
# Keep only name, number and string tokens; everything else, emoji included, is dropped.
for token_identity, token_value, *extra in tokens:
    if token_identity in (NAME, NUMBER, STRING):
        result.append(token_value)
sample_text_de_emoji = " ".join(result)
If the emoji is 4 bytes long in UTF-8, the first byte is hex Fx. Regexp: f[0-7]
If the emoji is 3 bytes long, the first byte is hex Ex. Regexp: e[0-9a-f]
Here 'x' stands for some other hex digit.
Examples:
😁 is hex f0 9f 98 81
☺ is hex e2 98 ba
Each byte after the first matches this regexp: [89ab][0-9a-f]
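The byte patterns above can be turned into a small Python 3 sketch. The names UTF8_MULTIBYTE and split_multibyte are made up for this example. Note that the pattern matches any 3- or 4-byte UTF-8 character, not only emoji, and it works per code point, so ZWJ sequences and flags would be broken apart.

```python
import re

# Lead byte F0-F7 followed by three continuation bytes (80-BF),
# or lead byte E0-EF followed by two continuation bytes.
UTF8_MULTIBYTE = re.compile(
    rb'[\xf0-\xf7][\x80-\xbf]{3}|[\xe0-\xef][\x80-\xbf]{2}'
)


def split_multibyte(text):
    """Return every 3- or 4-byte UTF-8 character in text as its own string."""
    data = text.encode('utf-8')
    return [m.group().decode('utf-8') for m in UTF8_MULTIBYTE.finditer(data)]


print(split_multibyte('Hey 😷😷 😁☺'))
```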
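It seems these emoji are 4 bytes long in UTF-8, so you can simply cut the byte string every 4 bytes. This only works when every emoji in the run is a single 4-byte code point. Here's some code (Python 2):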
text = 'Hey \xf0\x9f\x98\xb7\xf0\x9f\x98\xb7\xf0\x9f\x98\xb7'
print text
print 'text.split()=%s' % text.split()
emojis_str = text.split()[1]
emojis_list = [emojis_str[i:i+4] for i in range(0, len(emojis_str), 4)]
print 'emojis_list=%s' % emojis_list
for em in emojis_list:
    print 'emoji: %s' % em
will output
$ python em.py
Hey 😷😷😷
text.split()=['Hey', '\xf0\x9f\x98\xb7\xf0\x9f\x98\xb7\xf0\x9f\x98\xb7']
emojis_list=['\xf0\x9f\x98\xb7', '\xf0\x9f\x98\xb7', '\xf0\x9f\x98\xb7']
emoji: 😷
emoji: 😷
emoji: 😷
$
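The code above is Python 2, where a str is a byte string. A rough Python 3 equivalent, still assuming every emoji in the run is exactly 4 bytes in UTF-8, might look like this (chunk_four_bytes is a name I made up):

```python
def chunk_four_bytes(emojis_str):
    """Cut a run of 4-byte emoji into single emoji by slicing the UTF-8 bytes."""
    data = emojis_str.encode('utf-8')
    return [data[i:i + 4].decode('utf-8') for i in range(0, len(data), 4)]


text = 'Hey \U0001F637\U0001F637\U0001F637'  # U+1F637 is the 😷 emoji
emojis_list = chunk_four_bytes(text.split()[1])
for em in emojis_list:
    print('emoji: %s' % em)
```

In Python 3 a str is already a sequence of code points, so for emoji that are a single code point each, list(text.split()[1]) gives the same result without going through bytes.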
from itertools import chain
list(chain.from_iterable([x] if x.isalpha() else list(x) for x in EM.split()))
- Subtraction