How do I find and remove emojis in a text file?
I'm trying to remove all the emojis from a text file that I'm parsing, mostly with sed and some perl commands. Preferably I'd also store them in a separate file, but this isn't necessary.

Can I do this easily with bash or perl? Or should I use another language?

EDIT: Thank you to Cyrus and Barmar for pointing me in the right direction, towards this question. However, it doesn't tell me how to remove just the emojis from the text file. They use the bash line:

grep -P "[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]" myfile.txt | more

which gets me all the lines containing an emoji.

grep -Pv will remove those lines from the input,

grep -Po will return just the emojis,

grep -Pov returns nothing.

Does anyone know how to remove those specific characters from the text?

Note: I am aware of this question, but my text file is not at all formatted. Emojis are mixed in with the rest of the text.

Dovetailed answered 16/10, 2019 at 21:6 Comment(5)
Welcome to Stack Overflow. SO is a question and answer site for professional and enthusiast programmers. The goal is that you add some code of your own to your question to show at least the research effort you made to solve this yourself. - Disjointed
Also --> #45784127 may be helpful. - Coparcenary
Possibly related: https://mcmap.net/q/1477243/-how-to-detect-emoji-as-unicode-in-perl/7431860 - Kriskrischer
@Disjointed Edited the question with more info, is it possible to reopen? Unless it's a duplicate again and I'm just really bad at Google. - Dovetailed
Based on the edited question, I am voting to reopen this question so someone can provide a solution to remove the emojis instead of grepping them. - Wiggins
U
14

2020 UPDATE: Perl v5.32 uses Unicode 13 and supports several properties that deal with emoji. You can use the Emoji property.

Another update: The Emoji property has some surprising matches, including the plain decimal digits, as @anon noted in a comment. See also Why do Unicode emoji property escapes match numbers?. This one-liner lists the matches below U+10000:

$ perl -le 'use open qw(:std :utf8); for(1..0xFFFF){ next unless chr() =~ /\p{Emoji}/; printf "%04x %s matches Emoji\n", $_, chr()}'
0023 # matches Emoji
002a * matches Emoji
0030 0 matches Emoji
0031 1 matches Emoji
0032 2 matches Emoji
0033 3 matches Emoji
0034 4 matches Emoji
0035 5 matches Emoji
0036 6 matches Emoji
0037 7 matches Emoji
0038 8 matches Emoji
0039 9 matches Emoji
00a9 © matches Emoji
00ae ® matches Emoji
... many more unsurprising results ...

This means that this program is a bit aggressive:

#!perl
use v5.32;
use utf8;
use open qw(:std :utf8);

while( <<>> ) {  # double diamond (from v5.22)
    s/\p{Emoji}//g;
    print;
    }

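However, Perl v5.18's extended bracketed character classes ((?[ ... ])) can fix this with set operations. Before v5.36 the feature is experimental, so expect a warning unless you turn off the experimental::regex_sets category. Subtract the Latin-1 range (code points 1 through 255, written in octal as \001 through \377). That keeps the digits, # and *, but it also keeps © and ®:

while( <<>> ) {  # double diamond (from v5.22)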
    s/(?[ \p{Emoji} - [\001 - \377] ])//g;
    print;
    }

As a one-liner, this turns into:

% perl -CS -pe 's/(?[ \p{Emoji} - [\001 - \377] ])//g' file1 file2 ...

You might want to knock out or include other characters, so you subtract or add those to the set.
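
If you end up doing this outside Perl, the same subtract-a-range idea can be sketched in Python with plain sets of code points. This is only a sketch under assumptions: Python's built-in re module has no \p{Emoji} property, so the hypothetical EMOJI_RANGES list below is a small, incomplete stand-in (a few ranges from the question plus the surprising ASCII and Latin-1 matches), and build_pattern and strip_emoji_keep_latin1 are names made up for this example.

```python
import re

# Hypothetical, deliberately incomplete stand-in for \p{Emoji}:
# a few ranges from the question plus the "surprising" low matches.
EMOJI_RANGES = [
    (0x23, 0x23),        # '#'
    (0x2A, 0x2A),        # '*'
    (0x30, 0x39),        # digits 0-9
    (0xA9, 0xA9),        # copyright sign
    (0xAE, 0xAE),        # registered sign
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
    (0x1F300, 0x1F6FF),  # pictographs, emoticons, transport
    (0x1F900, 0x1F9FF),  # supplemental symbols
]


def build_pattern(ranges, subtract=range(0x01, 0x100)):
    """Compile a character class for `ranges` minus the `subtract` code points."""
    wanted = set()
    for lo, hi in ranges:
        wanted.update(range(lo, hi + 1))
    wanted.difference_update(subtract)
    # Escape every character so that ones like '#' or '*' stay literal.
    chars = "".join(re.escape(chr(cp)) for cp in sorted(wanted))
    return re.compile("[" + chars + "]")


# Same shape as (?[ \p{Emoji} - [\001 - \377] ]) in the Perl version.
EMOJI_MINUS_LATIN1 = build_pattern(EMOJI_RANGES)


def strip_emoji_keep_latin1(text):
    """Remove the listed emoji but leave code points 1..255 alone."""
    return EMOJI_MINUS_LATIN1.sub("", text)
```

Pass a different subtract argument (or an empty tuple) to knock out or keep other characters, which mirrors adding or subtracting sets in the Perl pattern.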

Character classes for older Perls

In Perl, removing the emojis can be easy. At its core, this is very close to how you'd do it in sed. Update the pattern and other details for your task:

#!perl
use utf8;
use open qw(:std :utf8);

my $pattern = "[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]";

while( <DATA> ) {  # use <> to read from command line
    s/$pattern//g;
    print;
    }

__DATA__
Emoji at end 😀
🗿 Emoji at beginning
Emoji 🙏 in middle

UTS #51 mentions an Emoji property, but older Perls don't list it in perluniprop. Were there such a thing, you could simplify that by removing anything with that property:

while( <DATA> ) {
    s/\p{Emoji}//g;
    print;
    }

There is the Emoticon property, but that doesn't cover your character class. I haven't looked to see if it would be the same as the Emoji property in UTS #51.
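
The question also asked about keeping the removed emoji in a separate file. Here is a minimal Python sketch of that, assuming the character class from the question transcribed to Python escapes. The names split_emoji and strip_file and the file paths are made up for this example, and the class has the same gaps as the original one.

```python
import re

# The character class from the question, transcribed to Python escapes.
EMOJI_CLASS = re.compile(
    "["
    "\U0001F300-\U0001F5FF\U0001F900-\U0001F9FF\U0001F600-\U0001F64F"
    "\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF\U0001F1E6-\U0001F1FF"
    "\U0001F191-\U0001F251\U0001F004\U0001F0CF\U0001F170-\U0001F171"
    "\U0001F17E-\U0001F17F\U0001F18E\u3030\u2B50\u2B55\u2934-\u2935"
    "\u2B05-\u2B07\u2B1B-\u2B1C\u3297\u3299\u303D\u00A9\u00AE\u2122"
    "\u23F3\u24C2\u23E9-\u23EF\u25B6\u23F8-\u23FA"
    "]"
)


def split_emoji(text):
    """Return (text with emoji removed, list of the removed characters)."""
    return EMOJI_CLASS.sub("", text), EMOJI_CLASS.findall(text)


def strip_file(src_path, clean_path, emoji_path):
    """Write the cleaned text and the removed emoji to two separate files."""
    with open(src_path, encoding="utf-8") as src:
        clean, removed = split_emoji(src.read())
    with open(clean_path, "w", encoding="utf-8") as out:
        out.write(clean)
    with open(emoji_path, "w", encoding="utf-8") as out:
        # One removed character per line, in order of appearance.
        out.write("".join(ch + "\n" for ch in removed))
```

Calling strip_file("input.txt", "clean.txt", "emoji.txt") would then leave the text in one file and the emoji in the other.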

User-defined Unicode properties

You can make your own properties by defining a subroutine whose name begins with In or Is, followed by the property name you choose. That subroutine returns a potentially multi-line string where each line is either a single hex code number or two hex code numbers separated by horizontal whitespace. A single number adds one character and a pair adds the whole range between them. Any character in all of that is then part of your property.

Here's that same character class as a user-defined Unicode property. Note that I use the squiggly heredoc, mostly because I can write the program locally with leading space so I can paste directly into Stack Overflow. The lines in IsEmoji cannot have leading space, but the indented heredoc takes care of that:

#!perl
use v5.26; # for indented heredoc
use utf8;
use open qw(:std :utf8);

while( <DATA> ) {  # use <> to read from command line
    s/\p{IsEmoji}//g;
    print;
    }

sub IsEmoji { <<~"HERE";
1f300 1f5ff
1f900 1f9ff
1f600 1f64f
1f680 1f6ff
2600 26ff
2700 27bf
1f1e6 1f1ff
1f191 1f251
1f004
1f0cf
1f170 1f171
1f17e 1f17f
1f18e
3030
2b50
2b55
2934 2935
2b05 2b07
2b1b 2b1c
3297
3299
303d
00a9
00ae
2122
23f3
24c2
23e9 23ef
25b6
23f8 23fa
HERE
}

__DATA__
Emoji at end 😀
🗿 Emoji at beginning
Emoji 🙏 in middle

You can put that in a module:

# IsEmoji.pm
sub IsEmoji { <<~"HERE";
1f300 1f5ff
...  # all that other stuff too
23f8 23fa
HERE
}

1;

Now you can use that in a one liner (the -I. adds the current directory to the module search path and the -M denotes a module to load):

$ perl -CS -I. -MIsEmoji -pe 's/\p{IsEmoji}//g' file1 file2

Beyond that, you're stuck with the long character class in your one-liner.
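
Python has no user-defined properties, but the same "hex" or "hex hex" line format is easy to turn into a character class if you want to keep one list for both languages. This is a rough sketch, not an equivalent of Perl's mechanism. RANGE_SPEC is a shortened, illustrative list and class_from_spec is a made-up helper name.

```python
import re

# A few lines in the same format the Perl subroutine returns.
# Shortened for illustration; this is not the full list from above.
RANGE_SPEC = """\
1f300 1f5ff
1f600 1f64f
2600 26ff
1f18e
2b50
"""


def class_from_spec(spec):
    """Turn "hex" or "hex<whitespace>hex" lines into a compiled character class."""
    parts = []
    for line in spec.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) > 2:
            raise ValueError(f"expected one or two hex numbers: {line!r}")
        lo = int(fields[0], 16)
        hi = int(fields[-1], 16)
        if lo == hi:
            parts.append(re.escape(chr(lo)))  # single character
        else:
            parts.append(re.escape(chr(lo)) + "-" + re.escape(chr(hi)))  # range
    return re.compile("[" + "".join(parts) + "]")


IS_EMOJI = class_from_spec(RANGE_SPEC)
```

As in the Perl version, two numbers on one line mean a range, so "1f004 1f0cf" would cover everything between the two, while putting them on separate lines matches only those two characters.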

Unclad answered 16/10, 2019 at 23:13 Comment(2)
BEWARE: \p{Emoji} filters numbers ([0-9]) too! There are number emojis, and somehow that means the pure number goes into that class too. –.– - Patriciapatrician
Thanks. It only took me 18 months to fix this answer. - Unclad
C
2

Try this:

1st Method

import emoji
import re

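# Sample input taken from the lines used elsewhere on this page;
# replace it with the lines of your own file.
test_list = ["Emoji at end 😀", "🗿 Emoji at beginning"]

## function to extract the emojis
## Assumes an older release of the emoji package in which
## emoji.UNICODE_EMOJI maps each emoji to its name; newer releases
## changed or removed this dict.
def extract_emojis(a_list):
    emojis_list = map(lambda x: ''.join(x.split()), emoji.UNICODE_EMOJI.keys())
    r = re.compile('|'.join(re.escape(p) for p in emojis_list))
    aux = [' '.join(r.findall(s)) for s in a_list]
    return aux

## Executing function
print(extract_emojis(test_list))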

2nd Method

import re
import sys

def remove_emoji(string):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        # Caution: this range runs from U+24C2 to U+1F251, so it also
        # removes CJK text and many other non-emoji characters.
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

if __name__ == '__main__':
    # Read the file named on the command line as UTF-8.
    with open(sys.argv[1], encoding='utf-8') as fh:
        text = fh.read()
    text = remove_emoji(text)
    print(text)
Creese answered 6/9, 2020 at 19:30 Comment(0)
R
2

The following bash script is one example of how you can strip emoji using sed. This requires bash 4.2 or later to support \U (so on macOS, you'll need to brew install bash).

The emoji range is taken from Suhail Gupta's answer and reformatted to make it bash-compatible.

We are using this to strip emoji from a Deliverfile for use with Fastlane, in order to upload to the Apple App Store, which does not allow emoji in a number of fields.

#!/usr/bin/env bash
# ^ use bash from path, not from /bin/bash https://mcmap.net/q/23026/-why-is-usr-bin-env-bash-superior-to-bin-bash
emoji="\U1f300-\U1f5ff\U1f900-\U1f9ff\U1f600-\U1f64f\U1f680-\U1f6ff\U2600-\U26ff\U2700-\U27bf\U1f1e6-\U1f1ff\U1f191-\U1f251\U1f004\U1f0cf\U1f170-\U1f171\U1f17e-\U1f17f\U1f18e\U3030\U2b50\U2b55\U2934-\U2935\U2b05-\U2b07\U2b1b-\U2b1c\U3297\U3299\U303d\U00a9\U00ae\U2122\U23f3\U24c2\U23e9-\U23ef\U25b6\U23f8-\U23fa"
sample="This 🍒 is ⭐ a 🐢 line 🤮 of 😃 emoji ✈"
echo "$sample"
echo "$sample" | LC_ALL=UTF-8 sed -e "s/[$(printf "$emoji")]//g"

This gives the result:

This 🍒 is ⭐ a 🐢 line 🤮 of 😃 emoji ✈
This  is  a  line  of  emoji

Note how the ✈ character (U+2708) is also stripped, even though it does not look like a coloured emoji. Adding the variation selector U+FE0F will turn this into an emoji-styled ✈️ on systems that support it. You may want to tweak your regex to only strip colourful emoji characters, depending on your circumstances.
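
To make the variation selector point concrete, here is a small Python sketch. It uses a cut-down class (DINGBATS and PICTO are illustrative subsets, not the full list above), and the "only when followed by U+FE0F" rule is just a heuristic I'm assuming for the example: some characters in the dingbat range render as colour emoji even without the selector.

```python
import re

VS16 = "\ufe0f"  # VARIATION SELECTOR-16, requests emoji presentation

# Illustrative subsets only, not the full class from the script above.
DINGBATS = "[\u2600-\u27bf]"
PICTO = "[\U0001F300-\U0001F6FF\U0001F900-\U0001F9FF]"

# Removes the base character but leaves an orphaned U+FE0F behind.
naive = re.compile(f"{DINGBATS}|{PICTO}")

# Removes the base character together with a following selector.
strip_all = re.compile(f"(?:{DINGBATS}|{PICTO}){VS16}?")

# Heuristic: keep bare dingbats such as U+2708, strip them only when
# they carry the emoji selector; always strip the pictograph blocks.
strip_styled = re.compile(f"{DINGBATS}{VS16}|{PICTO}{VS16}?")

sample = "plain \u2708 styled \u2708\ufe0f face \U0001F603"
print(repr(naive.sub("", sample)))
print(repr(strip_all.sub("", sample)))
print(repr(strip_styled.sub("", sample)))
```

The first pattern shows why a plain character class is not quite enough: the selector survives as an invisible leftover in the output.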

Ratcliffe answered 11/5, 2021 at 23:52 Comment(1)
Hi, thanks for your answer, it works correctly! I'll just add that this site: rapidtables.com/convert/number/ascii-to-hex.html can convert emojis to hex, so if someone in the future wants to add a new emoji to the list, they can. - Chimera
C
-2

You can remove the whole emoji table (https://apps.timwhitlock.info/emoji/tables/unicode):

perl -e '$t=pack("H*", "f09f9889"); print$t,$/; $t=~s/\xF0\x9F[\x98-\x99][\x81-\x8F]//; print$t,$/'
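
The one-liner above matches raw UTF-8 bytes rather than decoded characters. For comparison, here is a hedged Python sketch of the same byte-level approach for the emoticon block U+1F600 to U+1F64F only. EMOTICON_BYTES and strip_emoticon_bytes are names invented for this example, and covering the whole table this way would need one alternative per byte range.

```python
import re

# UTF-8 encodes U+1F600..U+1F63F as F0 9F 98 80..BF
# and U+1F640..U+1F64F as F0 9F 99 80..8F.
EMOTICON_BYTES = re.compile(rb"\xf0\x9f(?:\x98[\x80-\xbf]|\x99[\x80-\x8f])")


def strip_emoticon_bytes(data: bytes) -> bytes:
    """Remove emoticon-block characters from UTF-8 encoded bytes."""
    return EMOTICON_BYTES.sub(b"", data)
```

Working on decoded text with a character class is usually easier to read and extend, but the byte form can be handy when the input is processed as bytes anyway.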
Conundrum answered 8/1, 2022 at 23:0 Comment(1)
The perl script does not remove anything. It just prints out a smilie. - Patriciapatrician
