Android - How to filter emoji (emoticons) from a string?

Asked 4/3, 2014 at 17:6 Answered 18/4, 2023 at 11:9

I'm working on an Android app, and I do not want people to use emoji in the input.

How can I remove emoji characters from a string?

Deter answered 4/3, 2014 at 17:6 Comment(4)

Regular expressions are an option. Or if the list of emojis is well known, a simple list that you can iterate through and remove matches in your input would work well. – Karykaryl 4/3, 2014 at 17:10

See #12013841 – Awn 4/3, 2014 at 17:10

You can use Character class https://mcmap.net/q/275464/-check-if-letter-is-emoji/… – Mopey 14/12, 2016 at 16:31

@Mopey That's not what was being asked here. The Character class can indeed recognize surrogate pairs, but that does not mean the character is an emoji. E.g. U+1D120 is not an emoji but is a surrogate pair. – Deter 15/12, 2016 at 2:34

Emoji can be found in the following Unicode ranges (source):

U+2190 to U+21FF
U+2600 to U+26FF
U+2700 to U+27BF
U+3000 to U+303F
U+1F300 to U+1F64F
U+1F680 to U+1F6FF

You can use this line in your script to filter them all at once:
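As a hedged Java sketch of that idea: the class and method names below are hypothetical, and it walks code points instead of using a regex, because ranges above U+FFFF cannot be written with a four-digit \u escape.

```java
final class EmojiRangeFilter {
    // Inclusive code point ranges copied from the list above.
    // They are not a complete list of emoji.
    private static final int[][] RANGES = {
        {0x2190, 0x21FF},
        {0x2600, 0x26FF},
        {0x2700, 0x27BF},
        {0x3000, 0x303F},
        {0x1F300, 0x1F64F},
        {0x1F680, 0x1F6FF},
    };

    // Returns true if the code point falls inside one of the listed ranges.
    static boolean inListedRange(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Copies every code point that is outside the listed ranges.
    static String removeListedRanges(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (!inListedRange(codePoint)) {
                out.appendCodePoint(codePoint);
            }
            // Advance by 1 char for BMP code points, 2 for surrogate pairs.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

Note that U+3000 to U+303F is the CJK Symbols and Punctuation block, so this sketch also strips characters such as the ideographic full stop; drop that range if the input may contain CJK text.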

Epicurean answered 16/3, 2014 at 10:27 Comment(4)

this is one potential answer but does not handle all cases. But nonetheless – Buncombe 7/6, 2014 at 1:50

@Buncombe what cases does it not handle? It's not useful to say "this doesn't handle all cases" if you don't have an example. – Stereoscope 26/9, 2014 at 18:2

\u expects 4 digits -- how is this supposed to work for 1f300 etc? – Erickaericksen 24/4, 2017 at 23:28

Not working. In the end I used github.com/vdurmont/emoji-java. For example, removing all emojis: EmojiParser.removeAllEmojis(text); – Eloiseelon 3/12, 2017 at 18:56

Latest emoji data can be found here:

http://unicode.org/Public/emoji/

Each folder there is named after an emoji version. As an app developer, it is a good idea to use the latest version available.

Inside a version folder you will find several text files. Check emoji-data.txt; it lists all standard emoji code points.

Emoji are spread over many small code point ranges. For the best coverage, check all of them in your app.
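A minimal sketch of reading those ranges, assuming the data lines look like "1F600..1F64F ; Emoji # ..." (a code point or range, a semicolon, a property name, then a comment). The class and method names are hypothetical, and loading the file itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class EmojiDataParser {
    // Parses data lines into inclusive {start, end} ranges for one property name.
    static List<int[]> parseRanges(List<String> lines, String property) {
        List<int[]> ranges = new ArrayList<>();
        for (String line : lines) {
            // Everything after '#' is a comment.
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue;
            }
            String[] fields = data.split(";");
            if (fields.length < 2 || !fields[1].trim().equals(property)) {
                continue;
            }
            // The first field is either "XXXX" or "XXXX..YYYY" in hex.
            String[] bounds = fields[0].trim().split("\\.\\.");
            int start = Integer.parseInt(bounds[0], 16);
            int end = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : start;
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    // Returns true if the code point lies in any of the parsed ranges.
    static boolean contains(List<int[]> ranges, int codePoint) {
        for (int[] range : ranges) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }
}
```

One caveat when choosing the property: the Emoji property in that file also covers '#', '*' and the digits 0 to 9, so a filter built on it would need to exclude those or use a narrower property.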

Some people ask why there are 5-digit codes when only 4 hex digits can follow \u. Code points above U+FFFF are encoded in UTF-16 as surrogate pairs, so two chars are usually needed to encode one emoji.
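To make the pairing concrete, here is a small sketch using U+1F600 as the example. The class and method names are hypothetical; Character.toChars is the standard library call, and combine repeats the arithmetic by hand.

```java
final class SurrogateDemo {
    // Splits a code point into its UTF-16 code units (1 or 2 chars).
    static char[] toUnits(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Rebuilds the code point from a high and a low surrogate by hand.
    static int combine(char high, char low) {
        return ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
    }
}
```

For U+1F600, toUnits returns the two chars 0xD83D and 0xDE00, which is why the literal is written "\uD83D\uDE00" and why its length() is 2.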

For example, take a string:

String s = ...;

Get its UTF-16 representation:

byte[] utf16 = s.getBytes("UTF-16BE");

Iterate over the UTF-16 code units, two bytes at a time:

for(int i = 0; i < utf16.length; i += 2) {

Assemble one char from each pair of bytes:

char c = (char)((char)(utf16[i] & 0xff) << 8 | (char)(utf16[i + 1] & 0xff));

Now check for surrogate pairs. The emoji that need a pair are located on Plane 1 (U+10000..U+1FFFF), so check whether the first half of the pair is in the range 0xd800..0xd83f.

if(c >= 0xd800 && c <= 0xd83f) {
    high = c;
    continue;
}

The second half of a surrogate pair is in the range 0xdc00..0xdfff. Once we have it, we can convert the pair to one 5-digit code.

else if(c >= 0xdc00 && c <= 0xdfff) {
    low = c;
    long unicode = (((long)high - 0xd800) * 0x400) + ((long)low - 0xdc00) + 0x10000;
}

All other chars are not part of a pair, so process them as they are.

else {
    long unicode = c;
}

Now use the data from emoji-data.txt to check whether the code is an emoji. If it is, skip it. If not, copy its bytes to the output byte array.

Finally, convert the byte array to a String:

String out = new String(outarray, Charset.forName("UTF-16BE"));
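Put together, the steps above can be sketched as one method. This is an illustration under assumptions, not the exact code of this answer: it works on the String's chars directly instead of a UTF-16BE byte array, it pairs any high surrogate rather than only 0xd800..0xd83f, and the isEmoji predicate is a hypothetical stand-in for a lookup built from emoji-data.txt.

```java
import java.util.function.IntPredicate;

final class SurrogateAwareFilter {
    // Removes every code point for which isEmoji returns true.
    static String removeMatching(String s, IntPredicate isEmoji) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int unicode;
            int units;
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Surrogate pair: combine both halves into one code point.
                char low = s.charAt(i + 1);
                unicode = ((c - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
                units = 2;
            } else {
                // Not a pair: the char is the code point.
                unicode = c;
                units = 1;
            }
            if (!isEmoji.test(unicode)) {
                // Keep the original char(s) unchanged.
                out.append(s, i, i + units);
            }
            i += units;
        }
        return out.toString();
    }
}
```

A call such as removeMatching(text, cp -> cp >= 0x1F600 && cp <= 0x1F64F) would strip only that one range; in practice the predicate would consult the full set of ranges.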

Minda answered 6/9, 2017 at 2:36 Comment(4)

P.S. If you want to remove some additional symbols, Unicode ranges can be found here: jrgraphix.net/research/unicode.php – Minda 3/10, 2017 at 3:39

link seems broken to me :( – Methodism 16/8, 2019 at 19:19

@Methodism Works for me. try to Google "Unicode Character Ranges" – Minda 27/8, 2019 at 10:40

Here's the emoji-data.txt file for version 13.0.0: unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt. For next versions, go to unicode.org/Public > [version] > ucd > emoji – Balmung 28/10, 2019 at 5:4

For those using Kotlin, Char.isSurrogate can help as well. Find the indexes for which it returns true and remove the chars at those indexes.
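The same idea in Java, as a rough sketch with hypothetical class and method names, uses Character.isSurrogate in place of Kotlin's Char.isSurrogate:

```java
final class SurrogateStripper {
    // Drops every char that is half of a surrogate pair and keeps the rest.
    static String removeSurrogates(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!Character.isSurrogate(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

This removes every character above U+FFFF, including non-emoji such as U+1D120, and it leaves emoji that fit in a single char, such as U+2764, untouched.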

Sella answered 8/11, 2019 at 18:10 Comment(1)

It won't help if the emoji is composed of more than one code point, such as the skin-toned ones. – Cruickshank 12/1, 2021 at 12:8

Here is what I use to remove emojis. Note: this only works on API 24 and later.

// Requires: import android.os.Build;
public String remove_Emojis_For_Devices_API_24_Onwards(String name) {
    // Collects every char that is kept
    StringBuilder newName = new StringBuilder();

    // Character.UnicodeScript.of() was not added until API 24, so this is an API 24+ solution;
    // on older devices the method returns an empty string
    if (Build.VERSION.SDK_INT >= 24) {
        // Walk the string one char (UTF-16 code unit) at a time and look up its Unicode script
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            // The surrogate halves that encode an emoji report UNKNOWN, so those chars are skipped
            if (Character.UnicodeScript.of(c) != Character.UnicodeScript.UNKNOWN) {
                newName.append(c); // not part of an emoji, so keep it
            }
        }
    }
    return newName.toString();
}

so if we pass in a string:

remove_Emojis_For_Devices_API_24_Onwards("😊 test 😊 Indic:ढ Japanese:な 😊 Korean:ㅂ");

it returns: test Indic:ढ Japanese:な Korean:ㅂ

The position and number of emoji in the input do not matter.

Chromatin answered 18/5, 2017 at 20:5 Comment(4)

Really interesting but not perfect. This couldn't filter "❤" and "☤", which reside in the Dingbats and Miscellaneous Symbols blocks. – Acervate 31/8, 2018 at 0:52

@Acervate Those are not emojis – Trabeated 2/4, 2019 at 14:7

@Trabeated YES IT IS. See this. unicode.org/cldr/utility/… – Acervate 2/4, 2019 at 23:59

Emoji are scattered across several blocks, which is why it's hard to filter all of them. You can find them in the General Punctuation, Dingbats, Emoticons, Miscellaneous Symbols, Miscellaneous Symbols And Pictographs, Supplemental Symbols and Pictographs, and Transport and Map Symbols blocks. As for "☤", I think that was a typo or something, but "❤" really is an emoji. – Acervate 3/4, 2019 at 0:0

// Keeps only ASCII chars (0..127); this drops emoji, but also every other non-ASCII character
private String removeEmojis(String input) {
    StringBuilder output = new StringBuilder();
    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        if (c <= 127) {
            output.append(c);
        }
    }
    return output.toString();
}

Salinometer answered 18/4, 2023 at 11:9 Comment(1)

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Ennoble 22/4, 2023 at 4:53
