How do I remove emoji from string
S

11

29

I need to remove emoji from a string using a regex, without removing CJK (Chinese, Japanese, Korean) characters. I tried this regex:

REGEX = /[^\u1F600-\u1F6FF\s]/i

This regex works fine except that it also matches Chinese, Japanese and Korean characters, which I need to keep. Any idea how to solve this issue?

Sergias answered 10/7, 2014 at 9:22 Comment(2)
there are a LOT of emoji - maybe it's better to make a blacklist of characters to remove?Verrazano
@Verrazano mostly those Emojis that are in iPhone and Android KeyboardSergias
R
38

Karol S already provided a solution, but the reason might not be clear:

"\u1F600" is actually "\u1F60" followed by "0":

"\u1F60"    # => "ὠ"
"\u1F600"   # => "ὠ0"

You have to use curly braces for code points above FFFF:

"\u{1F600}" #=> "😀"

Therefore the character class [\u1F600-\u1F6FF] is interpreted as [\u1F60 0-\u1F6F F], i.e. it matches "\u1F60", the range "0".."\u1F6F" and "F".
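This interpretation can be checked directly. The following is only a small sketch; the constant name BROKEN_CLASS and the sample strings are illustrative and not part of the question:

```ruby
# "\uXXXX" takes exactly four hex digits; a fifth digit is a separate character.
four_digit = "\u1F600"
puts four_digit.length             # 2
puts four_digit == "\u1F60" + "0"  # true

# The same rule applies inside a character class (illustrative constant name).
BROKEN_CLASS = /[\u1F600-\u1F6FF]/

puts BROKEN_CLASS.match?("a")          # true: "a" lies in the range "0".."\u1F6F"
puts BROKEN_CLASS.match?("F")          # true
puts BROKEN_CLASS.match?("\u{1F600}")  # false: the emoji itself is not in the class
```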

Using curly braces solves the issue:

/[\u{1F600}-\u{1F6FF}]/

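This matches (emoji) characters in these Unicode blocks:

  • Emoticons (U+1F600 to U+1F64F)

  • Ornamental Dingbats (U+1F650 to U+1F67F)

  • Transport and Map Symbols (U+1F680 to U+1F6FF)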


You can also use unpack, pack, and between? to achieve a similar result. This also works for Ruby 1.8.7 which doesn't support Unicode in regular expressions.

s = 'Hi!😀'
#=> "Hi!\360\237\230\200"

s.unpack('U*').reject{ |e| e.between?(0x1F600, 0x1F6FF) }.pack('U*')
#=> "Hi!" 
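On Ruby 1.9 and later the same idea can be written with each_char and ord, which also makes it easy to test several ranges at once. This is a sketch under assumptions: EMOJI_RANGES and strip_code_points are hypothetical names, and the ranges listed are only examples, not a complete emoji list.

```ruby
# Illustrative ranges only; extend the list as needed.
EMOJI_RANGES = [
  0x1F300..0x1F5FF, # Miscellaneous Symbols and Pictographs
  0x1F600..0x1F64F, # Emoticons
  0x1F680..0x1F6FF  # Transport and Map Symbols
].freeze

# Hypothetical helper: drop every character whose code point falls in a listed range.
def strip_code_points(str, ranges = EMOJI_RANGES)
  str.each_char.reject { |ch| ranges.any? { |r| r.cover?(ch.ord) } }.join
end

puts strip_code_points("Hi!\u{1F600}") # Hi!
```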

Regarding your Rubular example, emoji are single characters:

"😀".length  #=> 1
"😀".chars   #=> ["😀"]

Whereas kaomoji are a combination of multiple characters:

"^_^".length #=> 3
"^_^".chars  #=> ["^", "_", "^"]

Matching these is a very different task (and you should ask that in a separate question).

Reward answered 10/7, 2014 at 10:31 Comment(12)
I have tried the regex that you provided. and this is the link and this is using my regex that I mentioned in the question link. It does not work with your regex, mine is working but has problem like I mentioned in the question.Sergias
@Sergias those are not emoji but kaomojiReward
Oh, I see, let's just focus on emoji then. I am building this to prevent users from submitting emoji from iOS/Android to our server. I know they can disable it on the phone keyboard, but I still need to filter it out on the server side. Your regex works fine for emoji, but it does not work for other emoticons like "house", "town", "animals", etc. I have removed the kaomojiSergias
You have to add the appropriate character ranges, e.g. to include Miscellaneous Symbols And Pictographs start with U+1F300: /[\u{1F300}-\u{1F6FF}]/. Karol S already mentioned that.Reward
Yes, correct. I am using that one right now, but it doesn't seem to cover all the emoji or the miscellaneous symbols. I am on Mac OS X, where ctrl + cmd + space opens the emoji picker. Not all of those characters are covered by /[\u{1F300}-\u{1F6FF}]/. Any help? Thank you very muchSergias
@Sergias I've posted a follow-up questionReward
Are you sure about "^_^".length #=> 1?Intricate
@Reward this is a great start but doesn't match all of "🆘🆑🚳🆔🚫" ... I found this solution in Ruby which seems to work well: #16488197Monastery
Stefan, your edit didn't work for Ruby 1.8.7: between? does not treat the number as inside the hex range, so it returns false and the emoji is not rejectedHarmonia
@G.I.Joe it does work, 0x1F600 is just another way of writing 128512.Reward
no it doesn't, I ran it even your range is repeating the limitsHarmonia
@G.I.Joe sorry, there was a typo, the upper limit of course has to be 0x1F6FFReward
W
23

I am using one based on this script.

def strip_emoji(text)
  # work on a copy so the caller's string is not modified (also works for frozen strings)
  clean = text.dup.force_encoding('utf-8').encode

  # symbols & pics
  clean = clean.gsub(/[\u{1f300}-\u{1f5ff}]/, "")

  # enclosed chars
  clean = clean.gsub(/[\u{2500}-\u{2BEF}]/, "") # I changed this to exclude chinese char

  # emoticons
  clean = clean.gsub(/[\u{1f600}-\u{1f64f}]/, "")

  # dingbats (this range already lies inside 2500-2BEF above)
  clean.gsub(/[\u{2702}-\u{27b0}]/, "")
end

Results:

irb> strip_emoji("👽😀☂❤华み원❤")
=> "华み원"
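Since every step removes a plain code point range, the ranges can also be merged into one character class and applied in a single pass. A minimal sketch, assuming the same ranges as above (the dingbats range is omitted because it is contained in 2500-2BEF); STRIP_BLOCKS and strip_emoji_once are hypothetical names:

```ruby
# One character class covering the ranges used in the answer above.
STRIP_BLOCKS = /[\u{1F300}-\u{1F5FF}\u{1F600}-\u{1F64F}\u{2500}-\u{2BEF}]/

# Hypothetical single-pass variant; returns a new string and leaves the input untouched.
def strip_emoji_once(text)
  text.gsub(STRIP_BLOCKS, "")
end

puts strip_emoji_once("\u{1F47D}\u{1F600}\u{2602}\u{2764}华み원\u{2764}") # 华み원
```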
Winthorpe answered 29/10, 2015 at 7:14 Comment(3)
great answer.. saves my day.. !! :)Barrios
This worked well for me. I created a EmojiStripper concern that uses a before_validation callback to strip emojis from all string fields before validation is executed. That results in all emojis being stripped before it is saved to the DB.Eserine
WARNING: THE CODE IN THIS ANSWER WILL NOT REMOVE ALL EMOJIS. It removes simple 😀emojis fine, but it does not fully remove multi code points emojis correctly, such as 👨‍👩‍👧‍👦or ☸️.Wenger
M
20

This regex matches all 845 emoji, taken from Emoji unicode characters for use on the web:

[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]

I generated this regex directly from the raw list of Unicode emoji. The algorithm is here: https://github.com/franklsf95/ruby-emoji-regex.
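For readers who want to build a similar class from their own list, here is a rough sketch of one way to collapse code points into ranges. It is not the algorithm from the linked repository; emoji_class is a hypothetical helper and the input list is made up for illustration.

```ruby
# Hypothetical helper: turn a list of code points into a regex character class,
# merging consecutive code points into ranges.
def emoji_class(codepoints)
  runs = codepoints.uniq.sort.slice_when { |a, b| b != a + 1 }
  body = runs.map do |run|
    if run.size == 1
      format('\u{%X}', run.first)
    else
      format('\u{%X}-\u{%X}', run.first, run.last)
    end
  end.join
  Regexp.new("[#{body}]")
end

re = emoji_class([0x2194, 0x2195, 0x2196, 0x203C, 0x1F600])
puts re.source                              # [\u{203C}\u{2194}-\u{2196}\u{1F600}]
puts "a\u{2195}b\u{1F600}".gsub(re, '')     # ab
```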

Example usage:

regex = /[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]/
str = "I am a string with emoji 😍😍😱😱👿👿🐔🌚 and other Unicode characters 比如中文."
str.gsub regex, ''
# "I am a string with emoji  and other Unicode characters 比如中文."

Other Unicode characters, such as Asian characters, are preserved.

EDIT: I updated the regex to exclude ASCII numbers and symbols. See comments for details.

Mutiny answered 18/3, 2015 at 7:1 Comment(2)
Huh, I pasted this into rubular and found that it matched numbers, tooArquebus
@Arquebus Thanks for catching this! I have excluded numbers and other ASCII characters from the emoji. The reason numbers were included is that some emoji are of the form 8⃣ (U+0038 U+20E3). I manually removed those ASCII codes.Mutiny
W
17

Most of the answers in this thread don't remove all emojis correctly. They remove simple emojis like 😀 fine. But they won't fully remove multi code point emojis like 👨‍👩‍👧‍👦 or ☸️, leaving some residual unicode code points behind.
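To see what is left behind, it helps to spell the sequences out as code points. The sketch below uses an illustrative range-based regex (SIMPLE, not taken from any particular answer) and then a variant that also strips the zero-width joiner (U+200D) and variation selector-16 (U+FE0F). Stripping U+200D everywhere is an assumption that may be wrong for text in scripts that use joiners legitimately.

```ruby
FAMILY = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}" # family emoji
WHEEL  = "\u{2638}\u{FE0F}"                                           # wheel of dharma

# Illustrative range-based regex.
SIMPLE = /[\u{1F300}-\u{1F6FF}\u{2600}-\u{27BF}]/

p FAMILY.gsub(SIMPLE, '').codepoints # [8205, 8205, 8205], three leftover joiners
p WHEEL.gsub(SIMPLE, '').codepoints  # [65039], the leftover variation selector

# Same ranges plus the two invisible code points.
WITH_JOINERS = /[\u{1F300}-\u{1F6FF}\u{2600}-\u{27BF}\u{200D}\u{FE0F}]/

p FAMILY.gsub(WITH_JOINERS, '') # ""
p WHEEL.gsub(WITH_JOINERS, '')  # ""
```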

You could use a gem like unicode-emoji to get the latest emoji regexes, but if you find this overkill the following code might be a good enough solution:

text.gsub(/[^[:alnum:][:blank:][:punct:]]/, '').squeeze(' ').strip

This will remove any emoji or weird-unicody-like character that is not a basic alphanum/punct/blank.

Wenger answered 4/10, 2018 at 17:30 Comment(0)
B
11
REGEX = /[^\u{1F600}-\u{1F6FF}\s]/

or

REGEX = /[\u{1F600}-\u{1F6FF}\s]/
REGEX = /[\u{1F600}-\u{1F6FF}]/
REGEX = /[^\u{1F600}-\u{1F6FF}]/

because your original regex seems to look for everything that is neither an emoji nor whitespace, and I don't know why you would want to do that.

Also:

  • the emoji are 1F300-1F6FF rather than 1F600-1F6FF; you may want to change that

  • if you want to remove all astral characters (for example you deal with a software that doesn't support all of Unicode), you should use 10000-10FFFF.
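As a rough illustration of the second point, the sketch below strips everything outside the Basic Multilingual Plane. ASTRAL and strip_astral are hypothetical names. Note that this also removes rarer CJK ideographs that live in the supplementary planes (for example U+20000), which may or may not be acceptable.

```ruby
# All code points above U+FFFF.
ASTRAL = /[\u{10000}-\u{10FFFF}]/

# Hypothetical helper: remove every astral character.
def strip_astral(text)
  text.gsub(ASTRAL, '')
end

p strip_astral("华み원 \u{1F600}") # "华み원 "
p strip_astral("\u{20000}")        # "" (CJK Extension B ideograph is removed too)
p strip_astral("\u{2764}")         # "❤" (BMP symbols are kept)
```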

EDIT: You almost certainly want REGEX = /[\u{1F600}-\u{1F6FF}]/ or similar. Your original regex matched everything that is not a whitespace, and not in range 0-\u1F6F. Since spaces are whitespace, and English letters are in range 0-\u1F6F, and Chinese characters are in neither, the regex matched Chinese characters and removed them.

Becca answered 10/7, 2014 at 9:45 Comment(6)
Thanks for the reply. I have tried all your regexes in Rubular and none of them work. This is my link, but it has the problem that I stated in the question...Sergias
Your sample list doesn't contain any emoji, it contains kaomoji. Kaomoji are made from mix of letters and symbols, you can't remove them with a simple regex.Becca
Yes, my mistake, now I understand how it works... Thanks for your replySergias
Any idea why my regex doesn't compile? I'm doing ls | perl -e 'print if /[^\u{1F600}-\u{1F6FF}\s]/' to find filenames containing emoji.Morganne
@SridharSarnobat Assuming your system's locale is UTF-8, you need to tell Perl to use UTF-8 on standard I/O: ls | perl -CSD -ne 'print if /[^\u{1F600}-\u{1F6FF}\s]/'Becca
@KarolS thanks, I need to try this when I get home!Morganne
C
2

Instead of removing emoji characters, you can keep only ASCII letters and digits. A simple tr should do the trick: .tr('^A-Za-z0-9', ''). Of course this will remove all punctuation, whitespace and non-ASCII letters (including CJK), but you can always modify the character set to suit your specific condition.
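A short sketch of what this does in practice; the sample string and the extended character set in the last line are illustrative only:

```ruby
TR_SAMPLE = "Hello, 世界 \u{1F600} 42"

# With an empty replacement, tr deletes the listed characters; "^" negates the set.
ascii_only = TR_SAMPLE.tr('^A-Za-z0-9', '')
puts ascii_only                                    # Hello42
puts ascii_only == TR_SAMPLE.delete('^A-Za-z0-9')  # true

# Keeping spaces and a few punctuation marks as well (illustrative set).
p TR_SAMPLE.tr('^A-Za-z0-9 ,.!?', '')              # "Hello,   42"
```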

Conduct answered 28/8, 2015 at 6:39 Comment(0)
H
1

This very short Regex covers all Emoji in getemoji.com so far:

[\u{1F300}-\u{1F5FF}\u{1F1E6}-\u{1F1FF}\u{2700}-\u{27BF}\u{1F900}-\u{1F9FF}\u{1F600}-\u{1F64F}\u{1F680}-\u{1F6FF}\u{2600}-\u{26FF}]
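One detail worth keeping in mind with classes like this: inside square brackets, | is a literal character rather than alternation, so ranges are simply listed side by side. A small sketch using a two-range excerpt (the constant names and sample text are illustrative):

```ruby
# Two-range excerpt, once with and once without "|" between the ranges.
WITH_PIPE    = /[\u{1F600}-\u{1F64F}|\u{1F680}-\u{1F6FF}]/
WITHOUT_PIPE = /[\u{1F600}-\u{1F64F}\u{1F680}-\u{1F6FF}]/

PIPE_SAMPLE = "a|b \u{1F600}\u{1F680}"

p PIPE_SAMPLE.gsub(WITH_PIPE, '')    # "ab " (the pipe is removed along with the emoji)
p PIPE_SAMPLE.gsub(WITHOUT_PIPE, '') # "a|b "
```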
Helsa answered 10/1, 2018 at 10:43 Comment(1)
Same regexp using \U (for Python, Postgres, etc.): [\U0001F300-\U0001F5FF\U0001F1E6-\U0001F1FF\U00002700-\U000027BF\U0001F900-\U0001F9FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\U00002600-\U000026FF]Tarbox
S
1

Careful: the answer from Aray has some side effects.

"-".gsub(/[^\p{L}\s]+/, '').squeeze(' ').strip
=> ""

even though this is supposed to be a simple minus sign (-)
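The side effect comes from keeping only letters and whitespace. One possible, less strict variant also keeps numbers and punctuation. This is a sketch, not a complete solution: LESS_STRICT is a hypothetical name, and symbols in the Unicode S categories (such as + or $) would still be removed.

```ruby
LETTERS_ONLY = /[^\p{L}\s]+/
LESS_STRICT  = /[^\p{L}\p{N}\p{P}\s]+/ # assumption: keep letters, numbers, punctuation, whitespace

ORDER_SAMPLE = "Order #42 - done! \u{1F600}"

p ORDER_SAMPLE.gsub(LETTERS_ONLY, '').squeeze(' ').strip # "Order done"
p ORDER_SAMPLE.gsub(LESS_STRICT, '').squeeze(' ').strip  # "Order #42 - done!"
p "-".gsub(LESS_STRICT, '')                              # "-"
```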

Sordid answered 17/7, 2018 at 13:14 Comment(0)
E
0

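I converted the regex from the Ruby project above to a JavaScript-friendly regex, shown here as a C# constant. Caveat: \uXXXX escapes take exactly four hex digits, so the five-digit entries (e.g. \u1F004) are read as \u1F00 followed by a literal digit and do not match astral emoji as intended: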

    /// <summary>
    /// Emoji symbols character sets (added \s and +)
    /// Unicode with עברית Delete the emoji to match 👿
    /// https://regex101.com/r/jP5jC5/3
    /// https://github.com/franklsf95/ruby-emoji-regex
    /// https://mcmap.net/q/478055/-how-do-i-remove-emoji-from-string
    /// </summary>
    public const string Emoji = @"^[\s\u00A9\u00AE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9-\u21AA\u231A-\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA-\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614-\u2615\u2618\u261D\u2620\u2622-\u2623\u2626\u262A\u262E-\u262F\u2638-\u263A\u2648-\u2653\u2660\u2663\u2665-\u2666\u2668\u267B\u267F\u2692-\u2694\u2696-\u2697\u2699\u269B-\u269C\u26A0-\u26A1\u26AA-\u26AB\u26B0-\u26B1\u26BD-\u26BE\u26C4-\u26C5\u26C8\u26CE-\u26CF\u26D1\u26D3-\u26D4\u26E9-\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733-\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763-\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934-\u2935\u2B05-\u2B07\u2B1B-\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299\u1F004\u1F0CF\u1F170-\u1F171\u1F17E-\u1F17F\u1F18E\u1F191-\u1F19A\u1F201-\u1F202\u1F21A\u1F22F\u1F232-\u1F23A\u1F250-\u1F251\u1F300-\u1F321\u1F324-\u1F393\u1F396-\u1F397\u1F399-\u1F39B\u1F39E-\u1F3F0\u1F3F3-\u1F3F5\u1F3F7-\u1F4FD\u1F4FF-\u1F53D\u1F549-\u1F54E\u1F550-\u1F567\u1F56F-\u1F570\u1F573-\u1F579\u1F587\u1F58A-\u1F58D\u1F590\u1F595-\u1F596\u1F5A5\u1F5A8\u1F5B1-\u1F5B2\u1F5BC\u1F5C2-\u1F5C4\u1F5D1-\u1F5D3\u1F5DC-\u1F5DE\u1F5E1\u1F5E3\u1F5EF\u1F5F3\u1F5FA-\u1F64F\u1F680-\u1F6C5\u1F6CB-\u1F6D0\u1F6E0-\u1F6E5\u1F6E9\u1F6EB-\u1F6EC\u1F6F0\u1F6F3\u1F910-\u1F918\u1F980-\u1F984\u1F9C0}]+$";

Usage:

if (!Regex.IsMatch(vm.NameFull, RegExKeys.Emoji)) // Match means no Emoji was found
Enumerate answered 19/8, 2015 at 11:25 Comment(0)
R
0

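In Android/Kotlin you can use this extension function to remove emojis from a String. Note that it keeps only letters and whitespace, so digits and punctuation are removed as well:

import java.util.regex.Pattern

fun String.removeEmojis(): String = Pattern.compile("[^\\p{L}\\s]+")
    .matcher(this).replaceAll("")

Sample:

val result = "Hi emojis 😀 😇 😴🙏 😧 🍉 removed".removeEmojis()
output => "Hi emojis      removed" (the spaces that surrounded the emojis are kept)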
Roper answered 15/7, 2022 at 0:26 Comment(0)
A
-1
// Method to remove emoji from a string. Note: it keeps only ASCII characters,
// so all other non-ASCII text (e.g. CJK) is removed as well.
public static String remove_emoji(String text) {
    StringBuilder updated_text = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
        String ch = text.substring(i, i + 1);
        // [\\x00-\\x7F]+ matches ASCII only; if it matches, the character is not an emoji symbol
        if (ch.matches("[\\x00-\\x7F]+")) {
            updated_text.append(ch);
        }
    }
    return updated_text.toString();
}
Avidity answered 31/8, 2021 at 15:35 Comment(1)
Providing more information about why this solves the problem can be a great way to improve your answer and help the users with the same problemJulianajuliane
