How do I remove emoji characters from a string?
I've got a text input from a mobile device. It contains emoji. In C#, I have the text as

Text 🍫🌐 text

Simply put, I want the output text to be

Text text

I'm trying to just remove all such emoji from the text with a regex, except I'm not sure how to convert an emoji into its Unicode sequence. How do I do that?

edit:

I'm trying to save the user input into MySQL. It looks like MySQL's utf8 character set doesn't support every Unicode character, and the right way to fix that would be to change the schema, but I don't think that is an option for me. So I'm trying to just remove all the emoji characters before saving the text to the database.

This is my schema for the relevant column:

[image not reproduced]

I'm using Nhibernate as my ORM and the insert query generated looks like this:

Insert into `Content` (ContentTypeId, Comments, DateCreated) 
values (?p0, ?p1, ?p2);
?p0 = 4 [Type: Int32 (0)]. ?p1 = 'Text 🍫🌐 text' [Type: String (20)], ?p2 = 19/01/2015 10:38:23 [Type: DateTime (0)]

When I copy this query from logs and run it on mysql directly, I get this error:

1 warning(s): 1366 Incorrect string value: '\xF0\x9F\x98\x80 t...' for column 'Comments' at row 1   0.000 sec

Also, I've tried to convert it into encoding bytes and it doesn't really work..

[image not reproduced]

Inbred answered 19/1, 2015 at 11:31 Comment(12)
UTF-8 really should be fine here. Can you post the details of how you're currently trying to save the data, along with your schema information? – Phone
See here: gist.github.com/adamlwatson/9623703 – Wychelm
(Assuming you actually want to remove them, rather than sort your encoding) – Wychelm
@JonSkeet added the info. – Inbred
@Inbred Which version of MySQL are you running? Apparently the character set utf8mb4 should make everything tickety-boo... have a read of the answer here #24254485: "It seems that MySQL supports two forms of Unicode: ucs2, which is 16 bits per character, and utf8, up to 3 bytes per character. The bad news is that neither form is going to support plane 1 characters, which require at least 17 bits (mainly emoji). It looks like MySQL 5.5.3 and up also support utf8mb4, utf16, and utf32 and supplementary characters (read emoji)" – Trend
You haven't actually shown the code you're using. The error message doesn't seem to fit with the UTF-8 encoding for either of those values, which is odd... – Phone
@JonSkeet yea, I was testing with a few emojis so the message is for another emoji. Also, not sure what you mean by code? I'm using a regular nhibernate repository that saves the object with public virtual String Comments { get; set; } property. The insert query produced is fine, it's just that mysql db can't handle the unicode. – Inbred
@PaulZahra I don't think changing the schema is an option, but will try talk to dba about it! what I need is something like what Octopid has mentioned, but in c#, but I just can't seem to be able to regex the emojis! – Inbred
Something to be aware of from #10993421 "However, note that there are other characters in the Basic Multilingual Plane that are used as emoji by phones but which long predate emoji. For example U+2665 is the traditional Heart Suit character ♥, but it may be rendered as an emoji graphic on some devices. It's up to you whether you treat this as emoji and try to remove it." – Trend
Octopoid's gist doesn't convert them, it removes them. If you want to just remove any characters not in the BMP, that's reasonably easy. – Phone
@JonSkeet yup - I do want to just remove them! but to remove them I must regex match them and that's where I'm stuck now. – Inbred
"So convert to corresponding \uxxxx characters" is just a red herring? – Phone

Assuming you just want to remove all non-BMP characters, i.e. anything with a Unicode code point of U+10000 and higher, you can use a regex to remove any UTF-16 surrogate code units from the string. For example:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main(string[] args)
    {
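        string text = "x\U0001F310y"; // 'x', U+1F310 (a surrogate pair in UTF-16), 'y'
        Console.WriteLine(text.Length); // 4
        // \p{Cs} matches any UTF-16 surrogate code unit
        string result = Regex.Replace(text, @"\p{Cs}", "");
        Console.WriteLine(result.Length); // 2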
    }
}

Here "Cs" is the Unicode category for "surrogate".

It appears that Regex works based on UTF-16 code units rather than Unicode code points, otherwise you'd need a different approach.
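To make the code unit versus code point distinction concrete, here is a small sketch. The class name CodeUnitDemo and the helper CountCodePoints are made up for illustration; only the System.Char methods are real framework calls.

```csharp
using System;

static class CodeUnitDemo
{
    // Hypothetical helper: counts Unicode code points by treating each
    // well-formed surrogate pair as a single character.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // If a surrogate pair starts here, skip its low half.
            if (char.IsSurrogatePair(s, i))
            {
                i++;
            }
            count++;
        }
        return count;
    }

    static void Main()
    {
        string globe = "\U0001F310"; // U+1F310, one code point outside the BMP
        Console.WriteLine(globe.Length);                   // 2 (UTF-16 code units)
        Console.WriteLine(CountCodePoints(globe));         // 1 (code point)
        Console.WriteLine(char.IsHighSurrogate(globe[0])); // True
        Console.WriteLine(char.IsLowSurrogate(globe[1]));  // True
        Console.WriteLine(char.ConvertToUtf32(globe, 0).ToString("X")); // 1F310
    }
}
```

Each half of the pair is its own char, which is why a per-code-unit match on \p{Cs} ends up removing the whole character.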

Note that there are non-BMP characters other than emoji, but I suspect you'll find they'll have the same problem when you try to store them.

Additionally, note that this won't remove emoji in the BMP, such as U+2764 (red heart). You can use the above as an example of how to remove characters in specific Unicode categories: the category for U+2764 is "other symbol", for example. Whether you want to remove all "other symbols" is a different matter.
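If you want to check which category a character falls into before deciding what to strip, something along these lines should work. It's only a sketch: CategoryDemo, CategoryOf and RemoveSymbolsAndSurrogates are names invented for this example, and the wider pattern removes more than emoji.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class CategoryDemo
{
    // Hypothetical helper: Unicode general category of the UTF-16 code unit at index.
    public static UnicodeCategory CategoryOf(string s, int index)
    {
        return CharUnicodeInfo.GetUnicodeCategory(s[index]);
    }

    // Removes surrogates (Cs) and "other symbols" (So). This is broader than
    // "emoji": it also drops symbols such as the copyright sign.
    public static string RemoveSymbolsAndSurrogates(string s)
    {
        return Regex.Replace(s, @"[\p{So}\p{Cs}]", "");
    }

    static void Main()
    {
        Console.WriteLine(CategoryOf("\u2764", 0));     // OtherSymbol (heart)
        Console.WriteLine(CategoryOf("\u00A9", 0));     // OtherSymbol (copyright sign)
        Console.WriteLine(CategoryOf("\U0001F310", 0)); // Surrogate (first half of the pair)
        Console.WriteLine(RemoveSymbolsAndSurrogates("a\u2764b\u00A9c\U0001F310d")); // abcd
    }
}
```

The heart and the copyright sign land in the same category, so a category-based regex can't tell them apart.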

But if you're really just interested in removing surrogate pairs because they can't be stored properly, the above should be fine.
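For what it's worth, the same filtering can be done without a regex. This is an alternative sketch rather than the approach shown above; SurrogateStripper and RemoveNonBmp are hypothetical names, and it assumes dropping every surrogate code unit is acceptable for your data.

```csharp
using System;
using System.Text;

static class SurrogateStripper
{
    // Hypothetical helper: drops every UTF-16 surrogate code unit,
    // which removes all characters outside the BMP.
    public static string RemoveNonBmp(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (!char.IsSurrogate(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        // The sample text from the question: U+1F36B and U+1F310 between two words.
        string cleaned = RemoveNonBmp("Text \U0001F36B\U0001F310 text");
        Console.WriteLine(cleaned);        // Text  text (both spaces remain)
        Console.WriteLine(cleaned.Length); // 10
    }
}
```

Note that the spaces on either side of the removed emoji are kept, so you may want to collapse repeated whitespace afterwards if you need exactly "Text text".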

Phone answered 19/1, 2015 at 13:36 Comment(10)
Hi, I made the question to describe what I thought was my problem.. but I tried out your answer and it turns out I don't actually need to convert them.. So I have edited the question now! i.imgur.com/NoQfxud.png Thank you! – Inbred
@LocustHorde: So long as you're aware that you're just throwing away bits of the user's input... – Phone
Yea! this is a temporary solution (hopefully short term!) – Inbred
Hi @JonSkeet, I'm trying to use your Regex to detect if emojis are included in a string (pretty much the exact same code). For some reason \p{Cs} does not catch all emojis. Do you know anything about this by any chance? I've tried about 30 of them and one or two were not detected. I'm assuming they're not in the range of that regex, but I'd like your expert opinion since I know nothing about surrogates and very little about chars in general – Adsorbate
@GilSand: Well, did you look at what Unicode categories those characters are in? It's probably best to ask a new question with a complete example, rather than "one or two of them" (leaving us guessing which). We can then look at what's going on much more easily. – Phone
@JonSkeet You're right. Here's a link to the new question for you or future travelers : https://mcmap.net/q/244628/-detecting-all-emojis – Adsorbate
This won't remove all emojis because some emojis such as ❤ are in the BMP. – Fancywork
@Clement: Thanks for pointing that out; I've added some more text at the end. – Phone
Regex.Replace(str, @"[\p{So}\p{Cs}]", string.Empty) seems to remove additional emojis that are in the BMP – Fancywork
@Clement: Yes, but it will also remove "other symbols" that aren't emojis... e.g. the copyright sign ©. If I were only trying to remove emoji, I wouldn't expect the copyright sign to be removed. – Phone
