Replace emoji Unicode symbols using a regexp in JavaScript
A

9

15

As you all know, emoji symbols are encoded in up to 3 or 4 bytes, so one emoji may occupy 2 code units in my string. For example, '😁wew😁'.length = 7. I want to find those symbols in my text and replace each one with a value that depends on its code. Reading SO, I came across the XRegExp library with its Unicode plugin, but I have not found a way to make it work.

var str = '😁wew😁'; // \u1F601 symbol
var reg = XRegExp('[\u1F601-\u1F64F]', 'g'); //  /[ὠ1-ὤF]/g - doesn't make a lot of sense
//var reg = XRegExp('[\uD83D\uDE01-\uD83D\uDE4F]', 'g'); //Range out of order in character class
//var reg = XRegExp('\\p{L}', 'g'); //doesn't match my symbols
console.log(XRegExp.replace(str, reg, function(match){
   return encodeURIComponent(match); // here I want something like %F0%9F%98%84, so that I can map this value to any replacement I want
}));

jsfiddle

I really don't want to brute-force the string looking for sequences of characters from my range. Could someone help me find a way to do this with regexps?

EDITED: I just came up with the idea of enumerating all the emoji symbols. That is better than brute force, but I am still looking for a better idea.

var reg = XRegExp('\uD83D\uDE01|\uD83D\uDE4F|...','g');
Aggravate answered 25/2, 2014 at 6:21 Comment(5)
Why are you trying to match the bytes rather than the codepoints? The example you have using '[\u1F601-\u1F64F]' is the correct way to match these points (although the block is U+1F300-U+1F5FF). – Modena
Not only bytes, I tried many ways, but maybe I did it wrong. What would be the regexp with those codepoints? XRegExp('[\u1F300-\u1F5FF]', 'g');? – Aggravate
@一二三 JavaScript does not support characters beyond U+FFFF natively. \u1F601 in a JavaScript string encodes two characters, U+1F60 followed by ASCII '1'. There's no way to use U+1F601 in a character class. – Exaggerate
Regex /[\uD800-\uDBFF][\uDC00-\uDFFF]/g solved my problem. It includes not only emojis but also special characters. Referred #3745221 – Decibel
tempted to close as dup of https://mcmap.net/q/235982/-how-to-detect-emoji-using-javascript/11107541 – Hagioscope
I
13

The \u.... notation has four hex digits, no less, no more, so it can only represent code points up to U+FFFF. Unicode characters above that are represented as pairs of surrogate code points.
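
To make the surrogate-pair encoding concrete, here is a minimal sketch of the UTF-16 arithmetic. The helper name `toSurrogatePair` is made up for this illustration and is not part of any library.

```javascript
// Hypothetical helper (not from the answer): split a code point above U+FFFF
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // 20 bits remain after the subtraction
  const high = 0xD800 + (offset >> 10);   // top 10 bits select the high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits select the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F601);
console.log(high.toString(16), low.toString(16)); // d83d de01

// '\u1F601' is parsed as '\u1F60' followed by the digit '1'.
console.log('\u1F601'.length);             // 2
console.log('\u1F601' === '\u1F60' + '1'); // true
```

So the character the question calls \u1F601 has to be written as '\uD83D\uDE01' in a plain string literal.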

So an indirect approach is needed; cf. JavaScript strings outside of the BMP.

For example, you could look for code points in the range [\uD800-\uDBFF] (high surrogates), and when you find one, check that the next code point in the string is in the range [\uDC00-\uDFFF] (if not, there is a serious data error), interpret the two as a Unicode character, and replace them by whatever you wish to put there. This looks like a job for a simple loop through the string, rather than a regular expression.
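
The loop described above might look like the following sketch. The function name `replaceAstral` and its callback signature are assumptions made for this example, and lone surrogates are passed through unchanged rather than reported as a data error.

```javascript
// Sketch only: walk the string by UTF-16 code unit and hand every
// well-formed surrogate pair to a caller-supplied replacer.
function replaceAstral(str, replacer) {
  let out = '';
  for (let i = 0; i < str.length; i++) {
    const hi = str.charCodeAt(i);
    const lo = str.charCodeAt(i + 1); // NaN past the end, so the checks below fail
    const isPair = hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF;
    if (isPair) {
      // Combine the two code units into one code point.
      const codePoint = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
      out += replacer(str.slice(i, i + 2), codePoint);
      i++; // skip the low surrogate
    } else {
      out += str[i]; // BMP character (or a lone surrogate, left untouched)
    }
  }
  return out;
}

const replaced = replaceAstral(
  '\uD83D\uDE01wew\uD83D\uDE01',
  (pair, codePoint) => '[U+' + codePoint.toString(16).toUpperCase() + ']'
);
console.log(replaced); // [U+1F601]wew[U+1F601]
```

The replacer receives both the matched pair and the decoded code point, so the replacement can depend on the code, as the question asks.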

Insist answered 25/2, 2014 at 7:39 Comment(1)
Thx. But that is almost what I arrived at in the edited version of my question. I really want to avoid loops, because I process my string every time it changes. But you pushed me toward the idea of using XRegExp('[\uD800-\uDBFF][\uDC00-\uDFFF]','g'). That should be enough for me, I guess. – Aggravate
S
11

This is somewhat old, but I was looking into this problem and it seems Bradley Momberger has posted a nice solution to it here: http://airhadoken.github.io/2015/04/22/javascript-string-handling-emoji.html

The regex he proposes is:

/[\uD800-\uDFFF]./ // This matches emoji

This regex matches the head surrogate, which is used by emojis, and the character following the head surrogate (which is assumed to be the tail surrogate). Thus, all emojis should be matched correctly, and with

.replace(/[\uD800-\uDFFF]./g,'')

you should be able to remove all emojis.

Edit: Better regex found. The above regex misses some emojis.

However, there is a Reddit post with a version for which I cannot find an emoji that escapes the rule. The Reddit thread is here: https://www.reddit.com/r/tasker/comments/4vhf2f/how_to_regex_emojis_in_tasker_for_search_match_or/ And the regex is:

/[\uD83C-\uDBFF\uDC00-\uDFFF]+/

To match all occurrences, use the g modifier:

/[\uD83C-\uDBFF\uDC00-\uDFFF]+/g

Second Edit: As CodeToad pointed out correctly, ✨ is not recognized by the above regex, because it's in the dingbats block (thanks to air_hadoken).

The lodash library came up with an excellent Emoji Regex block:

(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|\ud83c[\udffb-\udfff])?(?:\u200d(?:[^\ud800-\udfff]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|\ud83c[\udffb-\udfff])?)*

Kevin Scott nicely summarized what this regex covers in his blog post. Spoiler: it includes dingbats 🎉
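
As a rough usage sketch, the lodash pattern quoted above can be dropped into `match` and `replace` directly. The pattern is copied verbatim from this answer with only the g flag added; the variable names and the sample string are assumptions for illustration.

```javascript
// Pattern copied from the answer above, with the g flag added.
const emojiPattern = /(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|\ud83c[\udffb-\udfff])?(?:\u200d(?:[^\ud800-\udfff]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|\ud83c[\udffb-\udfff])?)*/g;

// 'a', U+1F601 (a surrogate pair), 'b', U+2728 (sparkles, a dingbat), 'c'
const sample = 'a\uD83D\uDE01b\u2728c';

const found = sample.match(emojiPattern);
console.log(found.length);                     // 2
console.log(sample.replace(emojiPattern, '')); // abc
```

Both the surrogate-pair emoji and the dingbat are matched here, which is the case the earlier regexes in this answer missed.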

Sava answered 23/11, 2016 at 11:34 Comment(3)
This is the best one I have tested so far, though it misses this emoji: ✨ – Maltase
@Maltase ✨ ("sparkles") is from the dingbats block, which can be represented in UTF-16 without a surrogate pair. If you wanted to catch those as well, you'd need to check for /[\u2700-\u27BF][\uFE0E-\uFE0F]?/ (the latter range is for a possible variant selector) – Ec
Edited to new Regex from lodash, which also includes the dingbats block. – Sava
E
8

Maybe take a look at this article: http://crocodillon.com/blog/parsing-emoji-unicode-in-javascript

The emoji code points run from U+1F600 to U+1F64F.

Translated to JavaScript's UTF-16, that is \ud83d\ude00 to \ud83d\ude4f.

The first code unit is always \ud83d.

So the regex is:

/\ud83d[\ude00-\ude4f]/g
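
A short sketch of how this regex could serve the original question, where each match is mapped to a replacement through its percent-encoded form. The `replacements` table, its `:grin:` entry, and the function name `mapEmoticons` are hypothetical.

```javascript
// Regex from the answer above: U+1F600..U+1F64F written as surrogate pairs.
const emoticons = /\ud83d[\ude00-\ude4f]/g;

// Hypothetical lookup table keyed by the percent-encoded UTF-8 form.
const replacements = { '%F0%9F%98%81': ':grin:' };

function mapEmoticons(text) {
  return text.replace(emoticons, (match) => {
    const key = encodeURIComponent(match);
    // Fall back to the encoded key when the table has no entry.
    return Object.prototype.hasOwnProperty.call(replacements, key)
      ? replacements[key]
      : key;
  });
}

console.log(mapEmoticons('\uD83D\uDE01wew\uD83D\uDE04')); // :grin:wew%F0%9F%98%84
```

`encodeURIComponent` is only safe here because the regex always hands it a complete surrogate pair; a lone surrogate would make it throw a URIError.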

Hope this helps.

Extravert answered 2/7, 2015 at 2:48 Comment(1)
That works nicely at the end of 2021, thanks – Insula
L
5
  1. /\ud83d[\ude00-\ude4f]/g

This does not cover all emojis, for example: 👿 👹 👺 💀 👻 👽 🤖 💩. See http://getemoji.com/ and try your regex at https://regex101.com/

  2. /[\uD83C-\uDBFF\uDC00-\uDFFF]+/g

This does not cover all emojis either, for example: ⛑ ☕️ ☁️☄️ ☀️☃️ ⛄️ ❄️ ☹️☺️⛩⛱™️ ©️ ®️ 〰️ ➰ ➿

  3. Even this regex does not let you remove all emojis, for example 🖥 🖨 🖱 🖲 🕹 🗜:

https://github.com/nizaroni/emoji-strip/blob/master/dist/emoji-strip.js#L79

So, can you say why you think this regex is a bad way to remove all exotic characters and emojis?

/[\u1000-\uFFFF]+/g
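
One way to see what this broad class removes in excess is to run it on ordinary non-emoji text. A small sketch; the helper name `strip` and the sample strings are made up for illustration.

```javascript
// The broad class proposed above.
const broad = /[\u1000-\uFFFF]+/g;
const strip = (s) => s.replace(broad, '');

// Emoji are removed, because both surrogate halves fall inside the range.
console.log(JSON.stringify(strip('\uD83D\uDE01 ok')));      // " ok"

// The euro sign U+20AC is lost.
console.log(JSON.stringify(strip('10 \u20AC')));            // "10 "

// The typographic apostrophe U+2019 is lost, while U+00E9 survives.
console.log(strip('l\u2019\u00E9t\u00E9') === 'l\u00E9t\u00E9'); // true

// CJK text is lost entirely.
console.log(JSON.stringify(strip('\u65E5\u672C\u8A9E')));   // ""

// The copyright sign U+00A9 survives; only its variation selector U+FE0F goes.
console.log(strip('\u00A9\uFE0F') === '\u00A9');            // true
```

So the class is both too wide (currency signs, typographic punctuation, whole scripts) and too narrow (emoji-styled characters below U+1000 such as U+00A9 and U+00AE are kept).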
Landwehr answered 2/3, 2018 at 15:57 Comment(0)
L
2

To remove all possible emojis:

new RegExp('[\u1000-\uFFFF]+', 'g');
Landwehr answered 1/3, 2018 at 12:42 Comment(2)
Can you say why the -1? – Landwehr
This regex is useful for removing all exotic characters such as emojis, including those of foreign languages. I'm a French developer, and I only want to keep UTF-8 characters in web text. What characters could this regex remove in excess? – Landwehr
L
0

The regex pattern below worked for me in Java.

"[\ud83c\udc00-\ud83c\udfff]|[\ud83d\udc00-\ud83d\udfff]|[\u2600-\u27ff]"

Since a Java String uses UTF-16 encoding and emojis also lie above 0xFFFF, this regex pattern uses surrogate pairs to identify emojis.
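
For comparison with the JavaScript side of this thread: the same pattern is rejected by a plain JavaScript regex, but it carries over once the `u` flag is set. This is a sketch that assumes an ES2015 or later engine; the variable names and the sample string are illustrative.

```javascript
// Without the u flag each \uXXXX escape is a separate class member, so
// \udc00-\ud83c is a descending range and the pattern is rejected.
let threw = false;
try {
  new RegExp('[\\ud83c\\udc00-\\ud83c\\udfff]');
} catch (e) {
  threw = e instanceof SyntaxError;
}
console.log(threw); // true

// With the u flag a surrogate pair inside a class is read as one code point.
const javaStyle = /[\ud83c\udc00-\ud83c\udfff]|[\ud83d\udc00-\ud83d\udfff]|[\u2600-\u27ff]/gu;

// The same three ranges written as code points.
const byCodePoint = /[\u{1F000}-\u{1F3FF}]|[\u{1F400}-\u{1F7FF}]|[\u2600-\u27FF]/gu;

const sample = 'a\uD83D\uDE01b\u2728c'; // U+1F601 and U+2728 between letters
console.log(sample.replace(javaStyle, ''));   // abc
console.log(sample.replace(byCodePoint, '')); // abc
```

This also explains the "Range out of order in character class" error quoted in the question.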

Lehrer answered 6/9, 2016 at 11:45 Comment(0)
L
0

For fun: a solution that removes special characters without using a regexp

const str = "abcdefgehijkz Раз, два три! 1234567809 -ab A Z & é è Ö â 😀 😁 😂 🤣 😃 😄 😅 😆 😉 😊 😋 😎 😍 😘 🥰 😗 😙 😚 ☺️ 🙂 🤗 🤩 🤔 🤨 😐 😑 😶 🙄 😏 😣 😥 😮 🤐 😯 😪 😫 😴 😌 😛 😜 😝 🤤 😒 😓 😔 😕 🙃 🤑 😲 ☹️ 🙁 😖 😞 😟 😤 😢 😭 😦 😧 😨 😩 🤯 😬 😰 😱 🥵 🥶 😳 🤪 😵 😡 😠 🤬 😷 🤒 🤕 🤢 🤮 🤧 😇 🤠 🤡 🥳 🥴 🥺 🤥 🤫 🤭 🧐 🤓 😈 👿 👹 👺 💀 👻 👽 🤖 💩 😺 😸 😹 😻 😼 😽 🙀 😿 😾-axxb-"


/********* with regExp ***********/
let startTime = new Date().getTime();
let resp = str.replace(new RegExp('[\u00FF-\uFFFF]+','g'), '');
console.log(resp);
console.log(new Date().getTime() - startTime);


/********* without regExp ***********/
startTime = new Date().getTime();
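resp = Array.from(str, x => {
  // Array.from iterates by code point, so x may be a whole surrogate pair;
  // charCodeAt(0) then returns its high surrogate, which is >= 0xD800.
  const code = x.charCodeAt(0);
  // Keep only characters below U+00FF, mirroring the regexp version above.
  return code < 0x00ff ? x : '';
}).join('');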
console.log(resp);
console.log(new Date().getTime() - startTime);
Landwehr answered 12/5, 2020 at 18:43 Comment(0)
F
-3

Emojis are in the range U+1F600 to U+1F64F.

You can use this line in your script before sending the text as JSON:

text.replace(/[\u1F60-\u1F64]|[\u2702-\u27B0]|[\u1F68-\u1F6C]|[\u1F30-\u1F70]|[\u2600-\u26ff]/g, "");
Felicitous answered 16/3, 2014 at 11:13 Comment(0)
A
-3

Maybe you should use replace this way?

reg = str.replace(new RegExp('😊','g'),'');

Try out https://github.com/iLeonidze/emoji.js

Adversaria answered 23/4, 2014 at 4:26 Comment(1)
The correct solution should handle the character range of emojis, not just one. – Tuesday
