How can I split a string containing emoji into an array?

7

51

I want to take a string of emoji and do something with the individual characters.

In JavaScript, "😴😄😃⛔🎠🚓🚇".length == 13 because "⛔" has length 1 and the rest have length 2. So we can't do

var string = "😴😄😃⛔🎠🚓🚇";
var s = string.split("");
console.log(s);
Fao answered 2/7, 2014 at 12:59 Comment(1)
mathiasbynens.be/notes/… – Bezant
D
11

The Grapheme Splitter library by Orlin Georgiev is pretty amazing.

However, it hasn't been updated in a while, and presently (Sep 2020) it only supports Unicode 10 and below.

For an updated version of Grapheme Splitter, built in TypeScript with Unicode 13 support, have a look at: https://github.com/flmnt/graphemer

Here is a quick example:

import Graphemer from 'graphemer';

const splitter = new Graphemer();

const string = "😴😄😃⛔🎠🚓🚇";

splitter.countGraphemes(string); // returns 7

splitter.splitGraphemes(string); // returns array of characters

The library also works with the latest emojis.

For example "👩🏻‍🦰".length === 7 but splitter.countGraphemes("👩🏻‍🦰") === 1.
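
The counts involved here (UTF-16 code units versus code points) can be inspected with built-ins alone. The following is a minimal sketch, not part of Graphemer; the emoji is written with escapes so that the invisible zero-width joiner shows up in the source, and the variable names are only illustrative.

```javascript
// "👩🏻‍🦰" spelled out: woman + light skin tone + zero-width joiner + red hair.
const redHairedWoman = "\u{1F469}\u{1F3FB}\u200D\u{1F9B0}";

// .length counts UTF-16 code units: the three astral code points take two
// units each, and the zero-width joiner takes one.
const codeUnits = redHairedWoman.length;

// The string iterator walks code points, so spreading gives one entry each.
const codePoints = [...redHairedWoman].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase()
);

console.log(codeUnits);  // 7
console.log(codePoints); // ["U+1F469", "U+1F3FB", "U+200D", "U+1F9B0"]
```

Four code points make up what a reader sees as one character, which is why a grapheme-aware splitter is needed to get a count of 1.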

Full disclosure: I created the library and did the work to update to Unicode 13. The API is identical to Grapheme Splitter and is entirely based on that work, just updated to the latest version of Unicode as the original library hasn't been updated for a couple of years and seems to be no longer maintained.

Disproportionate answered 14/9, 2020 at 17:22 Comment(1)
Until Intl.Segmenter gets Firefox support (caniuse.com/mdn-javascript_builtins_intl_segmenter), I think that this is the best answer. – Fao
F
34

JavaScript ES6 has a solution for a real split:

[..."😴😄😃⛔🎠🚓🚇"] // ["😴", "😄", "😃", "⛔", "🎠", "🚓", "🚇"]

Yay? Except for the fact that when you run this through your transpiler, it might not work (see @brainkim's comment). It only works when natively run on an ES6-compliant browser. Luckily this encompasses most browsers (Safari, Chrome, FF), but if you're looking for high browser compatibility this is not the solution for you.
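
Under the hood, the spread syntax uses the string iterator, which steps through code points rather than UTF-16 code units. A rough hand-written equivalent is sketched below; `splitCodePoints` is a hypothetical name, and the sketch shares the limitation that compound emoji (joiner sequences, skin tones) are still broken apart.

```javascript
// Hypothetical helper: split a string into code points by hand.
function splitCodePoints(str) {
  const result = [];
  let i = 0;
  while (i < str.length) {
    // codePointAt reads a full surrogate pair when one starts at index i.
    const ch = String.fromCodePoint(str.codePointAt(i));
    result.push(ch);
    // Advance by 2 code units for astral characters, 1 otherwise.
    i += ch.length;
  }
  return result;
}

const emoji = "😴😄😃⛔🎠🚓🚇";
console.log(splitCodePoints(emoji).length);                           // 7
console.log(splitCodePoints(emoji).join("") === [...emoji].join("")); // true
```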

Filberte answered 31/5, 2016 at 2:22 Comment(5)
Babel with es6 settings will transpile this into a call to String's iterator function so it does work in some transpilers. – Dreher
@Dreher I specified that in the answer. It is the fault of the transpiler for not meeting the standard on this – Filberte
Ah, I'm saying it sometimes works. "when you run this through your transpiler, it won't work" implies it never works. It's dependent on what specific emojis are in the string, the transpiler you're using, etc. – Dreher
[...'👨‍👨‍👧‍👧'] // ["👨", "‍", "👨", "‍", "👧", "‍", "👧"] – Mccullough
[..."👦🏾"] // ["👦", "🏾"] – Outstation
S
28

With the upcoming Intl.Segmenter, you can do this:

const splitEmoji = (string) => [...new Intl.Segmenter().segment(string)].map(x => x.segment)

splitEmoji("😴😄😃⛔🎠🚓🚇") // ['😴', '😄', '😃', '⛔', '🎠', '🚓', '🚇']

This also solves the problem with "👨‍👨‍👧‍👧" and "👦🏾".

splitEmoji("👨‍👨‍👧‍👧👦🏾") // ['👨‍👨‍👧‍👧', '👦🏾']

According to CanIUse, apart from IE and Firefox, this is supported by 91.23% of users globally, as of writing.
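
Where support cannot be assumed, a feature check with a fallback may be useful. The following is only a sketch: `splitGraphemes` is a hypothetical name, passing `undefined` as the locale and spelling out `granularity: "grapheme"` are my choices (grapheme is the default granularity), and the fallback merely splits by code point, so it still breaks compound emoji.

```javascript
// Hypothetical helper: prefer Intl.Segmenter, fall back to code points.
function splitGraphemes(str) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    // segment() returns an iterable of { segment, index, input } objects.
    return Array.from(segmenter.segment(str), (part) => part.segment);
  }
  // Fallback: code points only, so ZWJ sequences and skin tones are split.
  return Array.from(str);
}

console.log(splitGraphemes("😴😄😃⛔🎠🚓🚇").length); // 7
```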

Until Firefox gets support, as mentioned in Matt Davies' answer, Graphemer is the best solution:

let Graphemer = await import("https://cdn.jsdelivr.net/npm/[email protected]/+esm").then(m => m.default.default);
let splitter = new Graphemer();
let graphemes = splitter.splitGraphemes("👨‍👨‍👧‍👧👦🏾"); // ['👨‍👨‍👧‍👧', '👦🏾']
Suppletion answered 25/3, 2022 at 15:22 Comment(3)
looks promising – Clercq
Awesome. This was a drop in replacement swapping let arr = str.split("") with let arr = splitEmoji(str) to split characters in text plus all emoji. – Nunnally
Firefox 125 (released 16.04.2024) has just added support for it – Christophany
F
27

Edit: see Orlin Georgiev's answer for a proper solution in a library: https://github.com/orling/grapheme-splitter


Thanks to this answer I made a function that takes a string and returns an array of emoji:

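var emojiStringToArray = function (str) {
  // Split on surrogate pairs; the capture group keeps each pair in the result.
  var split = str.split(/([\uD800-\uDBFF][\uDC00-\uDFFF])/);
  var arr = [];
  for (var i = 0; i < split.length; i++) {
    var char = split[i];
    // Skip the empty strings that split() leaves between adjacent pairs.
    if (char !== "") {
      arr.push(char);
    }
  }
  return arr;
};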

So

emojiStringToArray("😴😄😃⛔🎠🚓🚇")
// => Array [ "😴", "😄", "😃", "⛔", "🎠", "🚓", "🚇" ]
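
For strings that mix emoji with ordinary characters, a variant of the same idea matches either a surrogate pair or any single code unit, so every character becomes its own array entry. This is a sketch in ES5-compatible syntax; `splitBySurrogatePairs` is a hypothetical name, and like the function above it does not keep zero-width-joiner sequences or variation selectors together.

```javascript
// Hypothetical helper: one array entry per code point, without ES6 features.
function splitBySurrogatePairs(str) {
  // Try a high+low surrogate pair first; otherwise take any single code unit.
  // The g flag makes match() return every match instead of only the first.
  return str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g) || [];
}

console.log(splitBySurrogatePairs("😴😄😃⛔🎠🚓🚇").length); // 7
console.log(splitBySurrogatePairs("a😴b"));                 // ["a", "😴", "b"]
```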
Fao answered 2/7, 2014 at 12:59 Comment(4)
noting that this won't work for emoji that use zero-width joiners, variation selectors, or the keycap emoji which are digit + keycap + variation selector – Westbound
Just use the match method str.match(/([\uD800-\uDBFF][\uDC00-\uDFFF])/); and it'll return the emojis – Jaffna
I tried your function and it works for me, but look at this: emojiStringToArray("😴😄😃⛔🎠🚓🚇❀️❀️❀️❀️❀️❀️") // => Array [ "😴", "😄", "😃", "⛔", "🎠", "🚓", "🚇", "❀️❀️❀️❀️❀️❀️" ] Do you know how to solve this error? – Indenture
emojiStringToArray( '👨‍👨‍👧‍👧' ) // ["👨", "‍", "👨", "‍", "👧", "‍", "👧"] – Mccullough
P
22

The grapheme-splitter library does just that. It is fully compatible even with old browsers and works not just with emoji but with all sorts of exotic characters: https://github.com/orling/grapheme-splitter You are likely to miss edge cases in any home-brew solution. This one is actually based on the UAX #29 Unicode standard.

Pegboard answered 16/3, 2017 at 21:50 Comment(0)
P
13

The modern / proper way to split a Unicode string into code points is using Array.from(str) instead of str.split(''), which cuts between UTF-16 code units.
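
The difference is easy to see by running the two calls on the string from the question. A small sketch (the variable names are illustrative):

```javascript
const str = "😴😄😃⛔🎠🚓🚇";

// split("") cuts between UTF-16 code units, leaving lone surrogates.
const units = str.split("");

// Array.from uses the string iterator, which yields whole code points.
const points = Array.from(str);

console.log(units.length);  // 13
console.log(points.length); // 7
console.log(points[3]);     // "⛔"
```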

Paley answered 14/4, 2020 at 14:45 Comment(2)
This is awesome. By the way, MDN provides a polyfill for this as well. See: developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/… – Churchwoman
Sadly, this doesn't work as expected with compound ones: Array.from('👨‍👨‍👧‍👧'); // [ "👨", "‍", "👨", "‍", "👧", "‍", "👧" ] Array.from('👦🏾'); // [ "👦", "🏾" ] – Fao
G
8

It can be done using the u flag of a regular expression. The regular expression is:

/.*?/u

Used with split, this cuts the string at every match of zero or more characters, matched lazily, that may or may not be emoji but cannot be line breaks.

  • Lazily, i.e. as few characters as possible: ? (so each match has zero characters)
  • Zero or more: *
  • Any character except a line break: .
  • Treat surrogate pairs as single characters, so emoji stay whole: /u

Because the question mark makes the match lazy, every match is empty and the string is cut between every pair of characters. Without it, /.*/u matches all characters up to the next line break.

var string = "😴😄😃⛔🎠🚓🚇"
var c = string.split(/.*?/u)
console.log(c)
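
Since the lazy pattern always matches the empty string, an empty pattern with the u flag gives the same result; it is the flag, not the dot, that keeps surrogate pairs together. A quick sketch comparing the variants (variable names are illustrative):

```javascript
const str = "😴😄😃⛔🎠🚓🚇";

// Lazy pattern from the answer: every match is empty, so split cuts
// between characters, and the u flag makes it step by code point.
const lazy = str.split(/.*?/u);

// An empty non-capturing group with the u flag behaves the same way.
const empty = str.split(/(?:)/u);

// Without the u flag, the cuts fall between UTF-16 code units instead.
const noFlag = str.split(/.*?/);

console.log(lazy.length);   // 7
console.log(empty.length);  // 7
console.log(noFlag.length); // 13
```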
Gadroon answered 13/7, 2020 at 3:36 Comment(1)
'👦🏾'.split(/.*?/u); // [ "👦", "🏾" ] – Fao
