Splitting emoji, safely
G

3

19

I'm attempting to split a string into single words/chars, but I'm having trouble when it comes to emoji.

First of all, I can't simply split the string on an empty string, because emoji generally have a length >= 2.

"😎".split("")
["οΏ½", "οΏ½"]

I found an emoji regex that mostly works, but now I am seeing some strange flesh-colored blocks. I even see them show up on twitter in some cases.


Here's a pen that illustrates the problem with the fleshy blocks http://codepen.io/positlabs/pen/QyEOEG?editors=011


UPDATE -----------

Trying out spliddit, and I'm still seeing the issue with the skin tone characters. Is there some way to glue them back together?

http://codepen.io/positlabs/pen/rxLqwL?editors=001

Gharry answered 22/12, 2015 at 18:3 Comment(0)
B
13

JavaScript's strings are UTF-16, so your emoji is internally represented as two code units:

> "\ud83d\ude0e" === "😎"
true

The String.prototype.split function doesn't care about UTF-16 surrogate pairs, so split("") naively cuts the string between individual code units and breaks your emoji. String indexing, length, and split("") all operate on code units rather than on whole characters.

There's no easy way to deal with it using split alone. A library like spliddit handles the surrogate pairs properly.
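
To make the code-unit issue concrete, here is a minimal sketch using only built-in JavaScript (ES2015 and later). It is an illustration rather than a full solution: iterating a string walks code points, which keeps a surrogate pair together but still separates skin tone modifiers from their base emoji. The variable names are made up for the example.

```javascript
const emoji = "😎";

// .length and split("") count UTF-16 code units, so the pair is cut in half.
console.log(emoji.length);            // 2
console.log(emoji.split("").length);  // 2 (two lone surrogates)

// The string iterator (spread, Array.from, for...of) walks code points instead.
const codePoints = [...emoji];
console.log(codePoints.length);                  // 1
console.log(emoji.codePointAt(0).toString(16));  // "1f60e"

// Code points are still not user-perceived characters: a skin tone
// modifier is its own code point and gets separated from its base.
const thumbsUp = "\u{1F44D}\u{1F3FD}"; // thumbs up + medium skin tone
const thumbParts = [...thumbsUp];
console.log(thumbParts.length);       // 2
```

This is why code-point splitting alone leaves the loose skin tone blocks described in the question.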

I'm not 100% familiar with the terminology, so please edit my answer as needed.

Bullington answered 22/12, 2015 at 18:21 Comment(5)
Ok, spliddit is nice, but it still fails to re-combine the skin tone characters. I've made a new pen, and will update my question. – Gharry
@positlabs: I don't have time to check it out now, but I'm pretty sure it's codepen acting up. Try deleting all but the flag and the arms and try deleting one of them: codepen.io/anon/pen/NxrOoW?editors=001 – Bullington
@positlabs: Actually, it's just Chrome. My above example works with both Safari and Firefox. Probably a bug. I'll see if there's some workaround. – Bullington
Aha, you're right! It's totally Chrome's fault. I suppose I will just delete those characters, for now. – Gharry
"★★★★★☆☆☆☆☆".length // 10 "👍👍👍👍👍👎👎👎👎👎".length // 20 – Morven
B
3

spliddit currently can't split, for example, this Hindi text into its 5 characters correctly: "अनुच्छेद"

You need the grapheme-splitter library (https://github.com/orling/grapheme-splitter). It implements the grapheme cluster rules of Unicode Standard Annex #29 (UAX #29) and will split even the most exotic letters; emoji are just one of many use cases.
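
As a hedged alternative that was not part of the original answer: newer JavaScript engines ship `Intl.Segmenter`, which segments text by grapheme cluster without a library. The sketch below assumes a runtime that supports it (recent Node.js and current browsers). The helper name `splitByGrapheme` is made up for illustration, and results for some rarer sequences can vary with the Unicode data bundled in the engine.

```javascript
// Hypothetical helper: split a string into grapheme clusters using the
// built-in Intl.Segmenter (assumes the runtime provides it).
function splitByGrapheme(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  // Each item yielded by segment() has a .segment string holding one cluster.
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

// Thumbs up followed by a medium skin tone modifier stays in one piece.
const tonedThumb = "\u{1F44D}\u{1F3FD}";
console.log(splitByGrapheme(tonedThumb).length); // 1

// Surrogate pairs are kept together as well.
console.log(splitByGrapheme("a😎b")); // ["a", "😎", "b"]
```

If the target browsers lack `Intl.Segmenter`, a library such as grapheme-splitter remains the safer choice.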

Ballman answered 16/3, 2017 at 22:6 Comment(1)
Seconding grapheme-splitter, which helped me solve a similar problem with complex emoji... even keeping an emoji grapheme as complex as 🚣🏽‍♀️ ("Woman Rowing Boat: Medium Skin Tone") together, without splitting out the skin tone or gender marker. – Daguerreotype
C
0

Here's a quick example showing how to use the Grapheme Splitter library mentioned in Orlin's answer:

<script src="https://cdn.jsdelivr.net/npm/grapheme-splitter/index.min.js"></script>
<script>
  let splitter = new GraphemeSplitter();
  console.log(splitter.splitGraphemes("🌷👨🏿🏳️‍🌈")); // ['🌷', '👨🏿', '🏳️‍🌈']
  
  // Compare above output to:
  console.log([..."🌷👨🏿🏳️‍🌈"]);                 // ['🌷', '👨', '🏿', '🏳', '️', '‍', '🌈']
  // Note: split("") is needed here; split() with no argument returns the whole string as one element.
  console.log("🌷👨🏿🏳️‍🌈".split(""));            // ['\uD83C', '\uDF37', '\uD83D', '\uDC68', '\uD83C', '\uDFFF', '\uD83C', '\uDFF3', '️', '‍', '\uD83C', '\uDF08']
  console.log("🌷👨🏿🏳️‍🌈".match(/\p{Emoji}/gu)); // ['🌷', '👨', '🏿', '🏳', '🌈']
</script>

Example: https://jsbin.com/zinegateyi/edit?html,output

This library works great for my purposes.

(Note: I unfortunately couldn't edit this into Orlin's answer due to suggested edit queue being full.)

Ceporah answered 23/6, 2022 at 10:5 Comment(0)
