JavaScript substring without splitting emoji
A

3

13

In my JavaScript I am trying to substring() text, which generally works but unfortunately decapitates emoji.

usaText = "A๐Ÿ‡บ๐Ÿ‡ธZ"
splitText = usaText.substring(0,2) //"A๏ฟฝ"
splitText = usaText.substring(0,3) //"A๐Ÿ‡บ"
splitText = usaText.substring(0,4) //"A๐Ÿ‡บ๏ฟฝ"
splitText = usaText.substring(0,5) //"A๐Ÿ‡บ๐Ÿ‡ธ"

Is there a way to use substring without breaking emoji? In my production code I cut at about 40 characters and I wouldn't mind if it was 35 or 45. I have thought about simply checking whether the 40th character is a number or between a-z but that wouldn't work if you got a text full of emojis. I could check whether the last character is one that "ends" an emoji by pattern matching but this also seems a bit weird performance-wise.

Am I missing something? With all the bloat that JavaScript carries, is there no built-in count that sees emoji as one?

Regarding the suggestion from "Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters")":

chrs = Array.from( usaText )
(4) ["A", "🇺", "🇸", "Z"]
0: "A"
1: "🇺"
2: "🇸"
3: "Z"
length: 4

That's one too many unfortunately.

Albertoalberts answered 26/9, 2018 at 22:13 Comment(3)
You might consider looking for emojis, logging where they are, then removing them. Then do the substring, then put the emojis back into the substrings based on where they were in the original string. The substrings won't be the same length anymore, but you say that isn't an issue. – Racing
Forget about "emoji", you're asking about UTF-16 surrogate pairs, which apply to normal languages as much as they do to emoji. There is an elegant solution for this, already answered over on #21397816, consisting of using Array.from(yourstring), which will split your string into individual Unicode code points without breaking surrogate pairs apart. – Swanherd
Please check my code. I did try that already and while it made my situation a bit better it still leaves me with 2 parts. – Albertoalberts
O
12

So this isn't really an easy thing to do, and I'm inclined to tell you that you shouldn't write this on your own. You should use a library like runes.

Just a simple npm i runes, then:

const runes = require('runes');
const usaText = "A🇺🇸Z";
runes.substr(usaText, 0, 2); // "A🇺🇸"
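If adding a dependency is not an option, a similar result can be sketched with the built-in Intl.Segmenter. This is an assumption about the runtime (it needs a reasonably recent browser or Node.js) and is not part of the original answer. `graphemeSubstring` is a hypothetical helper name.

```javascript
// Hypothetical helper: keep the first `count` user-perceived characters
// (grapheme clusters) of `text`, using the built-in Intl.Segmenter.
function graphemeSubstring(text, count) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = "";
  let taken = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (taken >= count) break;
    result += segment;
    taken += 1;
  }
  return result;
}

// "A", then the two regional indicators that form the US flag, then "Z".
const flagText = "A\u{1F1FA}\u{1F1F8}Z";
console.log(graphemeSubstring(flagText, 2)); // "A" followed by the whole flag
```

Because the loop stops as soon as enough clusters are collected, cutting a long text at around 40 characters does not require segmenting the whole string.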
Odaniel answered 26/9, 2018 at 23:25 Comment(2)
The runes code also is simply-written enough that it makes a very good introduction to the major grapheme cluster splitting problems. I highly recommend reading both the code and the test cases. github.com/dotcypress/runes/blob/develop/index.js – Particular
runes(usaText) -> (3) ["A", "🇺🇸", "Z"]. Perfect, thanks! – Albertoalberts
G
3

Disclaimer: This is just extending the above comment by Mike 'Pomax' Kamermans because to me it is actually a much simpler, applicable answer (for those of us who don't like reading through all the comments):

Array.from(str) splits your string into individual Unicode code points, so surrogate pairs are never broken apart.

See Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters") for details.
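A small sketch of what that means in practice, using the string from the question (variable names here are illustrative only): counting code points instead of UTF-16 code units removes the broken halves, but a flag still counts as two items because it is built from two code points.

```javascript
// "A", then the two regional indicators that form the US flag, then "Z".
const text = "A\u{1F1FA}\u{1F1F8}Z";

// String.prototype.length counts UTF-16 code units: 1 + 2 + 2 + 1.
const codeUnitCount = text.length;

// Array.from iterates by code point, so each surrogate pair stays together.
const codePoints = Array.from(text);

console.log(codeUnitCount);     // 6
console.log(codePoints.length); // 4

// Cutting by code point never yields a lone surrogate, but it can still
// stop between the two regional indicators of the flag.
const firstTwo = codePoints.slice(0, 2).join("");
```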

Geithner answered 15/2, 2020 at 17:5 Comment(0)
T
2

This code has worked for me:

splitText = Array.from(usaText).slice(0, 5).join('');
Thaumatrope answered 24/4, 2020 at 14:39 Comment(3)
Welcome to Stack Overflow. In addition to the answer you've provided, please consider providing a brief explanation of why and how this fixes the issue. – Kendrick
Hey, (0, 2) with your code results in A🇺. Usually one would want the emoji either included completely or not at all, instead of getting broken fragments. – Albertoalberts
This is the correct answer. Not sure why it's not green. – Palmirapalmistry
