How to iterate over only the characters in a string I can actually see?

Normally I would just use something like str[i].

But what if str = "☀️🙌🏼"?

str[i] fails. for (x of str) console.log(x) also fails. It prints out a total of 4 characters, even though there are clearly only 2 emoji in the string.
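To make the mismatch concrete, here is a minimal sketch of what each approach counts. The string is written with escapes so the individual code units are visible, and the variable names are only illustrative.

```javascript
// "☀️🙌🏼" spelled out as UTF-16 code units.
const str = "\u2600\uFE0F\uD83D\uDE4C\uD83C\uDFFC";

// str[i] indexes UTF-16 code units.
const codeUnits = str.length;

// Spreading uses the string iterator (same as for...of), which walks code points.
const codePoints = [...str].length;

console.log(codeUnits, codePoints); // 6 4
```

Neither number is the 2 that I see on screen.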

What's the best way to iterate over every character I can see in a string (and newlines, I guess), and nothing else?

The ideal solution would return an array of 2 characters: the 2 emoji, and nothing else. The claimed duplicate, and a bunch of other solutions I've found, don't meet this criterion.

Schottische answered 22/4, 2016 at 4:40 Comment(10)
I think you should check this blog post: link – Dandridge
Possible duplicate of Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters") – Mythicize
Are you saying you want to capture the emoji, or skip over it and find the next "normal" character? – Messroom
@RaymondChen your suggested answer appears to be a polyfill for the for...of syntax which I pointed out does not work in this case. But please correct me if I'm wrong! – Schottische
@Messroom I would like to capture the emoji as a single character. Essentially if I can select it as a single character, I'd like to capture it as a single character. – Schottische
The suggested answer says "for..of cannot be polyfilled." The suggested answer shows how to split a string into code points. If you don't want to polyfill it, then just use it as a free function. – Mythicize
@RaymondChen My desired answer should only be 2 characters in length (both emojis and nothing else). The toCodePoints function returns an array of length 4. – Schottische
First of all, your original statement is incorrect. The loop for (x in str) console.log(x) prints six characters (plus additional junk not relevant to the discussion), not the four you originally claimed. That's because the string "☀️🙌🏼" is six code units long: "\u2600\ufe0f\ud83d\ude4c\ud83c\udffc". This breaks down into four code points: U+2600 (BLACK SUN WITH RAYS), U+FE0F (VARIATION SELECTOR-16), U+1F64C (PERSON RAISING BOTH HANDS IN CELEBRATION), and U+1F3FC (EMOJI MODIFIER FITZPATRICK TYPE-3). It sounds like you are looking to break into graphemes, which is a harder problem. – Mythicize
@RaymondChen I said for (x of str), not x in str, specifically because it breaks on code points rather than characters. Graphemes turned out to be the magic word here though - once I googled for that I quickly found a decent library to get the job done. – Schottische
See my solution posted under a different question that doesn't take Astral characters/Surrogate pairs into account: #1966976 – Saskatoon
S
1

You need to make your own methods for astral characters.

"foo🙌bar".match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|./g);
// => ["f", "o", "o", "🙌", "b", "a", "r"]
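As an alternative sketch, assuming an ES2015+ engine: the built-in string iterator already yields whole code points, so Array.from gives the same split without a hand-written surrogate range. The helper name codePointsOf is made up for illustration.

```javascript
// Array.from consumes the string iterator, which yields whole code points,
// so a surrogate pair comes back as a single element.
const codePointsOf = (s) => Array.from(s);

const parts = codePointsOf("foo\u{1F64C}bar"); // "foo🙌bar"
console.log(parts);
// => ["f", "o", "o", "🙌", "b", "a", "r"]
```

One difference from the regex: `.` without the s flag skips newlines, while Array.from keeps them. Neither version groups combining characters with their base character.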
Sydney answered 22/4, 2016 at 5:7 Comment(2)
This does not work in all cases. Consider "foo🙌b☀️ar".match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|./g);. – Schottische
@thedayturns: Yeah, I only covered astral characters, which is where JavaScript "mistakenly" splits a single Unicode character into two JS characters. The emptyish string there is a VARIATION SELECTOR-16 (U+FE0F), which is a separate Unicode character, but combines with the previous one; a similar issue would be all the combining characters like COMBINING ACUTE ACCENT (U+0301). So to solve that problem, you would need a whole library, which is outside the scope of a StackOverflow answer. – Sydney
F
0

Segmenter will do what you need:

The Intl.Segmenter object enables locale-sensitive text segmentation, enabling you to get meaningful items (graphemes, words or sentences) from a string.

In your case, the code would look like this:

// 'grapheme' splits on user-perceived characters; 'word' would return
// ordinary text such as "foo" as a single segment.
const segmenterEmoji = new Intl.Segmenter('en', { granularity: 'grapheme' });
const string2 = '☀️🙌🏼';

const iterator1 = segmenterEmoji.segment(string2)[Symbol.iterator]();

console.log(iterator1.next().value.segment);
// Expected output: '☀️'

console.log(iterator1.next().value.segment);
// Expected output: '🙌🏼'

Note: The language/locale doesn't really matter in your case because emojis are a little different to "normal text"
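To get the array the question asks for, the segments can be collected in one step. This is a minimal sketch, assuming a runtime that ships Intl.Segmenter and using granularity 'grapheme'; the helper name splitGraphemes and its default locale are hypothetical.

```javascript
// Hypothetical helper: split a string into user-perceived characters.
function splitGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  // segment() returns an iterable of { segment, index, input } objects;
  // keep only the text of each segment.
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

const sample = '\u2600\uFE0F\u{1F64C}\u{1F3FC}'; // "☀️🙌🏼"
const graphemes = splitGraphemes(sample);
console.log(graphemes.length); // 2
```

A newline comes back as its own element, so splitGraphemes('a\nb') gives three entries.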

In the example from MDN:

const segmenterFr = new Intl.Segmenter('fr', { granularity: 'word' });
const string1 = 'Que ma joie demeure';

const iterator1 = segmenterFr.segment(string1)[Symbol.iterator]();

console.log(iterator1.next().value.segment);
// Expected output: 'Que'

console.log(iterator1.next().value.segment);
// Expected output: ' '
Forint answered 8/3 at 9:25 Comment(0)
