How to iterate over only the characters in a string I can actually see?

Asked 22/4, 2016 at 4:40 Answered 8/3 at 9:25

javascript unicode surrogate-pairs astral-plane

Normally I would just use something like str[i].

But what if str = "☀️🙌🏼"?

str[i] fails. for (x of str) console.log(x) also fails. It prints out a total of 4 characters, even though there are clearly only 2 emoji in the string.

What's the best way to iterate over every character I can see in a string (and newlines, I guess), and nothing else?

The ideal solution would return an array of 2 characters: the 2 emoji, and nothing else. The claimed duplicate, and a bunch of other solutions I've found, don't meet these criteria.

Schottische answered 22/4, 2016 at 4:40 Comment(10)

I think you should check this blog post : link – Dandridge 22/4, 2016 at 4:49

Possible duplicate of Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters") – Mythicize 22/4, 2016 at 4:59

Are you saying you want to capture the emoji, or skip over it and find the next "normal" character? – Messroom 22/4, 2016 at 5:0

@RaymondChen your suggested answer appears to be a polyfill for the for...of syntax which I pointed out does not work in this case. But please correct me if I'm wrong! – Schottische 22/4, 2016 at 9:32

@Messroom I would like to capture the emoji as a single character. Essentially if I can select it as a single character, I'd like to capture it as a single character. – Schottische 22/4, 2016 at 9:33

The suggested answer says "for..of cannot be polyfilled." The suggested answer shows how to split a string into code points. If you don't want to polyfill it, then just use it as a free function. – Mythicize 22/4, 2016 at 14:23

@RaymondChen My desired answer should only be 2 characters in length (both emojis and nothing else). The toCodePoints function returns an array of length 4. – Schottische 22/4, 2016 at 19:57

First of all, your original statement is incorrect. The for (x in str) console.log(x) prints six characters (plus additional junk not relevant to the discussion), not the four you originally claimed. That's because the string "☀️🙌🏼" is six code units long: "\u2600\ufe0f\ud83d\ude4c\ud83c\udffc". This breaks down into four code points: U+2600 (BLACK SUN WITH RAYS), U+FE0F (VARIATION SELECTOR-16), U+1F64C (PERSON RAISING BOTH HANDS IN CELEBRATION), and U+1F3FC (EMOJI MODIFIER FITZPATRICK TYPE-3). It sounds like you are looking to break into graphemes, which is a harder problem. – Mythicize 22/4, 2016 at 22:38
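To make the code unit / code point distinction from this comment concrete, here is a minimal sketch. The variable names are only illustrative, and the string is written with escapes so the individual code units are visible:

```javascript
const str = "\u2600\uFE0F\uD83D\uDE4C\uD83C\uDFFC"; // "☀️🙌🏼"

// UTF-16 code units: what str.length and str[i] operate on.
const codeUnits = str.length;

// Code points: what for...of and the spread operator iterate over.
const codePoints = [...str].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase()
);

console.log(codeUnits);  // 6
console.log(codePoints); // ["U+2600", "U+FE0F", "U+1F64C", "U+1F3FC"]
```

Neither count is 2, which is why indexing and for...of both miss what the question is after.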

@RaymondChen I said for (x of str), not x in str, specifically because it breaks on code points rather than characters. Graphemes turned out to be the magic word here though - once I googled for that I quickly found a decent library to get the job done. – Schottische 22/4, 2016 at 23:40

See my solution posted under a different question that doesn't take Astral characters/Surrogate pairs into account: #1966976 – Saskatoon 5/7, 2017 at 9:23

You need to make your own methods for astral characters.

"foo🙌bar".match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|./g);
// => ["f", "o", "o", "🙌", "b", "a", "r"]

Sydney answered 22/4, 2016 at 5:7 Comment(2)

This does not work in all cases. Consider "foo🙌b☀️ar".match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|./g);. – Schottische 22/4, 2016 at 23:31

@thedayturns: Yeah, I only covered astral characters, which is where JavaScript "mistakenly" splits a single Unicode character into two JS characters. The emptyish string there is a VARIATION SELECTOR 16 (U+FE0F), which is a separate Unicode character, but combines with the previous; a similar issue would be all the combining characters like COMBINING ACUTE ACCENT (U+0301). So to solve that problem, you would need a whole library, which is outside the scope of a StackOverflow answer. – Sydney 23/4, 2016 at 13:35
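As a rough illustration of the limitation discussed in these comments, the sketch below runs the same regex on a string with a variation selector and on one with a combining accent. The variable names are made up for the example:

```javascript
// The surrogate-pair regex from the answer above.
const astralAware = /[\uD800-\uDBFF][\uDC00-\uDFFF]|./g;

// "foo🙌b☀️ar": the sun is U+2600 followed by U+FE0F.
const withSelector = "foo\uD83D\uDE4Cb\u2600\uFE0Far".match(astralAware);

// "é" written as "e" plus COMBINING ACUTE ACCENT (U+0301).
const withAccent = "e\u0301".match(astralAware);

console.log(withSelector.length); // 9, although only 8 characters are visible
console.log(withAccent.length);   // 2, although only 1 character is visible
```

The regex keeps surrogate pairs together but still treats each selector or combining mark as its own match.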

Intl.Segmenter will do what you need:

The Intl.Segmenter object enables locale-sensitive text segmentation, enabling you to get meaningful items (graphemes, words or sentences) from a string.

In your case, the code would look like this:

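// 'grapheme' splits on user-perceived characters; 'word' would group letters into whole words.
const segmenterEmoji = new Intl.Segmenter('en', { granularity: 'grapheme' });
const string2 = '☀️🙌🏼';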

const iterator1 = segmenterEmoji.segment(string2)[Symbol.iterator]();

console.log(iterator1.next().value.segment);
// Expected output: '☀️'

console.log(iterator1.next().value.segment);
// Expected output: '🙌🏼'

Note: The language/locale doesn't really matter in your case, because emoji are a little different from "normal text".
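Since the question asks for an array rather than an iterator, something along these lines might be more convenient. This is only a sketch: splitGraphemes is a hypothetical helper name, it uses the 'grapheme' granularity, and it assumes a runtime where Intl.Segmenter is available:

```javascript
// Hypothetical helper (not part of Intl): returns the graphemes of a
// string as an array. Assumes Intl.Segmenter exists in this runtime.
function splitGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  // segment() returns an iterable of { segment, index, input } objects.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const graphemes = splitGraphemes("\u2600\uFE0F\uD83D\uDE4C\uD83C\uDFFC"); // "☀️🙌🏼"
console.log(graphemes.length); // 2
```

The resulting array can then be indexed or looped over like any other, e.g. graphemes[1] is the second emoji including its skin-tone modifier.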

In the example from MDN:

const segmenterFr = new Intl.Segmenter('fr', { granularity: 'word' });
const string1 = 'Que ma joie demeure';

const iterator1 = segmenterFr.segment(string1)[Symbol.iterator]();

console.log(iterator1.next().value.segment);
// Expected output: 'Que'

console.log(iterator1.next().value.segment);
// Expected output: ' '

Forint answered 8/3 at 9:25 Comment(0)
