Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters")
A

4

32

Splitting a JavaScript string into "characters" can be done trivially but there are problems if you care about Unicode (and you should care about Unicode).

JavaScript strings are sequences of 16-bit code units (UCS-2 or UTF-16), and a single code unit cannot represent a Unicode character outside the BMP (Basic Multilingual Plane).

To deal with Unicode characters beyond the BMP, JavaScript must take into account "surrogate pairs", which it does not do natively.

I'm looking for a way to split a JavaScript string by codepoint, whether each codepoint requires one or two JavaScript "characters" (code units).
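To make the code unit versus codepoint distinction concrete, here is a minimal sketch, assuming an engine with ES2015 support for `codePointAt`. The variable names are illustrative only.

```javascript
// U+1D468 (MATHEMATICAL BOLD ITALIC CAPITAL A) lies outside the BMP,
// so UTF-16 stores it as the surrogate pair \uD835 \uDC68.
const sample = 'A\uD835\uDC68B';

// length counts 16-bit code units, not codepoints: 3 codepoints, 4 units.
const unitCount = sample.length;

// split('') cuts between code units, leaving two lone surrogates
// where the single astral character used to be.
const byUnit = sample.split('');

// codePointAt reads the whole pair when given the index of the high surrogate.
const astral = sample.codePointAt(1);

console.log(unitCount, byUnit.length, astral.toString(16));
```

The naive split produces four one-unit strings, two of which are unpaired surrogates that do not display as any character on their own.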

Depending on your needs, splitting by codepoint might not be enough, and you might want to split by "grapheme cluster", where a cluster is a base codepoint followed by all its non-spacing modifier codepoints, such as combining accents and diacritics.

For the purposes of this question I do not require splitting by grapheme cluster.
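As a small illustration of why codepoint splitting is not the same as grapheme cluster splitting, the sketch below (assuming ES2015 `Array.from`) splits a base letter followed by a combining accent. The names are illustrative only.

```javascript
// 'e' followed by U+0301 COMBINING ACUTE ACCENT is displayed as one glyph,
// but it is still two separate codepoints.
const accented = 'e\u0301';

// Array.from iterates the string by codepoint; the mapping function
// converts each one-codepoint string to its numeric value.
const codepoints = Array.from(accented, (ch) => ch.codePointAt(0));

console.log(codepoints.map((cp) => 'U+' + cp.toString(16).toUpperCase()));
```

A codepoint split keeps the accent as its own element, which is the behaviour this question accepts.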

Alleged answered 28/1, 2014 at 5:9 Comment(0)
D
44

@bobince's answer has (luckily) become a bit dated; you can now simply use

var chars = Array.from( text )

to obtain a list of single-codepoint strings, which respects astral (non-BMP) Unicode characters stored as surrogate pairs.
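A short sketch of what this gives in practice, including the optional mapping argument of `Array.from` and the behaviour for an unpaired surrogate. The sample strings and variable names are assumptions for illustration.

```javascript
const text = 'A\uD835\uDC68B';

// One string per codepoint; the surrogate pair stays together.
const chars = Array.from(text);

// The second argument maps each element, here to its numeric codepoint:
// 0x41, 0x1D468, 0x42.
const numbers = Array.from(text, (ch) => ch.codePointAt(0));

// An unpaired surrogate is not dropped; it comes back as its own element.
const withLone = Array.from('a\uD835b');

console.log(chars.length, numbers, withLone.length);
```

The resulting strings can have length 1 or 2, so code that later indexes into them should not assume one code unit per element.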

Drops answered 4/3, 2017 at 14:3 Comment(1)
Or const chars = [...text];. Both use iteration under the covers. – Shugart
E
14

Along the lines of @John Frazer's answer, one can use this even more succinct form of string iteration:

const chars = [...text]

e.g., with:

const text = 'A\uD835\uDC68B\uD835\uDC69C\uD835\uDC6A'
const chars = [...text] // ["A", "𝑨", "B", "𝑩", "C", "𝑪"]
Endres answered 26/9, 2018 at 22:51 Comment(2)
best answer, if you like succinctness. – Correlate
this also works where the actual graphic symbols are pasted into the string (if your IDE supports that) – Correlate
H
5

In ECMAScript 6 you'll be able to use a string as an iterator to get code points, or you could search a string for /./ug, or you could call codePointAt(i) repeatedly.

Unfortunately for..of syntax and regexp flags can't be polyfilled and calling a polyfilled codePointAt() would be super slow (O(n²)), so we can't realistically use this approach for a while yet.
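For engines that do support ES2015, the iterator and regexp options mentioned above might look like the following sketch. The sample string and variable names are assumptions; `[\s\S]` is used next to `.` because `.` does not match line terminators unless the `s` flag is also set.

```javascript
const input = 'A\n\uD835\uDC68';

// With the u flag, the pattern is matched codepoint by codepoint,
// so a surrogate pair is returned as one match.
const dotOnly = input.match(/./gu) || [];      // skips the '\n'
const viaRegex = input.match(/[\s\S]/gu) || []; // keeps the '\n'

// for...of walks the string's iterator, one codepoint per step.
const viaLoop = [];
for (const ch of input) {
  viaLoop.push(ch);
}

console.log(dotOnly.length, viaRegex.length, viaLoop.length);
```

The loop and the `[\s\S]` pattern both yield every codepoint, while the plain `.` pattern silently drops the newline.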

So doing it the manual way:

String.prototype.toCodePoints= function() {
    var chars= [];
    for (var i= 0; i<this.length; i++) {
        var c1= this.charCodeAt(i);
        // High surrogate with at least one code unit after it
        if (c1>=0xD800 && c1<0xDC00 && i+1<this.length) {
            var c2= this.charCodeAt(i+1);
            // Low surrogate: combine the pair into one codepoint
            if (c2>=0xDC00 && c2<0xE000) {
                chars.push(0x10000 + ((c1-0xD800)<<10) + (c2-0xDC00));
                i++;
                continue;
            }
        }
        // BMP code unit or unpaired surrogate: keep as-is
        chars.push(c1);
    }
    return chars;
}

For the inverse to this see https://mcmap.net/q/48607/-javascript-strings-outside-of-the-bmp
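As a rough idea of what such an inverse can look like, here is a minimal sketch written in the same manual style. It is not taken from the linked answer, and the function name `fromCodePoints` is hypothetical.

```javascript
// Hypothetical helper: turns an array of numeric codepoints back into a string.
function fromCodePoints(codePoints) {
  var out = '';
  for (var i = 0; i < codePoints.length; i++) {
    var cp = codePoints[i];
    if (cp > 0xFFFF) {
      // Astral codepoint: subtract the 0x10000 offset, then split the
      // remaining 20 bits into a high and a low surrogate.
      cp -= 0x10000;
      out += String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
    } else {
      // BMP codepoint (or unpaired surrogate value): one code unit.
      out += String.fromCharCode(cp);
    }
  }
  return out;
}

var roundTrip = fromCodePoints([0x41, 0x1D468, 0x42]);
console.log(roundTrip === 'A\uD835\uDC68B');
```

In an ES2015 engine, `String.fromCodePoint` does the same job natively. The manual version only matters where that method is unavailable.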

Howarth answered 28/1, 2014 at 15:3 Comment(3)
codePointAt is O(n). The argument it accepts is not the codepoint index but the code unit index (the regular String index). – Oralee
@Oralee did you mean that codePointAt is O(1)? – Brittne
Yes, O(1), I can't edit the comment anymore – Oralee
V
1

Another method using codePointAt:

String.prototype.toCodePoints = function () {
  var arCP = [];
  for (var i = 0; i < this.length; i += 1) {
    var cP = this.codePointAt(i);
    arCP.push(cP);
    if (cP >= 0x10000) {
      i += 1;
    }
  }
  return arCP;
}
Valentijn answered 11/4, 2020 at 14:39 Comment(0)
