JavaScript strings outside of the BMP
L

5

42

BMP being Basic Multilingual Plane

According to JavaScript: the Good Parts:

JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide.

This leads me to believe that JavaScript uses UCS-2 (not UTF-16!) and can only handle characters up to U+FFFF.

Further investigation confirms this:

> String.fromCharCode(0x20001);

The fromCharCode method seems to only use the lowest 16 bits when returning the Unicode character. Trying to get U+20001 (CJK unified ideograph 20001) instead returns U+0001.
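The truncation can be checked directly. A small sketch, assuming a standard ECMAScript engine in which String.fromCharCode converts each argument to an unsigned 16-bit integer (the variable names are illustrative only):

```javascript
// String.fromCharCode keeps only the low 16 bits of each argument.
const requested = 0x20001;
const result = String.fromCharCode(requested);

console.log(result.length);                     // 1
console.log(result.charCodeAt(0));              // 1
console.log(requested & 0xFFFF);                // 1, the same low 16 bits
console.log(result === "\u0001");               // true
```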

Question: is it at all possible to handle post-BMP characters in JavaScript?


2011-07-31: slide twelve from Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly covers issues related to this quite well.

Landmeier answered 19/9, 2010 at 6:17 Comment(5)
If it were using UTF-16, then you would expect characters outside the basic multilingual plane to be supported using surrogate pairs. Why would you expect it to accept a 32-bit character?Brunell
Thanks a lot for that, I never thought of it that way.Landmeier
@MichaelAaronSafyan: Because JavaScript doesn't have anything resembling a "char" type and String.fromCharCode() returns a string it seems fair to expect it to return a string containing both code units that make up the character. I believe there will be a String.fromCodePoint() added to a future JavaScript standard to do exactly that.Tennietenniel
Your question explained why I would keep getting length === 1 after using String.fromCharCodeIllona
You can now do "\u{23222}" in ES6 :DDogmatism
S
36

Depends what you mean by ‘support’. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can.

But each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: the standard String members (length, split, slice, etc.) all deal with code units, not characters, so they will quite happily split surrogate pairs or hold invalid surrogate sequences.
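As a quick illustration of that code-unit behaviour, here is a sketch using U+1D11E (musical symbol G clef), written with explicit surrogate escapes so that it does not depend on the file encoding:

```javascript
const clef = "\uD834\uDD1E";            // one character, two UTF-16 code units
console.log(clef.length);               // 2

const firstHalf = clef.slice(0, 1);     // slice cuts the pair in half
console.log(firstHalf === "\uD834");    // true: a lone high surrogate

console.log(clef.split("").length);     // 2: split("") also works per code unit

const reversed = "\uDD1E\uD834";        // invalid sequence, stored without complaint
console.log(reversed.length);           // 2
```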

If you want surrogate-aware methods, I'm afraid you're going to have to start writing them yourself! For example:

String.prototype.getCodePointLength= function() {
    return this.length-this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length+1;
};

String.fromCodePoint= function() {
    var chars= Array.prototype.slice.call(arguments);
    for (var i= chars.length; i-->0;) {
        var n = chars[i]-0x10000;
        if (n>=0)
            chars.splice(i, 1, 0xD800+(n>>10), 0xDC00+(n&0x3FF));
    }
    return String.fromCharCode.apply(null, chars);
};
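The arithmetic inside fromCodePoint can be followed on a single value. A sketch with a hypothetical helper, toSurrogatePair (not part of any standard), applied to U+1D11E:

```javascript
// Hypothetical helper for illustration; it mirrors the arithmetic used above.
function toSurrogatePair(codePoint) {
    if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
        throw new RangeError("not a supplementary code point");
    }
    const offset = codePoint - 0x10000;      // a 20-bit value
    const high = 0xD800 + (offset >> 10);    // top 10 bits
    const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
    return [high, low];
}

const pair = toSurrogatePair(0x1D11E);
console.log(pair.map(function (u) { return u.toString(16); })); // [ 'd834', 'dd1e' ]
console.log(String.fromCharCode(pair[0], pair[1]) === "\uD834\uDD1E"); // true
```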
Splendor answered 21/9, 2010 at 10:16 Comment(7)
Thank you very much. That's a great, detailed answer.Landmeier
@Splendor So, technically, does JS use UCS-2 or UTF-16? UCS-2 doesn’t support characters outside the BMP, but JavaScript does if the individual surrogate halves are entered individually (e.g. '\uD834\uDD1E' for U+1D11E). But does that make it UTF-16?Barter
@Mathias: JavaScript is UTF-16-ignorant. It gives you a sequence of 16-bit code units and lets you put what you like in it. You can store surrogates in it if you want, but you won't get any special features to handle them as characters. Whether you want to describe that as ‘using’ UCS-2 or UTF-16 is a semantic argument to which there is not one definitive answer. However, regardless of language-level support in JS, other parts of the browser do support surrogates for rendering/interaction in the UI, so it makes some sense to include them in JS strings.Splendor
@Splendor Thanks! I looked into it a bit further and have written up my findings here: mathiasbynens.be/notes/javascript-encoding Feedback welcome.Barter
(Updated fromCodePoint to match the name proposed for ECMAScript 6's support for proper Unicode. This is now effectively a polyfill.)Splendor
Changing my vote since this answer is now out of date in saying "There is no language-level support for handling full characters". There is now some language level full character support, such as codePointAt, fromCodePoint, Array.from(), /u, for ... of, the ... operator. Perhaps others?Tennietenniel
Yes, @Tennietenniel is right. This used to be the best answer, but times have changed. For recent browsers use the built-in browser support as explained by Michael Allen instead.Murrhine
R
4

Yes — JavaScript nowadays ships with reliable tools to handle characters outside the BMP. We should make use of them (as Stijn de Witt points out) and not roll our own solutions.

fromCodePoint

See function String.fromCodePoint.

const ideograph = String.fromCodePoint( 0x2A6DF/*outside the BMP*/ );
console.log( ideograph );
  → 𪛟

codePointAt

See method String.prototype.codePointAt.

const codePoint = "𪛟".codePointAt( 0 );
console.log( codePoint.toString( 16 ));
  → 2A6DF

iterator

See method String.prototype[@@iterator].

This method returns an iterator over the code points of the string. You can use it, for instance, to calculate a string’s length in code-point units:

function countCodePoints( str ) {
    let count = 0;
    /* ‘The value of the "name" property of this method is
       "[Symbol.iterator]".’ (ECMA spec) */
    const i = str[Symbol.iterator]();
    while( !i.next().done ) ++count;
    return count;
}
console.log( "𪛟".length ); // Length in 16-bit code units.
  → 2
console.log( countCodePoints( "𪛟" )); // Length in Unicode code points.
  → 1
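Several other ES2015 features are also code-point aware, among them Array.from, the spread operator, the \u{...} escape and the u regular-expression flag. A brief sketch, assuming an ES2015-capable engine:

```javascript
const text = "a\u{2A6DF}b";                  // "a", U+2A6DF, "b"

console.log(text.length);                    // 4 code units
console.log(Array.from(text).length);        // 3 code points
console.log([...text][1] === "\u{2A6DF}");   // true

console.log(/^.$/.test("\u{2A6DF}"));        // false: "." matches one code unit
console.log(/^.$/u.test("\u{2A6DF}"));       // true: "." matches one code point
```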
Riana answered 24/12, 2017 at 19:46 Comment(2)
This answer is underrated. I think it should become the accepted one. When native implementations are available, you don't want to roll your own. Besides, the answers above modify the prototype of the native String object, which has been frowned upon in recent years because code written that way influences the runtime that all other code also lives in. In other words, it causes side effects, with possibly unexpected behavioural changes in otherwise completely unrelated code. Don't modify native objects. So use this code and not that from the other answers.Murrhine
@Stijn de Witt: My fault if it’s underrated, it was hardly readable. Latest edit might help.Riana
P
3

I came to the same conclusion as bobince. If you want to work with strings containing Unicode characters outside of the BMP, you have to reimplement JavaScript's String methods. This is because JavaScript counts each 16-bit code value as one character. Symbols outside of the BMP need two code values to be represented. You therefore run into a case where some symbols count as two characters and some count only as one.

I've reimplemented the following methods to treat each Unicode code point as a single character: .length, .charCodeAt, .fromCharCode, .charAt, .indexOf, .lastIndexOf, .slice, and .split.

You can check it out on jsfiddle: http://jsfiddle.net/Y89Du/

Here's the code without comments. I tested it, but it may still have errors. Comments are welcome.

if (!String.prototype.ucLength) {
    String.prototype.ucLength = function() {
        // this solution was taken from 
        // https://mcmap.net/q/48607/-javascript-strings-outside-of-the-bmp
        return this.length - this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length + 1;
    };
}

// Named ucCodePointAt rather than codePointAt: the native
// String.prototype.codePointAt takes a code-unit index, whereas this
// method takes a code-point index, so the two must not share a name.
if (!String.prototype.ucCodePointAt) {
    String.prototype.ucCodePointAt = function (ucPos) {
        if (isNaN(ucPos)){
            ucPos = 0;
        }
        var str = String(this);
        var codePoint = null;
        var pairFound = false;
        var ucIndex = -1;
        var i = 0;
        while (i < str.length){
            ucIndex += 1;
            var code = str.charCodeAt(i);
            var next = str.charCodeAt(i + 1);
            pairFound = (0xD800 <= code && code <= 0xDBFF && 0xDC00 <= next && next <= 0xDFFF);
            if (ucIndex == ucPos){
                codePoint = pairFound ? ((code - 0xD800) * 0x400) + (next - 0xDC00) + 0x10000 : code;
                break;
            } else{
                i += pairFound ? 2 : 1;
            }
        }
        return codePoint;
    };
}

if (!String.fromCodePoint) {
    String.fromCodePoint = function () {
        var strChars = [], codePoint, offset, codeValues, i;
        for (i = 0; i < arguments.length; ++i) {
            codePoint = arguments[i];
            offset = codePoint - 0x10000;
            if (codePoint > 0xFFFF){
                codeValues = [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
            } else{
                codeValues = [codePoint];
            }
            strChars.push(String.fromCharCode.apply(null, codeValues));
        }
        return strChars.join("");
    };
}

if (!String.prototype.ucCharAt) {
    String.prototype.ucCharAt = function (ucIndex) {
        var str = String(this);
        var codePoint = str.ucCodePointAt(ucIndex);
        // ucCodePointAt returns null when ucIndex is out of range.
        return codePoint === null ? "" : String.fromCodePoint(codePoint);
    };
}

if (!String.prototype.ucIndexOf) {
    String.prototype.ucIndexOf = function (searchStr, ucStart) {
        if (isNaN(ucStart)){
            ucStart = 0;
        }
        if (ucStart < 0){
            ucStart = 0;
        }
        var str = String(this);
        var strUCLength = str.ucLength();
        searchStr = String(searchStr);
        var ucSearchLength = searchStr.ucLength();
        var i = ucStart;
        while (i < strUCLength){
            var ucSlice = str.ucSlice(i,i+ucSearchLength);
            if (ucSlice == searchStr){
                return i;
            }
            i++;
        }
        return -1;
    };
}

if (!String.prototype.ucLastIndexOf) {
    String.prototype.ucLastIndexOf = function (searchStr, ucStart) {
        var str = String(this);
        var strUCLength = str.ucLength();
        if (isNaN(ucStart)){
            ucStart = strUCLength - 1;
        }
        if (ucStart >= strUCLength){
            ucStart = strUCLength - 1;
        }
        searchStr = String(searchStr);
        var ucSearchLength = searchStr.ucLength();
        var i = ucStart;
        while (i >= 0){
            var ucSlice = str.ucSlice(i,i+ucSearchLength);
            if (ucSlice == searchStr){
                return i;
            }
            i--;
        }
        return -1;
    };
}

if (!String.prototype.ucSlice) {
    String.prototype.ucSlice = function (ucStart, ucStop) {
        var str = String(this);
        var strUCLength = str.ucLength();
        if (isNaN(ucStart)){
            ucStart = 0;
        }
        if (ucStart < 0){
            ucStart = strUCLength + ucStart;
            if (ucStart < 0){ ucStart = 0;}
        }
        // With no end index, slice through the end of the string
        // (the end index is exclusive, so it must be the full length).
        if (typeof(ucStop) == 'undefined'){
            ucStop = strUCLength;
        }
        if (ucStop < 0){
            ucStop = strUCLength + ucStop;
            if (ucStop < 0){ ucStop = 0;}
        }
        // Never read past the last code point.
        if (ucStop > strUCLength){
            ucStop = strUCLength;
        }
        var ucChars = [];
        var i = ucStart;
        while (i < ucStop){
            ucChars.push(str.ucCharAt(i));
            i++;
        }
        return ucChars.join("");
    };
}

if (!String.prototype.ucSplit) {
    String.prototype.ucSplit = function (delimiter, limit) {
        var str = String(this);
        var strUCLength = str.ucLength();
        var ucChars = [];
        if (delimiter == ''){
            for (var i = 0; i < strUCLength; i++){
                ucChars.push(str.ucCharAt(i));
            }
            // Apply the limit only when one was given: slicing with an
            // undefined limit added to 0 gives NaN and an empty array.
            if (typeof(limit) != 'undefined'){
                ucChars = ucChars.slice(0, limit);
            }
        } else{
            ucChars = str.split(delimiter, limit);
        }
        return ucChars;
    };
}
Pulverulent answered 1/2, 2013 at 7:19 Comment(3)
Many thanks for releasing into public domain. You, sir/madam, are a gentleman/woman and a scholar.Directorial
ucCharAt seems to be broken. "🌔🌖🐺🐶🍄".ucCharAt(0) returns the correct value but change the 0 to a 1 and it returns gibberish. Change it to 2 and it returns the second (instead of the first) symbol. So to get to the last symbol, you have to call ucCharAt(8) which is larger than the string's ucLength.Directorial
Don't modify native objects. Use the built-in browser support as explained by Michael Allen instead.Murrhine
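The behaviour reported above for ucCharAt is what one would see if the native codePointAt were called with a code-point index: the native method takes a code-unit index, so odd positions inside a surrogate pair return the low surrogate. A sketch of the difference, assuming an ES2015-capable engine and using escapes for the first two symbols from the comment:

```javascript
const moons = "\u{1F314}\u{1F316}";              // two symbols, four code units

console.log(moons.codePointAt(0).toString(16));  // 1f314
console.log(moons.codePointAt(1).toString(16));  // df14, low surrogate of the first symbol
console.log(moons.codePointAt(2).toString(16));  // 1f316, the second symbol

// Indexing by code point instead: spread the string into code points first.
const symbols = [...moons];
console.log(symbols.length);                     // 2
console.log(symbols[1] === "\u{1F316}");         // true
```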
H
0

Yes, you can. Although support for non-BMP characters directly in source documents is optional according to the ECMAScript standard, modern browsers let you use them. Naturally, the document encoding must be properly declared, and for most practical purposes you would need to use the UTF-8 encoding. Moreover, you need an editor that can handle UTF-8, and you need some input method(s); see e.g. my Full Unicode Input utility.

Using suitable tools and settings, you can write var foo = '𠀁'.

The non-BMP characters will be internally represented as surrogate pairs, so each non-BMP character counts as 2 in the string length.

Hibiscus answered 10/12, 2012 at 7:26 Comment(0)
H
0

Using a for (c of this) loop, one can perform various computations on a string that contains non-BMP characters, for instance computing the string length or getting the nth character of the string:

String.prototype.magicLength = function()
{
    var c, k;
    k = 0;
    for (c of this) // iterate each char of this
    {
        k++;
    }
    return k;
}

String.prototype.magicCharAt = function(n)
{
    var c, k;
    k = 0;
    for (c of this) // iterate each char of this
    {
        if (k == n) return c + "";
        k++;
    }
    return "";
}
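The same loops can also be written as plain functions, which avoids adding properties to String.prototype. A sketch with hypothetical names codePointLength and codePointCharAt, assuming an engine that supports for...of:

```javascript
// Count code points by walking the string's iterator.
function codePointLength(str) {
    let count = 0;
    for (const ch of str) {
        count++;
    }
    return count;
}

// Return the nth code point as a string, or "" when n is out of range.
function codePointCharAt(str, n) {
    let k = 0;
    for (const ch of str) {
        if (k === n) return ch;
        k++;
    }
    return "";
}

const sample = "x\u{20001}y";
console.log(sample.length);                              // 4
console.log(codePointLength(sample));                    // 3
console.log(codePointCharAt(sample, 1) === "\u{20001}"); // true
console.log(codePointCharAt(sample, 5) === "");          // true
```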
Hedges answered 25/11, 2018 at 20:24 Comment(0)
