Unicode characters from char codes in JavaScript for char codes > 0xFFFF

I need to get a string / character from a Unicode charcode and finally put it into a DOM TextNode to add to an HTML page using client-side JavaScript.

Currently, I am doing:

String.fromCharCode(parseInt(charcode, 16));

where charcode is a hex string containing the code point, e.g. "1D400". The Unicode character that should be returned is 𝐀 (U+1D400), but 퐀 is returned instead! Characters in the 16-bit range (0000 ... FFFF) are returned as expected.

Any explanation and / or proposals for correction?

Thanks in advance!

Cassirer asked 27/3, 2011 at 1:3 Comment(1)
Here’s a detailed explanation: mathiasbynens.be/notes/javascript-encoding – Taranto

The problem is that strings in JavaScript are (mostly) UCS-2 encoded, but a character outside the Basic Multilingual Plane has to be represented as a UTF-16 surrogate pair.

The following function is adapted from Converting punycode with dash character to Unicode:

function utf16Encode(input) {
    var output = [], i = 0, len = input.length, value;
    while (i < len) {
        value = input[i++];
        // Reject code points in the surrogate range (U+D800–U+DFFF):
        // they are not valid characters on their own.
        if (value >= 0xD800 && value <= 0xDFFF) {
            throw new RangeError("UTF-16(encode): Illegal UTF-16 value");
        }
        if (value > 0xFFFF) {
            // Supplementary code point: split into a high surrogate
            // followed by a low surrogate.
            value -= 0x10000;
            output.push(String.fromCharCode(((value >>> 10) & 0x3FF) | 0xD800));
            value = 0xDC00 | (value & 0x3FF);
        }
        output.push(String.fromCharCode(value));
    }
    return output.join("");
}

alert( utf16Encode([0x1D400]) );
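
For the original use case (a hex charcode string that ends up in a DOM text node), a usage sketch could look like this; it is not part of the original answer, and appending to document.body is just for illustration:

var charcode = "1D400";                           // hex code point string from the question
var text = utf16Encode([parseInt(charcode, 16)]); // "\uD835\uDC00", rendered as 𝐀
document.body.appendChild(document.createTextNode(text));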
Hyperpyrexia answered 27/3, 2011 at 1:33 Comment(3)
Although I used the (shorter) code from Anomie, I accepted your solution since your code does some nice error checking (though I don't need it) – Cassirer
Note that the correct terminology is just UTF-16 encoding. This maps one to one to UCS-2 for the first 65536 characters, except for the surrogates. But from what we can see in your code, it's just "plain" UTF-16. – Acquisitive
@AlexisWilke: Not quite. JavaScript characters aren't exposed as either UCS-2 or UTF-16 really: it's identical to UCS-2, except that surrogates are allowed. It isn't UTF-16 because unmatched surrogates and surrogates in the wrong order are allowed. It's only when rendering the character in the browser that the UTF-16-style surrogates are combined into a single Unicode character. Here's a good article for background: mathiasbynens.be/notes/javascript-encoding – Hyperpyrexia

String.fromCharCode can only handle code points in the BMP (i.e. up to U+FFFF). To handle higher code points, this function from Mozilla Developer Network may be used to return the surrogate pair representation:

function fixedFromCharCode (codePt) {
    if (codePt > 0xFFFF) {
        // Supplementary code point: encode it as a high/low surrogate pair.
        codePt -= 0x10000;
        return String.fromCharCode(0xD800 + (codePt >> 10), 0xDC00 + (codePt & 0x3FF));
    } else {
        return String.fromCharCode(codePt);
    }
}
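
As a quick check (not part of the original answer), applying it to the code point from the question yields the expected surrogate pair, while BMP code points pass through unchanged:

fixedFromCharCode(0x1D400) === "\uD835\uDC00"; // true – surrogate pair for U+1D400 (𝐀)
fixedFromCharCode(0x41) === "A";               // true – BMP code points need no pair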
Squeegee answered 27/3, 2011 at 1:31 Comment(2)
So JavaScript strings are UTF-16 encoded, and this piece of code is a code point => UTF-16 conversion, as I understand it... I expected the problem (and solution) to be something like this. It worked! Thanks! – Cassirer
I tried this and got a "character conversion error", but then I realized the script file was encoded in UTF-8; when I changed the encoding to UCS-2 (in Notepad++) it worked. – Selfappointed

Section 8.4 of the ECMAScript language specification says:

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.

So you need to encode supplementary code points as pairs of UTF-16 code units.

The article "Supplementary Characters in the Java Platform" gives a good description of how to do this.

UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means, software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.

The following table shows the different representations of a few characters in comparison:

Code point   UTF-16 code units
U+0041       0041
U+00DF       00DF
U+6771       6771
U+10400      D801 DC00

Once you know the UTF-16 code units, you can create a string using the JavaScript function String.fromCharCode:

String.fromCharCode(0xd801, 0xdc00) === '𐐀'
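
The surrogate-pair arithmetic from the quoted article can also be written out directly; here is a worked sketch for the U+10400 row of the table above:

var codePt = 0x10400;                    // supplementary code point
var offset = codePt - 0x10000;           // 0x00400
var high = 0xD800 + (offset >> 10);      // 0xD801 – high (lead) surrogate
var low  = 0xDC00 + (offset & 0x3FF);    // 0xDC00 – low (trail) surrogate
String.fromCharCode(high, low) === '𐐀';  // true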
Bakst answered 27/3, 2011 at 1:34 Comment(4)
Thanks for this detailed explanation! It made me understand the behavior of JavaScript strings more deeply. It seems that the description of fromCharCode in the following w3schools doc is wrong, since it just says "Unicode value", but 0x1A000 is a "Unicode value" too: W3Schools: fromCharCode() – Cassirer
@leemes, since I'm quoting the spec: "15.5.3.2 String.fromCharCode ( [ char0 [ , char1 [ , ... ] ] ] ) Returns a String value containing as many characters as the number of arguments. Each argument specifies one character of the resulting String, with the first argument specifying the first character, and so on, from left to right. An argument is converted to a character by applying the operation ToUint16 (9.7) and regarding the resulting 16-bit integer as the code unit value of a character. If no arguments are supplied, the result is the empty String." – Bakst
@leemes, Since chars are UTF-16 code units, and ToUint16(0x10000) === 0, trying to pass a supplemental code unit to String.fromCharCode will not work as intended. Unfortunately, String.fromCharCode(0x10000) === '\u0000'. Nebosja Ciric and others are trying to make the next version better i18n-wise : mail.mozilla.org/pipermail/es-discuss/2010-June/011380.html – Bakst
With "wrong" I meant the description of w3schools, not your quote... Since knowing that String.fromCharCode does NOT accept any Unicode charcode ("code point") but a 16 bit code representing a UTF-16 encoding ("UTF-16 code unit"), which is of course something different, it's all clear know. Thanks. – Cassirer

String.fromCodePoint() seems to do the trick as well. See here.

console.log(String.fromCodePoint(0x1D622, 0x1D623, 0x1D624, 0x1D400));

Output:

π˜’π˜£π˜€π€
Claypan answered 17/3, 2019 at 12:42 Comment(0)
