Difference between codePointAt and charCodeAt

What is the difference between String.prototype.codePointAt() and String.prototype.charCodeAt() in JavaScript?

'A'.codePointAt(); // 65
'A'.charCodeAt();  // 65
Interlanguage answered 10/4, 2016 at 8:40 Comment(1)
MDN: charCodeAt - first paragraph. (Tsar)

From the MDN page on charCodeAt:

The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.

The UTF-16 code unit matches the Unicode code point for code points which can be represented in a single UTF-16 code unit. If the Unicode code point cannot be represented in a single UTF-16 code unit (because its value is greater than 0xFFFF) then the code unit returned will be the first part of a surrogate pair for the code point. If you want the entire code point value, use codePointAt().

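TL;DR:

  • charCodeAt() returns a single UTF-16 code unit (0 to 65535).
  • codePointAt() returns the full Unicode code point, even when it is stored as a surrogate pair.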
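
As a rough sketch of what "the first part of a surrogate pair" means, the two code units that charCodeAt() returns can be combined by hand into the value that codePointAt() reports. The helper name combineSurrogates is made up for this illustration (it is not a built-in); the arithmetic is the standard UTF-16 decoding formula.

```javascript
// Hypothetical helper (not a built-in): combine a UTF-16 surrogate pair
// into the code point it encodes.
function combineSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const s = "\u{20BB7}";        // "𠮷", a character above 0xFFFF
const high = s.charCodeAt(0); // 0xD842, first half of the pair
const low = s.charCodeAt(1);  // 0xDFB7, second half of the pair

console.log(combineSurrogates(high, low) === s.codePointAt(0)); // true
```

In practice codePointAt() does this for you; the helper only shows where the number comes from.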
Synthetic answered 10/4, 2016 at 8:48 Comment(1)
Just to add on, many Unicode tables show hex values, e.g. U+00A0. Since codePointAt() returns an integer, you may want to convert it to hex by calling .toString(16) if you are looking it up in a Unicode table manually. (Mcclain)
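
A small sketch of that conversion, assuming you want the usual U+XXXX notation. The function name toUPlus is hypothetical, chosen only for this example.

```javascript
// Hypothetical helper: format a code point the way Unicode tables print it.
function toUPlus(codePoint) {
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

console.log(toUPlus("\u00A0".codePointAt(0)));    // U+00A0
console.log(toUPlus("\u{20BB7}".codePointAt(0))); // U+20BB7
```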

To add to ToxicTeacakes's answer, here is another example that shows the difference:

"𠮷".charCodeAt(0).toString(16); // d842
"𠮷".charCodeAt(1).toString(16); // dfb7

"𠮷".codePointAt(0).toString(16); // 20bb7
"𠮷".codePointAt(1).toString(16); // dfb7

console.log("\ud842\udfb7");  // 𠮷, written as two four-digit \uXXXX escapes (a surrogate pair)
console.log("\u20bb7\udfb7"); // ₻7�, because \u20bb7 is parsed as \u20bb followed by the digit 7
console.log("\u{20bb7}");     // 𠮷, a code point escape equivalent to "\ud842\udfb7"

The following describes the relevant JavaScript string literal escapes:

"\uXXXX"
The Unicode character specified by the four hexadecimal digits XXXX. For example, \u00A9 is the Unicode sequence for the copyright symbol.

"\u{XXXXX}"
Unicode code point escapes. For example, \u{2F804} is the same as the simple Unicode escapes \uD87E\uDC04.
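
A quick check of that equivalence, as a minimal sketch. The variable names are only for illustration; String.fromCodePoint and String.fromCharCode are the standard counterparts of codePointAt and charCodeAt.

```javascript
// Both escape styles described above produce the same string.
const viaCodePointEscape = "\u{2F804}";
const viaSurrogateEscapes = "\uD87E\uDC04";

console.log(viaCodePointEscape === viaSurrogateEscapes); // true
console.log(viaCodePointEscape.length);                  // 2 (two code units)

// The same distinction exists when building strings from numbers.
console.log(String.fromCodePoint(0x2F804) === viaCodePointEscape);       // true
console.log(String.fromCharCode(0xD87E, 0xDC04) === viaCodePointEscape); // true
```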

see also msdn

Extraversion answered 10/4, 2017 at 16:16 Comment(2)
Great example, just what I was looking for! A couple of questions: (i) Do you know why codePointAt(1) returns a value if codePointAt(0) captures the "whole" code point? (ii) Do you know how toLowerCase() vs toLocaleLowerCase() fits in here? Do character sets like the one in your answer even have lower/upper case? (Meurer)
I think it's because the string is still indexed by 16-bit code units, so the index is the same one used by String.prototype.charAt and String.prototype.charCodeAt. From the documentation: "If there is no element at the specified position, undefined is returned. If no UTF-16 surrogate pair begins at pos, the code unit at pos is returned." At index 1 (the second code unit) no surrogate pair begins, so codePointAt(1) returns the lone low surrogate "\udfb7", which displays as the broken character "�". (Lowbred)
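
To avoid landing in the middle of a pair, one option is to walk the string with for...of, which steps over whole code points instead of 16-bit code units. This is only a sketch; codePointsOf is a hypothetical helper name.

```javascript
// Collect code points by walking the string's iterator, which yields
// whole code points rather than individual code units.
function codePointsOf(str) { // hypothetical helper name
  const result = [];
  for (const ch of str) {
    result.push(ch.codePointAt(0));
  }
  return result;
}

const text = "a\u{20BB7}b"; // "a𠮷b"
console.log(text.length);   // 4 code units
console.log(codePointsOf(text).map((cp) => cp.toString(16)).join(" ")); // 61 20bb7 62

// Indexing by code unit instead lands inside the pair at index 2:
console.log(text.codePointAt(1).toString(16)); // 20bb7
console.log(text.codePointAt(2).toString(16)); // dfb7 (lone low surrogate)
```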

Example in JS

In this example with strings and emojis, I illustrate how things can go wrong when you do not know that some characters consist of two code units. Prefer codePointAt() over charCodeAt(); use charCodeAt() only if you are sure that all of your characters have code points between 0 and 65535 (2^16 - 1).

more about code units here

// charCodeAt() returns a UTF-16 code unit
// codePointAt() returns a Unicode code point

/* UTF-16 is generally considered a bad idea today */

const strings = ["o", "four", "to"];
const emojis = ["🐎", "👟"];

function printItemsLength(arr) {
  for (const item of arr) {
    console.log(item, item.length);
  }
}

printItemsLength(strings);
console.log('================================');
printItemsLength(emojis);
console.log('================================');
console.log("i.charCodeAt(0)", "i".charCodeAt(0)); // 105
console.log("i.charCodeAt(1)", "i".charCodeAt(1)); // NaN (there is no code unit at index 1)
console.log("i.codePointAt(0)", "i".codePointAt(0)); // 105
console.log('=============EMOJIS=============');
// getting the decimal (dec) by which you can find them

console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[0] + '.charCodeAt(0)', emojis[0].charCodeAt(0)); // only half of the character (high surrogate) - 55357
console.log(emojis[0] + '.charCodeAt(1)', emojis[0].charCodeAt(1)); // the other half (low surrogate) - 56334

console.log('===========codePointAt===========');
console.log(emojis[0] + '.codePointAt(0)', emojis[0].codePointAt(0)); // 128014

console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[1] + '.charCodeAt(0)', emojis[1].charCodeAt(0)); // only half of the character (high surrogate) - 55357
console.log(emojis[1] + '.charCodeAt(1)', emojis[1].charCodeAt(1)); // the other half (low surrogate) - 56415

console.log('===========codePointAt===========');
// full-character
console.log(emojis[1] + '.codePointAt(0)', emojis[1].codePointAt(0)); // 128095
console.log(emojis[1] + '.codePointAt(1)', emojis[1].codePointAt(1)); // returns the low surrogate, 56415 (not displayable on its own)
// to find these emojis, have a look here: https://www.w3schools.com/charsets/ref_emoji.asp

As you may have noticed, I tried to convert a char code back to the emoji, and it did not work for one of the symbols. That is because its code point does not fit in a single UTF-16 code unit.
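
The conversion attempt itself is not shown here, so the following is only a sketch of one way it can fail, assuming it went through charCodeAt() and String.fromCharCode(). The variable names are illustrative.

```javascript
const horse = "\u{1F40E}"; // "🐎"

// Round trip through a single code unit: only the high surrogate survives.
const half = String.fromCharCode(horse.charCodeAt(0));
console.log(half === horse); // false
console.log(half.length);    // 1

// Round trip through the full code point restores the emoji.
const whole = String.fromCodePoint(horse.codePointAt(0));
console.log(whole === horse); // true

// String.fromCharCode truncates its arguments to 16 bits, so passing it a
// full code point does not work either.
console.log(String.fromCharCode(128014) === horse); // false
```

So the matching pairs are charCodeAt() with String.fromCharCode() and codePointAt() with String.fromCodePoint().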

Introduction to Unicode and UTF-16

Please skip this section if you are already familiar with it.

Unicode is a set of characters used around the world. UTF-16 encodes them in 16-bit code units: 00000000 00100100 for "$" (one 16-bit unit); 11011000 01010010 11011111 01100010 for "𤭢" (two 16-bit units). read more

"Surrogate pair" characters are emoji and some letters that consist of more than one code unit, as explained here
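
The bit patterns quoted above can be reproduced directly in the console. This is a minimal sketch; codeUnitBits is a hypothetical helper written for this example.

```javascript
// Hypothetical helper: show each UTF-16 code unit of a string as 16 bits.
function codeUnitBits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(2).padStart(16, "0"));
  }
  return units;
}

console.log(codeUnitBits("$").join(" "));         // 0000000000100100
console.log(codeUnitBits("\u{24B62}").join(" ")); // 1101100001010010 1101111101100010
```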

The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme. In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF. read more
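
As a sketch of that encoding scheme, here is how a code point above 0xFFFF is split into its two surrogates. The function toSurrogatePair is hypothetical (not a built-in); the constants come from the UTF-16 encoding rules.

```javascript
// Hypothetical helper: split a code point above 0xFFFF into its
// high and low surrogates, following the UTF-16 encoding rules.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F40E); // code point of "🐎"
console.log(high, low); // 55357 56334
console.log(String.fromCharCode(high, low) === "\u{1F40E}"); // true
```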

Unicode assigns every character a unique number called a code point.

Differentiating charCodeAt() from codePointAt()

charCodeAt(pos) returns a code unit (not a full character).

If you need a character (that could be either one or two code units), you can use codePointAt(pos) to get its code.

  • charCodeAt() - returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index (link)
  • codePointAt() - returns a non-negative integer that is the Unicode code point value at the given position (link)

where pos is the index of the code unit you want to check.

Quote from the book:

UTF-16 is generally considered a bad idea today. It seems almost intentionally designed to invite mistakes. It’s easy to write programs that pretend code units and characters are the same things.

read more
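
A short sketch of the kind of mistake the quote describes, where code that treats code units as characters breaks an emoji. The variable names are only for illustration.

```javascript
const word = "to\u{1F45F}"; // "to👟"

console.log(word.length);      // 4, counts code units
console.log([...word].length); // 3, counts code points

// Reversing by code unit swaps the two surrogates and breaks the emoji.
const byUnit = word.split("").reverse().join("");
// Reversing by code point keeps each pair together.
const byPoint = [...word].reverse().join("");

console.log(byUnit.codePointAt(0));  // 56415, a lone low surrogate
console.log(byPoint.codePointAt(0)); // 128095, the shoe emoji
```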

jsfiddle sandbox

Sources:

  1. What is Unicode, UTF-8, UTF-16?
  2. Marijn Haverbeke, Eloquent JavaScript, 3rd Edition: A Modern Introduction to Programming. No Starch Press, 2018, 447 p. (can be found here). Chapter 5, p. 91: Strings and character codes.
  3. What is "surrogate pair"
  4. To find these emojis, have a look at w3schools.com/charsets/ref_emoji

Seacoast answered 9/2, 2022 at 0:58 Comment(0)
