What is the difference between String.prototype.codePointAt() and String.prototype.charCodeAt() in JavaScript?
'A'.codePointAt(); // 65
'A'.charCodeAt(); // 65
From the MDN page on charCodeAt:
The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The UTF-16 code unit matches the Unicode code point for code points which can be represented in a single UTF-16 code unit. If the Unicode code point cannot be represented in a single UTF-16 code unit (because its value is greater than 0xFFFF) then the code unit returned will be the first part of a surrogate pair for the code point. If you want the entire code point value, use codePointAt().
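A minimal sketch of what the quote describes, using the horse emoji (U+1F40E) as a sample character. The helper name combineSurrogates is hypothetical; it only illustrates the arithmetic that codePointAt() performs for you.

```javascript
// Sample character outside the 16-bit range: U+1F40E (horse emoji).
const horse = "\u{1F40E}";

const high = horse.charCodeAt(0); // first half of the surrogate pair
const low = horse.charCodeAt(1);  // second half of the surrogate pair

// Hypothetical helper: rebuild the code point from the two UTF-16 code units.
function combineSurrogates(hi, lo) {
  return (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;
}

console.log(high, low);                    // 55357 56334
console.log(combineSurrogates(high, low)); // 128014
console.log(horse.codePointAt(0));         // 128014
```

For characters at or below 0xFFFF both methods return the same number, so the difference only shows up with surrogate pairs.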
charCodeAt() is UTF-16, codePointAt() is Unicode.
codePointAt returns an int; you may want to convert it to hex by calling .toString(16) if you are looking up the Unicode table manually. –
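Mcclain
To add to ToxicTeacakes's answer, here is another example that shows the difference:
"𠮷".charCodeAt(0).toString(16); // d842 (high surrogate)
"𠮷".charCodeAt(1).toString(16); // dfb7 (low surrogate)
"𠮷".codePointAt(0).toString(16); // 20bb7 (the full code point)
"𠮷".codePointAt(1).toString(16); // dfb7 (index 1 is the lone low surrogate)
console.log("\ud842\udfb7"); // 𠮷, written as two four-digit escapes (a surrogate pair)
console.log("\u20bb7\udfb7"); // ₻7�, because \u reads only four hex digits: U+20BB, then "7", then a lone surrogate
console.log("\u{20bb7}"); // 𠮷, a Unicode code point escape equivalent to "\ud842\udfb7"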
The following is the information about JavaScript string literals:
"\uXXXX"
The Unicode character specified by the four hexadecimal digits XXXX. For example, \u00A9 is the Unicode sequence for the copyright symbol.
"\u{XXXXX}"
Unicode code point escapes. For example, \u{2F804} is the same as the simple Unicode escapes \uD87E\uDC04.
see also msdn
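The two escape forms can be compared directly. This is a small sketch; the variable names are made up for illustration.

```javascript
// Both literals describe the same string: one code point, two code units.
const viaSurrogates = "\uD87E\uDC04";
const viaCodePoint = "\u{2F804}";

console.log(viaSurrogates === viaCodePoint);           // true
console.log(viaCodePoint.length);                      // 2
console.log(viaCodePoint.codePointAt(0).toString(16)); // 2f804

// String.fromCodePoint() builds the same string from the number.
console.log(String.fromCodePoint(0x2f804) === viaCodePoint); // true
```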
codePointAt(1) returns a value if the codePointAt(0) captures the "whole" code point? (ii) Do you know how toLowerCase() vs toLocaleLowerCase() fits here? Do character sets like the one in your answer even have lower/upper case? – Meurer
String.prototype.charAt and String.prototype.charCodeAt. If there is no element at the specified position, undefined is returned. If no UTF-16 surrogate pair begins at pos, the code unit at pos is returned. So at index 1, the second 16-bit code unit, no surrogate pair begins in the index range [1,2], so the result is the lone low surrogate "\udfb7" ("�") –
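One way to avoid landing in the middle of a surrogate pair is to advance by code point instead of by index. A rough sketch, assuming the string is well-formed; the function name codePointsOf is made up for illustration.

```javascript
// Hypothetical helper: list the code points of a string as hex strings.
function codePointsOf(str) {
  const result = [];
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    result.push(cp.toString(16));
    // Code points above 0xFFFF occupy two code units, so skip both.
    i += cp > 0xffff ? 2 : 1;
  }
  return result;
}

console.log(codePointsOf("a𠮷b")); // ["61", "20bb7", "62"]

// The string iterator (used by for...of and spread) also walks by code point.
console.log([..."a𠮷b"].length); // 3
console.log("a𠮷b".length);      // 4
```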
Lowbred
Using an example with strings and emojis, I am going to illustrate how things can go wrong when you do not know that some characters consist of two code units. Prefer codePointAt() over charCodeAt(), and use charCodeAt() only if you are sure that all of your characters lie between 0 and 65535 (2^16 - 1).
// charCodeAt() is UTF-16
// codePointAt() is Unicode
/* UTF-16 is generally considered a bad idea today */
const strings = ["o", "four", "to"];
const emojis = ["🐎", "👟"];
function printItemsLength(arr) {
  for (const item of arr) {
    console.log(item, item.length);
  }
}
printItemsLength(strings);
console.log('================================');
printItemsLength(emojis);
console.log('================================');
console.log("i.charCodeAt(0)", "i".charCodeAt(0)); // 105
console.log("i.charCodeAt(1)", "i".charCodeAt(1)); // NaN, because "i" has no index 1
console.log("i.codePointAt(0)", "i".codePointAt(0)); // 105
console.log('=============EMOJIS=============');
// getting the decimal (dec) by which you can find them
console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[0] + '.charCodeAt(0)', emojis[0].charCodeAt(0)); // only half of the character (high surrogate) - 55357
console.log(emojis[0] + '.charCodeAt(1)', emojis[0].charCodeAt(1)); // only half of the character (low surrogate) - 56334
console.log('===========codePointAt===========');
console.log(emojis[0] + '.codePointAt(0)', emojis[0].codePointAt(0)); // 128014
console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[1] + '.charCodeAt(0)', emojis[1].charCodeAt(0)); // only half of the character (high surrogate) - 55357
console.log(emojis[1] + '.charCodeAt(1)', emojis[1].charCodeAt(1)); // only half of the character (low surrogate) - 56415
console.log('===========codePointAt===========');
// full-character
console.log(emojis[1] + '.codePointAt(0)', emojis[1].codePointAt(0)); // 128095
console.log(emojis[1] + '.codePointAt(1)', emojis[1].codePointAt(1)); // returns the low surrogate, 56415 (not displayable on its own)
// to find these emojis, have a look here: https://www.w3schools.com/charsets/ref_emoji.asp
As you may have noticed, I tried to convert a char code back to the emoji, and it did not work for one of the symbols. That is because its code point does not fit in a single 16-bit UTF-16 code unit.
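The conversion code itself is not shown here, so the following is only a sketch of the likely cause, assuming the conversion was done with String.fromCharCode().

```javascript
const shoe = "\u{1F45F}"; // 👟
const code = shoe.codePointAt(0); // 128095

// String.fromCharCode() keeps only the low 16 bits of each argument,
// so 128095 (0x1F45F) is truncated to 0xF45F, a different character.
const broken = String.fromCharCode(code);
// String.fromCodePoint() accepts the full range up to 0x10FFFF.
const restored = String.fromCodePoint(code);

console.log(broken === shoe);   // false
console.log(restored === shoe); // true

// fromCharCode() still works when it receives both surrogate halves.
const fromHalves = String.fromCharCode(shoe.charCodeAt(0), shoe.charCodeAt(1));
console.log(fromHalves === shoe); // true
```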
Please skip this section if you are already familiar with it.
Unicode - a set of characters used around the world. It assigns every character a unique number called a code point.
UTF-16 - an encoding that stores each code point as one or two 16-bit code units: 00000000 00100100 for "$" (one 16-bit unit); 11011000 01010010 11011111 01100010 for "𤭢" (two 16-bit units).
"Surrogate pair" characters are emoji and some letters that consist of more than one code unit, as the following quote explains:
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme. In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.
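The bit patterns above can be reproduced with a little arithmetic. This is an illustrative sketch; toSurrogatePair is a hypothetical helper name, and it assumes the input is a code point above 0xFFFF.

```javascript
// Hypothetical helper: split a code point above 0xFFFF into its surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x24b62); // code point of "𤭢"
console.log(high.toString(2).padStart(16, "0")); // 1101100001010010
console.log(low.toString(2).padStart(16, "0"));  // 1101111101100010
console.log(String.fromCharCode(high, low) === "\u{24B62}"); // true
```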
charCodeAt() vs. codePointAt()
charCodeAt(pos) returns a code unit (not a full character).
If you need a character (which could be either one or two code units), you can use codePointAt(pos) to get its code.
charCodeAt() - returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
codePointAt() - returns a non-negative integer that is the Unicode code point value at the given position.
In both methods, pos is an index counted in UTF-16 code units, not in characters.
Quote from the book:
UTF-16 is generally considered a bad idea today. It seems almost intentionally designed to invite mistakes. It’s easy to write programs that pretend code units and characters are the same things.
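A short sketch of the kind of mistake the quote warns about; the sample string is arbitrary.

```javascript
const text = "to \u{1F40E}"; // "to 🐎"

console.log(text.length);      // 5 code units
console.log([...text].length); // 4 code points

// Slicing by code unit index can cut a surrogate pair in half.
const cut = text.slice(0, 4);
console.log(cut.charCodeAt(3) === 0xd83d); // true: a lone high surrogate is left over

// Slicing the array of code points keeps the emoji intact.
const safe = [...text].slice(0, 4).join("");
console.log(safe === text); // true
```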
jsfiddle sandbox
Sources:
Chapter 5, p. 91 => Strings and character codes