Why does a string containing a single emoji, like "👍", have a length of 2?

3

23

How does any textarea in my browser handle what seems to be 2 characters represented as one?

For example:

"👍".length
// -> 2

More examples here: https://jsbin.com/zazexenigi/edit?js,console

Robena answered 13/7, 2016 at 7:36 Comment(1)
See developer.teradata.com/blog/jasonstrimpel/2011/11/…. – Exclosure
N
21

JavaScript uses UTF-16 (source) to represent strings.

UTF-16 can encode 1,112,064 possible characters. Each character is identified by a code point, and UTF-16 stores code points as 16-bit (two-byte) code units(*). A single code unit can therefore distinguish only 65,536 different characters.

This means some characters have to be represented with two code units (a surrogate pair).

String.length returns the number of code units in the string, not the number of characters.

MDN explains this well on its page about String.length:

This property returns the number of code units in the string. UTF-16, the string format used by JavaScript, uses a single 16-bit code unit to represent the most common characters, but needs to use two code units for less commonly-used characters, so it's possible for the value returned by length to not match the actual number of characters in the string.
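
To make the code unit / code point distinction concrete, here is a minimal sketch that inspects "👍" (U+1F44D). The helper fromSurrogates is a hypothetical name introduced only for illustration; it applies the standard UTF-16 surrogate arithmetic by hand.

```javascript
// "👍" is U+1F44D, which lies outside the Basic Multilingual Plane.
const thumbsUp = "\u{1F44D}";

console.log(thumbsUp.length);                      // 2 (UTF-16 code units)
console.log(thumbsUp.charCodeAt(0).toString(16));  // d83d (high surrogate)
console.log(thumbsUp.charCodeAt(1).toString(16));  // dc4d (low surrogate)
console.log(thumbsUp.codePointAt(0).toString(16)); // 1f44d (the single code point)

// Hypothetical helper: recombine a surrogate pair into its code point.
function fromSurrogates(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

console.log(fromSurrogates(0xd83d, 0xdc4d).toString(16)); // 1f44d
```

charCodeAt reads individual code units, while codePointAt combines a surrogate pair into the code point it encodes.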

(*): Characters in the ranges 010000 – 03FFFF and 040000 – 10FFFF take 4 bytes (32 bits) in UTF-16, that is, two code units per code point. This doesn't change the answer: some characters require more than 2 bytes, so they need more than one code unit.

Be careful when testing this with String.fromCharCode(): it truncates its argument to 16 bits, so String.fromCharCode(0x03FFFF) produces the single code unit 0xFFFF and reports a length of 1, even though the code point 0x03FFFF needs 18 bits. String.fromCodePoint() builds the actual character, which takes two code units:

console.log(String.fromCharCode(0x03FFFF).length)  // 1 (argument truncated to 0xFFFF)
console.log(String.fromCodePoint(0x03FFFF).length) // 2
Neotropical answered 13/7, 2016 at 7:50 Comment(5)
I think only ES2015 uses UTF-16 both internally on the engine and on the language level. ES5 encodes with UCT-2 (at least on the language level). Besides there is only one code point per character (from 0x0 to 0x10FFFF)) which is represented by one to two code units. Because string.length interprets code units as single characters it computes wrong results for characters outside the Basic Multilingual Plane (BMP). – Synchromesh
@LUH3417 afaik ES5 uses UTF-16 as well: When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. es5.github.io – Neotropical
Oh, my mistake. It is called UCS-2 and ES5 engines can use either of them (UCS-2/UTF-16). – Synchromesh
@LUH3417 please feel free to improve my answer :) – Neotropical
No need. The comments do the job. More about Unicode in ES2015. – Synchromesh
I
14

I believe rpadovani answered your "why" question best, but for an implementation that will get you a proper glyph count in this situation, Lodash has tackled this problem in their toArray module.

For example,

_.toArray('12👪').length; // --> 3

Or, if you want to knock a few arbitrary characters off a string, you manipulate and rejoin the array, like:

_.toArray("👪trimToEightGlyphs").splice(0,8).join(''); // --> '👪trimToE'
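
For comparison, here is a minimal sketch of the same two operations without a library. It relies on the built-in string iterator, which yields whole code points. The helper truncateCodePoints is a hypothetical name, not part of Lodash, and it counts code points rather than glyphs, so emoji made of several code points may still be split.

```javascript
// Hypothetical helper (not part of Lodash): keep the first `n` code points.
function truncateCodePoints(str, n) {
  // Array.from uses the string iterator, which yields whole code points,
  // so a surrogate pair is never cut in half.
  return Array.from(str).slice(0, n).join("");
}

const familyEmoji = "\u{1F46A}"; // 👪

console.log(Array.from("12" + familyEmoji).length);                    // 3
console.log(truncateCodePoints(familyEmoji + "trimToEightGlyphs", 8)); // 👪trimToE
```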
Implausibility answered 6/9, 2017 at 22:15 Comment(2)
This can be done with native JS, there's no need for the extra lodash dependency: Array.from('12👪').length // --> 3. – Cretinism
Array.from() not working for all emojis. Try this console.log(Array.from("12👨‍👩‍👧").length); – Weaponless
P
9

I found a simple way to get the right result.
Here it is:

'👍Some text with emojis👍'.match(/./gu)

It should return:

[ "👍", "S", "o", "m", "e", " ", "t", "e", "x", "t", " ", "w", "i", "t", "h", " ", "e", "m", "o", "j", "i", "s", "👍" ]

You can then apply .length on it:

'👍'.match(/./gu).length == 1

It uses a regex match : /./gu

. matches any single character.
g mean 'global' : it basicly allow to not stop after the first match.
u mean 'unicode' : it allows to show characters the right way (without it πŸ‘ would show up as οΏ½οΏ½ (so 2 characters))

Btw you can add m to support multi line (/./gum)
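
To check what each flag contributes, here is a small sketch comparing the match counts with and without the u flag, and with the m and s flags on a two-line string. The variable names are chosen only for illustration.

```javascript
const emoji = "\u{1F44D}"; // 👍

console.log(emoji.match(/./g).length);  // 2: without u, each surrogate matches on its own
console.log(emoji.match(/./gu).length); // 1: with u, the pair is matched as one code point

const twoLines = "a\nb";

console.log(twoLines.match(/./gum).length); // 2: m does not make "." match the line feed
console.log(twoLines.match(/./gsu).length); // 3: s (dotAll) does
```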

Politicize answered 8/10, 2020 at 11:47 Comment(3)
Does not work for all cases. For example, "👩‍❤️‍💋‍👩".match(/./gu).length outputs 8 – Outhouse
To @BrianK.'s point, I am sure some people will look for an answer how to count it as "one" but this is actually exactly what I need since these ARE actually 8 characters (paste it in a text editor that supports Unicode and start pressing Backspace, you'll see what is there!) and this is exactly how many "chars" it will take in MySQL's varchar field. So if you want to check to see if it is going to fit in your database before inserting it - this is probably what you want. – Olathe
It doesn't count all emojis as one char. For instance 🕯️ is 2 chars and 🐻‍❄️ is 4 chars. – Disport
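
For readers who do want such sequences counted as one, here is a minimal sketch that assumes a runtime providing Intl.Segmenter (recent browsers and Node.js). The helper countGraphemes is a hypothetical name. The result depends on the Unicode data shipped with the engine, so treat the grapheme count as expected rather than guaranteed.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

// 👨‍👩‍👧 spelled out: man, zero-width joiner, woman, zero-width joiner, girl.
const zwjFamily = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

console.log(zwjFamily.length);                // 8 code units
console.log([...zwjFamily].length);           // 5 code points
console.log(countGraphemes(zwjFamily));       // expected: 1
console.log(countGraphemes("12" + zwjFamily)); // expected: 3
```

Which count is "right" depends on the use case: code units for JavaScript string indexing, code points for checks like the database column limit mentioned above, and grapheme clusters for what a user sees as one character.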
