PHP - length of string containing emojis/special chars
Asked Answered
B

3

15

I'm building an API for a mobile application and I seem to have a problem with counting the length of a string containing emojis. My code:

$str = "πŸ‘πŸΏβœŒπŸΏοΈ @mention";

printf("strlen: %d" . PHP_EOL, strlen($str));
printf("mb_strlen UTF-8: %d" . PHP_EOL, mb_strlen($str, "UTF-8"));
printf("mb_strlen UTF-16: %d" . PHP_EOL, mb_strlen($str, "UTF-16"));
printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("UTF-8", "UTF-16", $str)));
printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("ISO-8859-1", "UTF-16", $str)));

the response of this is:

strlen: 27
mb_strlen UTF-8: 14
mb_strlen UTF-16: 13
iconv UTF-16: 14
iconv UTF-16: 27

however i should get 17 as the result. We tried to cound the string length on iOS, android and windows phone, it's 17 everywhere. iOS (swift) snippet:

var str = "πŸ‘πŸΏβœŒπŸΏοΈ @mention"
(str as NSString).length // 17
count(str) // 13
count(str.utf16) // 17
count(str.utf8) // 27

We need to use the NSString because of a library. I need this to get the starting and ending position of the "@mention". If the string contains only text or only emojis, it works fine so probably there is some issue with mixed content.

What am i doing wrong? What other info can I provide you guys to get me in the right direction?

Thanks!

Boutwell answered 2/6, 2015 at 19:0 Comment(1)
try using mb_substr, mb_str length can be an option – Moribund
R
24

Your functions are all counting different things.

Graphemes:                 πŸ‘            🏿          ✌                🏿️                  @      m      e      n      t      i      o      n     13
                      -----------  -----------  --------  ---------------------  ------ ------ ------ ------ ------ ------ ------ ------ ------
Code points:            U+1F44D      U+1F3FF     U+270C     U+1F3FF     U+FE0F   U+0020 U+0040 U+006D U+0065 U+006E U+0074 U+0069 U+006F U+006E  14
UTF-16 code units:     D83D DC4D    D83C DFFF     270C     D83C DFFF     FE0F     0020   0040   006D   0065   006E   0074   0069   006F   006E   17
UTF-16-encoded bytes: 3D D8 4D DC  3C D8 FF DF   0C 27    3C D8 FF DF   0F FE    20 00  40 00  6D 00  65 00  6E 00  74 00  69 00  6F 00  6E 00   34
UTF-8-encoded bytes:  F0 9F 91 8D  F0 9F 8F BF  E2 9C 8C  F0 9F 8F BF  EF B8 8F    20     40     6D     65     6E     74     69     6F     6E    27

PHP strings are natively bytes.

strlen() counts the number of bytes in a string: 27.

mb_strlen(..., 'utf-8') counts the number of code points (Unicode characters) in a string when its bytes are decoded to characters using the UTF-8 encoding: 14.

(The other example counts are largely meaningless as they're based on treating the input string as one encoding when actually it contains data in a different encoding.)

NSStrings are natively counted as UTF-16 code units. There are 17, not 14, because the above string contains characters like πŸ‘ that don't fit in a single UTF-16 code unit, so have to be encoded as a surrogate pair. There aren't any functions that will count strings in UTF-16 code units in PHP, but because each code unit is encoded to two bytes, you can work it out easily enough by encoding to UTF-16 and dividing the number of bytes by two:

strlen(iconv('utf-8', 'utf-16le', $str)) / 2

(Note: the le suffix is necessary to make iconv encode to a particular endianness of UTF-16, and not foul up the count by choosing one and adding a BOM to the start of the string to say which one it chose.)

Ryun answered 2/6, 2015 at 22:13 Comment(5)
πŸ˜›πŸ˜›πŸ˜›πŸ˜›πŸ˜›πŸ˜›πŸ˜› says 14, but only 7!! Your method doesn't seem to work. – Output
@Sibidharan: what do you mean, β€œmy method”? What form of counting did you use and what did you expect? As per the table above, πŸ˜›πŸ˜›πŸ˜›πŸ˜›πŸ˜›πŸ˜›πŸ˜› is 7 Unicode code points, 14 UTF-16 code units, or 29 UTF-8 bytes. – Ryun
@Ryun jsfiddle.net/sibidharan/9kk1h4zh/6 Check this! I need to count the number of unicode points. The real challenge appears for multi color emojis πŸ™πŸ½πŸ™πŸΏπŸ™πŸ» like this. Check that fiddle, see how it counts. I am just asking is that possible to count like the same!? – Output
@Output that JS is confusing, broken and incomplete but it essentially seems to be trying to group together combining characters. Are you trying to count grapheme clusters in PHP? If so, you are looking for php.net/manual/en/function.grapheme-extract.php – Ryun
@Ryun how would you deal with emoji with Zero Width Joiner ? Like this one πŸ‘©β€β€οΈβ€πŸ’‹β€πŸ‘© – Plascencia
L
5

I have included a picture to help illustrate the answer that @bobince gave.

Essentially, all non-surrogate-pair code points end up as two bytes in UTF-16 while all surrogate-pair code points end up as four bytes. If we divide this by two we get the equivalent expected length value.

P.S. Please forgive the error in the image where it says "code points" and should say "code units"

unicode breakdown

Lornalorne answered 21/1, 2016 at 8:23 Comment(0)
C
5

I know this is an older question, but I recently came across this problem and this might help someone else too.

The string in the question:

$str = "πŸ‘πŸΏβœŒπŸΏοΈ @mention";

I would expect it to count 11: πŸ‘πŸΏβœŒπŸΏοΈ{2}, space{1}, @{1}, mention{7} so 2 + 1 + 1 + 7 = 11

PHP's intl extension has a grapheme_strlen() function that gets string length in grapheme units instead of bytes or characters (documentation)

<?php
$str = "πŸ‘πŸΏβœŒπŸΏοΈ @mention";
printf("strlen: %d" . PHP_EOL, strlen($str));
printf("mb_strlen UTF-8: %d" . PHP_EOL, mb_strlen($str, "UTF-8"));
printf("mb_strlen UTF-16: %d" . PHP_EOL, mb_strlen($str, "UTF-16"));
printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("UTF-8", "UTF-16", $str)));
printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("ISO-8859-1", "UTF-16", $str)));
printf("grapheme_strlen: %d" . PHP_EOL, grapheme_strlen($str));

output on PHP 7.4:

strlen: 27
mb_strlen UTF-8: 14
mb_strlen UTF-16: 13
PHP Notice:  iconv_strlen(): Detected an illegal character in input string in php shell code on line 1
iconv UTF-16: 0
PHP Notice:  iconv_strlen(): Detected an illegal character in input string in php shell code on line 1
iconv UTF-16: 0
grapheme_strlen: 11

Or try it here: https://www.tehplayground.com/s8e40DAslYX4Mozo

Calore answered 19/1, 2021 at 21:49 Comment(0)

© 2022 - 2024 β€” McMap. All rights reserved.