Convert a String into an Array of Characters - multi-byte

Asked 21/4, 2019 at 10:53 Answered 21/4, 2019 at 14:9

Solved php regex unicode split multibyte-characters

Assuming that in 2019 every solution that is not Unicode-safe is wrong, what is the best way to convert a string to an array of Unicode characters in PHP?

Obviously this means that accessing the bytes with the brace syntax is wrong, as well as using str_split:

$arr = str_split($text);
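As a minimal sketch of why this is wrong (the sample string here is only an illustration, not the test input below): str_split() cuts the string into single bytes, so every multi-byte character is torn apart.

```php
<?php
// "\u{...}" escapes (PHP 7.0+) keep the example independent of the file's encoding.
$text = "\u{00E9}\u{20AC}"; // "é€": 2 code points, 5 UTF-8 bytes

$bytes = str_split($text);  // splits per byte, not per character

echo count($bytes), "\n";                              // 5
echo implode(' ', array_map('bin2hex', $bytes)), "\n"; // c3 a9 e2 82 ac
```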

From sample input like:

$string = '先éé€𐍈💩👩‍ 👩‍❤️‍👩';

I expect:

array(16) {
  [0]=>
  string(3) "先"
  [1]=>
  string(2) "é"
  [2]=>
  string(1) "e"
  [3]=>
  string(2) "́"
  [4]=>
  string(3) "€"
  [5]=>
  string(4) "𐍈"
  [6]=>
  string(4) "💩"
  [7]=>
  string(4) "👩"
  [8]=>
  string(3) "‍"
  [9]=>
  string(1) " "
  [10]=>
  string(4) "👩"
  [11]=>
  string(3) "‍"
  [12]=>
  string(3) "❤"
  [13]=>
  string(3) "️"
  [14]=>
  string(3) "‍"
  [15]=>
  string(4) "👩"
}

Torritorricelli asked 21/4, 2019 at 10:53 Comment(2)

what do you want to encode in? utf8? – Tidbit 21/4, 2019 at 12:40

@DanyalSandeelo It doesn't matter which encoding. I only want to get the array of code-points instead of code-units, grapheme clusters or bytes. – Torritorricelli 21/4, 2019 at 18:51

Just pass an empty pattern with the u modifier and the PREG_SPLIT_NO_EMPTY flag. Otherwise, you can write a pattern with \X (the Unicode version of the dot) and \K (which restarts the full-string match). I'll include an mb_split() call and a preg_match_all() call for completeness.

Code: (Demo)

$string='先秦兩漢';
var_export(preg_split('~~u', $string, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_split('~\X\K~u', $string, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_split('~\X\K(?!$)~u', $string));
echo "\n---\n";
var_export(mb_split('\X\K(?!$)', $string));
echo "\n---\n";
var_export(preg_match_all('~\X~u', $string, $out) ? $out[0] : []);

All produce:

array (
  0 => '先',
  1 => '秦',
  2 => '兩',
  3 => '漢',
)

From https://www.regular-expressions.info/unicode.html:

How to Match a Single Unicode Grapheme

Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, Boost, Ruby 2.0, Java 9, and the Just Great Software applications: simply use \X.

You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
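The sample string above contains no combining marks, so every pattern gives the same result. A small sketch with decomposed input (my own test string, not the asker's) suggests where the empty pattern and \X part ways: the former yields code points, the latter grapheme clusters.

```php
<?php
// Decomposed "é" (e + U+0301 combining acute accent) followed by "€".
$string = "e\u{0301}\u{20AC}";

// Empty pattern with the u modifier: one element per code point.
$codePoints = preg_split('~~u', $string, 0, PREG_SPLIT_NO_EMPTY);

// \X: one element per grapheme cluster, so the accent stays with its letter.
$graphemes = preg_match_all('~\X~u', $string, $out) ? $out[0] : [];

echo count($codePoints), "\n"; // 3
echo count($graphemes), "\n";  // 2
```

Which of the two is "correct" depends on whether code points or user-perceived characters are wanted.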

UPDATE: Dharman has brought to my attention that mb_str_split() is available as of PHP 7.4.

The default length parameter of the new function is 1, so the length parameter can be omitted for this case.

https://wiki.php.net/rfc/mb_str_split

Dharman's demo: https://3v4l.org/M85Fi/rfc#output
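For reference, a minimal sketch of the call, assuming PHP 7.4+ with the mbstring extension loaded. The sample string is my own; the encoding is passed explicitly only so the example does not depend on the configured internal encoding.

```php
<?php
// 先, then a decomposed é (e + U+0301 combining acute accent), then €.
$string = "\u{5148}e\u{0301}\u{20AC}";

// The length argument defaults to 1; it is spelled out here only
// because the encoding argument follows it.
$chars = mb_str_split($string, 1, 'UTF-8');

echo count($chars), "\n";                             // 4
echo implode(' ', array_map('strlen', $chars)), "\n"; // 3 1 2 3
```

Each element is one code point, so the combining accent is returned as its own element.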

UPDATE (2024-04-10):

The RFC for grapheme_str_split() has passed unanimously and is proposed for inclusion in PHP 8.4. This provides a clean, native solution which will preserve bound multi-byte "clusters" (such as emojis and variation selectors).

$string = '🙇‍♂️';
var_export(grapheme_str_split($string)); // ['🙇‍♂️']

Here is what the result would be if the cluster were not held together (that is, split on individual multibyte characters):

[
    '🙇',
    "\u{200D}",   // U+200D Zero Width Joiner
    '♂',
    "\u{FE0F}",   // U+FE0F Variation Selector
]
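A hedged sketch that puts both views side by side. It assumes grapheme_str_split() is only present on PHP 8.4+ with the intl extension, so the call is guarded; the code-point split uses the empty-pattern approach from earlier in this answer.

```php
<?php
// 🙇 (U+1F647) + Zero Width Joiner + ♂ (U+2642) + Variation Selector-16.
$string = "\u{1F647}\u{200D}\u{2642}\u{FE0F}";

// Code-point split: four elements for this input.
$codePoints = preg_split('~~u', $string, 0, PREG_SPLIT_NO_EMPTY);
echo count($codePoints), "\n"; // 4

// Grapheme split: attempted only where the function exists.
if (function_exists('grapheme_str_split')) {
    $clusters = grapheme_str_split($string);
    echo count($clusters), "\n";
}
```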

I'll add a 3v4l.org demo when possible.

Stuppy answered 21/4, 2019 at 14:9 Comment(7)

Does that mean that there is no multi-byte version of str_split available natively in PHP other than the regular expression matching? I would imagine that for large applications, RegEx would be pretty slow. – Torritorricelli 21/4, 2019 at 18:50

You seem to have forgotten u modifier in preg_match_all. – Torritorricelli 21/4, 2019 at 19:11

Only your first solution seems to be passing my unit tests: 3v4l.org/OIZcM – Torritorricelli 21/4, 2019 at 19:13

Why do you want to split the accent from the second e, but not the first? It looks inconsistent to me. If you had test strings and an expected result before receiving answers, why didn't you post them in your question to begin with? The trouble with all of the new emojis/emoticons/etc. that use combining marks is presented here. It seems the regex engine is inconsistent about how to split them. – Stuppy 21/4, 2019 at 21:56

It seems reasonable to want to preserve the accent with its letter. Maybe: 3v4l.org/Bda4o – Stuppy 21/4, 2019 at 22:8

I prepared the test data after I asked the question. The second e with an accent is made up of 2 Unicode characters, and should be treated as such. There are many cases where several Unicode code points make up one grapheme, but at the end of the day, it is not a single character. Unicode contains combining characters, and some languages (e.g. Tamil) use letters made up of other letters, e.g. சி <-- 2 Unicode characters – Torritorricelli 21/4, 2019 at 22:55

mb_str_split was added in PHP 7.4, and it passes my unit test: 3v4l.org/M85Fi/rfc#output – Torritorricelli 27/4, 2019 at 23:32

This works for me. It explodes a Unicode string into an array of characters:

//
// Split at every position that is not right after the start (^)
// and not right before the end ($), with the Unicode modifier
// u (PCRE_UTF8).
//
$arr = preg_split("/(?<!^)(?!$)/u", $text);

For example:

<?php
//
$text = "堆栈溢出";

$arr = preg_split("/(?<!^)(?!$)/u", $text);

echo '<html lang="fr">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
</head>
<body>
';

print_r($arr);

echo '</body>
</html>
';
?>

In a browser, it produces this:

Array ( [0] => 堆 [1] => 栈 [2] => 溢 [3] => 出 )
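To spell out how the pattern works without any HTML, here is a console-only sketch. Both lookarounds are zero-width, so the pattern matches the empty string at every boundary between two code points, except at the very start and the very end. One caveat I am adding myself: without the D modifier, $ also matches just before a final newline, so a trailing "\n" stays glued to the last character; \z avoids that.

```php
<?php
$text = "堆栈溢出";

// (?<!^) : not preceded by the start of the string
// (?!$)  : not followed by the end of the string
// u      : treat subject and pattern as UTF-8, so positions are code-point boundaries
$arr = preg_split('/(?<!^)(?!$)/u', $text);
echo count($arr), "\n"; // 4

// $ also matches before a final newline, so no split happens there.
$withNewline = preg_split('/(?<!^)(?!$)/u', "ab\n"); // ['a', "b\n"]

// \z matches only at the very end of the string.
$strict = preg_split('/(?<!^)(?!\z)/u', "ab\n");     // ['a', 'b', "\n"]
```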

Methodical answered 21/4, 2019 at 13:15 Comment(2)

I would love to get a better explanation of your solution. And please do not assume that the output will be for HTML. – Torritorricelli 21/4, 2019 at 19:15

This HTML browser output is only there to show the UTF-8 characters, as some consoles do not display them but a browser does. – Methodical 22/4, 2019 at 8:33
