Check if a string is multibyte in PHP

I want to check whether a string is multibyte in PHP. Any idea how to accomplish this?

Example:

<?php
$string = "I dont have idea that is what i am...";
if( is_multibyte( $string ) )
{
    echo 'yes!!';
}else{
    echo 'ups!';
}
?>

Maybe something like this (rule 8 bytes):

<?php
if( mb_strlen( $string ) > strlen( $string ) )
{
    return true;
}
else
{
    return false;
}
?>

I have read: Variable-width encoding (Wikipedia) and UTF-8 (Wikipedia).

Hosmer answered 29/5, 2013 at 18:42 Comment(0)

W

10

There are two interpretations: either every character in the string is multibyte, or the string contains at least one multibyte character. If you are interested in handling invalid byte sequences, see https://mcmap.net/q/145636/-replacing-invalid-utf-8-characters-by-question-marks-mbstring-substitute_character-seems-ignored for details.

function is_all_multibyte($string)
{
    // Reject strings that contain an invalid UTF-8 byte sequence
    if (mb_check_encoding($string, 'UTF-8') === false) return false;

    $length = mb_strlen($string, 'UTF-8');

    for ($i = 0; $i < $length; $i += 1) {
        $char = mb_substr($string, $i, 1, 'UTF-8');

        // A single-byte (ASCII) character means not every character is multibyte
        if (mb_check_encoding($char, 'ASCII')) {
            return false;
        }
    }

    return true;
}

function contains_any_multibyte($string)
{
    return !mb_check_encoding($string, 'ASCII') && mb_check_encoding($string, 'UTF-8');
}

$data = ['東京', 'Tokyo', '東京(Tokyo)'];

var_dump(
    [true, false, false] ===
    array_map(function($v) {
        return is_all_multibyte($v);
    },
    $data),
    [true, false, true] ===
    array_map(function($v) {
        return contains_any_multibyte($v);
    },
    $data)
);
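The two checks above can also be sketched with regular expressions. This is only an alternative under stated assumptions, not part of the original answer: the function names `is_all_multibyte_regex` and `contains_any_multibyte_regex` are hypothetical, and the input is expected to be UTF-8. With the `/u` modifier, `preg_match()` returns `false` for invalid UTF-8, so malformed input is rejected by the `=== 1` comparison.

```php
<?php
// Hypothetical regex-based counterparts of the two functions above.
// With the /u modifier preg_match() returns false on invalid UTF-8,
// so the "=== 1" comparison also rejects malformed input.

function is_all_multibyte_regex(string $string): bool
{
    // \A ... \z anchor the whole string; the class excludes every ASCII code point.
    return preg_match('/\A[^\x00-\x7F]*\z/u', $string) === 1;
}

function contains_any_multibyte_regex(string $string): bool
{
    // One code point outside the ASCII range is enough.
    return preg_match('/[^\x00-\x7F]/u', $string) === 1;
}

foreach (['東京', 'Tokyo', '東京(Tokyo)'] as $value) {
    var_dump(is_all_multibyte_regex($value), contains_any_multibyte_regex($value));
}
```

Like the loop-based version, the "all" variant treats an empty string as all-multibyte, since there is no ASCII character in it to reject.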

Willow answered 31/5, 2013 at 8:55 Comment(1)

Related question: Determine if UTF-8 text is all ASCII? – Willow 4/6, 2013 at 6:6

I

10

I'm not sure if there's a better way, but a quick way that comes to mind is:

if (mb_strlen($str) != strlen($str)) {
    echo "yes";
} else {
    echo "no";
}
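To make the difference between the two counts concrete, here is a small sketch. It assumes the source file and the sample string are UTF-8 encoded, and it passes the encoding to `mb_strlen()` explicitly rather than relying on the configured internal encoding.

```php
<?php
// Sample string, assumed to be stored as UTF-8.
$str = 'Йохо-хо!';

$bytes = strlen($str);              // counts bytes
$chars = mb_strlen($str, 'UTF-8');  // counts characters

// Six Cyrillic letters take 2 bytes each; '-' and '!' take 1 byte each.
printf("%d bytes, %d characters\n", $bytes, $chars); // 14 bytes, 8 characters

var_dump($bytes !== $chars); // bool(true): at least one multibyte character
```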

Imbecility answered 29/5, 2013 at 18:45 Comment(4)

But this does not check whether the string is multibyte... I need is_multibyte( "Escribiendo en Ruso: Йохо-хо! И бутылка рома! Now in English..." ) – Hosmer 29/5, 2013 at 18:52

strlen will count bytes, whereas mb_strlen will count characters. If there's any multi-byte character in your string, these two will be different – Imbecility 29/5, 2013 at 18:53

So if there is a difference and mb_strlen is greater, it would be multibyte? :) – Hosmer 29/5, 2013 at 19:5

@OlafErlandsen: that's the other way round: strlen() would return something greater than mb_strlen() – Imbecility 29/5, 2013 at 19:14


M

2

To determine whether something is multibyte, you need to be specific about which character set you're using. If your character set is Latin1, for example, no string will be multibyte. If your character set is UTF-16, every character takes at least two bytes, so every non-empty string is multibyte.

That said, if you only care about a specific character set, say UTF-8, you can use an mb_strlen < strlen test, provided you specify the encoding parameter explicitly.

function is_multibyte($s) {
  return mb_strlen($s,'utf-8') < strlen($s);
}
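The point about character sets can be illustrated by encoding the same text in different ways. The sketch below is an assumption-laden illustration rather than part of the answer: `is_multibyte_in` is a hypothetical generalisation of the function above that takes the encoding as a parameter, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical variant of is_multibyte() with an explicit encoding argument.
function is_multibyte_in(string $s, string $encoding): bool
{
    // More bytes than characters means at least one character spans several bytes.
    return mb_strlen($s, $encoding) < strlen($s);
}

$utf8   = 'Tokyo';                                           // pure ASCII text
$utf16  = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');   // same text, 2 bytes per character
$latin1 = mb_convert_encoding('é', 'ISO-8859-1', 'UTF-8');   // 1 byte in Latin1, 2 bytes in UTF-8

var_dump(is_multibyte_in($utf8, 'UTF-8'));        // bool(false)
var_dump(is_multibyte_in($utf16, 'UTF-16LE'));    // bool(true)
var_dump(is_multibyte_in('é', 'UTF-8'));          // bool(true)
var_dump(is_multibyte_in($latin1, 'ISO-8859-1')); // bool(false)
```

The result is only meaningful when the encoding passed in matches the one the string is actually stored in.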

Marble answered 29/5, 2013 at 19:43 Comment(0)
