Determine if UTF-8 text is all ASCII? [duplicate]
B

3

6

What's the fastest way, in PHP, to determine if some given UTF-8 text is purely ASCII or not?

Burletta answered 10/11, 2010 at 18:16 Comment(0)
P
15

A possibly faster approach is to use a negated character class (the regex can stop as soon as it hits the first non-ASCII byte, and there's no need to capture anything internally):

function isAscii($str) {
    // Matches any byte outside 0x00-0x7F; no match means the string is pure ASCII.
    return preg_match('/[^\x00-\x7F]/', $str) === 0;
}

Without regex (based on my comment):

function isAscii($str) {
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        if (ord($str[$i]) > 127) return false;
    }
    return true;
}

But I'd have to ask, why are you so concerned about faster? Use the more readable and easier to understand version, and only worry about optimizing it when you know it's a problem...

Edit:

Another option is mb_check_encoding:

function isAscii($str) {
    return mb_check_encoding($str, 'ASCII');
}
Palaeontology answered 10/11, 2010 at 18:41 Comment(6)
This will be run over a lot of text frequently, and I think both of those are pretty readable, so faster is definitely better here. - Burletta
@philfreo: Updated the answer... But the best way for you to tell what's fastest is to actually benchmark the options under your own conditions... - Palaeontology
But apparently PHP's ord function has issues with UTF-8. - Syntactics
No, ord() is always a single-byte "value of this byte" function. - Palaeontology
Note that mb_check_encoding is extremely slow; the preg_match approach will win, always. - Sapless
BENCHMARK (on a small ASCII string): the regex is the fastest approach. The for loop and mb_check_encoding are ~7x slower. - Khelat
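
For anyone who wants to check the timing claims in these comments against their own data, here is a minimal benchmark sketch. The function names (isAsciiRegex, isAsciiLoop, isAsciiMb, timeIt), the sample string, and the iteration count are all made up for illustration. The mb_check_encoding variant is only timed when the mbstring extension is loaded. Results depend on PHP version and input, so no expected numbers are shown.

```php
<?php
// The three approaches discussed above, under hypothetical names.
function isAsciiRegex(string $str): bool
{
    return preg_match('/[^\x00-\x7F]/', $str) === 0;
}

function isAsciiLoop(string $str): bool
{
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        if (ord($str[$i]) > 127) {
            return false;
        }
    }
    return true;
}

function isAsciiMb(string $str): bool
{
    // Requires the mbstring extension.
    return mb_check_encoding($str, 'ASCII');
}

// Runs $fn on $input $iterations times and returns the elapsed seconds.
function timeIt(callable $fn, string $input, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($input);
    }
    return microtime(true) - $start;
}

// Assumed workload: a pure-ASCII string of a few kilobytes.
$sample = str_repeat('The quick brown fox. ', 200);

$candidates = ['isAsciiRegex', 'isAsciiLoop'];
if (function_exists('mb_check_encoding')) {
    $candidates[] = 'isAsciiMb';
}

foreach ($candidates as $name) {
    printf("%-14s %.4f s\n", $name, timeIt($name, $sample, 500));
}
```

Swap $sample for text that resembles your real input. A string with an early non-ASCII byte lets the regex and the loop exit almost immediately, which changes the picture considerably.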
G
2

Check if any byte is greater than 0x7f, or any character is above U+007F.
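
A rough sketch of both checks, in case it helps. The function names are hypothetical. For valid UTF-8 the two give the same answer, because every code point above U+007F is encoded using bytes of 0x80 or higher. The code-point version here assumes the input is meant to be UTF-8 and treats invalid UTF-8 as "not ASCII".

```php
<?php
// Byte-level check: true when no byte is greater than 0x7F.
function isAsciiByBytes(string $str): bool
{
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        if (ord($str[$i]) > 0x7F) {
            return false;
        }
    }
    return true;
}

// Code-point-level check: true when no character is above U+007F.
// The /u modifier makes PCRE read the subject as UTF-8; on invalid
// UTF-8 preg_match() returns false, so this function returns false.
function isAsciiByCodePoints(string $str): bool
{
    return preg_match('/[^\x{00}-\x{7F}]/u', $str) === 0;
}

var_dump(isAsciiByBytes('hello'));             // pure ASCII
var_dump(isAsciiByCodePoints("h\xC3\xA9llo")); // contains U+00E9
```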

Guayaquil answered 10/11, 2010 at 18:17 Comment(2)
Quite simple: $isNotAscii = false; for ($i=0,$len=strlen($string);$i<$len;$i++) { if (ord($string[$i]) > 127) { $isNotAscii = true; break; } }. It iterates over each byte of the string looking for a value > 127... - Palaeontology
I believe preg_match will be faster in this case... I did not benchmark it, but for string pattern matching it almost always is. - Domineer
B
1
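function isAscii($str) {
    // preg_match returns 1, 0, or false; compare so the function returns a bool
    return preg_match('/^[\x00-\x7F]*$/', $str) === 1;
}

// rejects ASCII control characters other than tab, LF, and CR
function isAsciiText($str) {
    return preg_match('/^[\x09\x0A\x0D\x20-\x7E]*$/', $str) === 1;
}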
Burletta answered 10/11, 2010 at 18:22 Comment(1)
This will fail on some valid ASCII control characters. - Doublebank
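
To make that difference concrete, here is a small sketch that runs both patterns from this answer over a few sample strings. The typed signatures, the strict === 1 comparison, and the sample inputs are assumptions added for illustration.

```php
<?php
function isAscii(string $str): bool
{
    return preg_match('/^[\x00-\x7F]*$/', $str) === 1;
}

function isAsciiText(string $str): bool
{
    return preg_match('/^[\x09\x0A\x0D\x20-\x7E]*$/', $str) === 1;
}

$samples = [
    'plain text'  => "hello\tworld\n", // tab and LF are allowed by both
    'escape code' => "color \x1B[0m",  // ESC (0x1B) is ASCII but a control character
    'utf-8'       => "caf\xC3\xA9",    // U+00E9 encoded as two bytes >= 0x80
];

foreach ($samples as $label => $s) {
    printf(
        "%-12s isAscii=%s isAsciiText=%s\n",
        $label,
        var_export(isAscii($s), true),
        var_export(isAsciiText($s), true)
    );
}
```

Only the string containing ESC is treated differently: it passes isAscii but not isAsciiText, which is the case the comment above is pointing at.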
