There are no multibyte 'preg' functions available in PHP, so does that mean the default preg functions are all multibyte-safe? I couldn't find any mention of this in the PHP documentation.
PCRE can support UTF-8 and other Unicode encodings, but it has to be specified at compile time. From the man page for PCRE 8.0:
The current implementation of PCRE corresponds approximately with Perl 5.10, including support for UTF-8 encoded strings and Unicode general category properties. However, UTF-8 and Unicode support has to be explicitly enabled; it is not the default. The Unicode tables correspond to Unicode release 5.1.
PHP currently uses PCRE 7.9; your system might have an older version.
Taking a look at the PCRE lib that comes with PHP 5.2, it appears that it's configured to support Unicode properties and UTF-8. Same for the 5.3 branch.
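If you would rather check your own build than read the library source, something like the following should work. It is only a sketch: it assumes that a pattern which cannot be compiled makes preg_match() return false with a warning (suppressed here by @), and the variable names are mine.

```php
<?php
// PCRE_VERSION is a predefined constant with the version of the linked PCRE library.
echo 'PCRE version: ', PCRE_VERSION, "\n";

// If UTF-8 mode is not compiled in, a pattern with the u modifier cannot be
// compiled, so preg_match() returns false instead of 1.
$hasUtf8 = @preg_match('/^.$/u', "\xC3\xA4") === 1;

// \p{L} additionally needs Unicode property support.
$hasUnicodeProps = @preg_match('/^\p{L}$/u', "\xC3\xA4") === 1;

var_dump($hasUtf8, $hasUnicodeProps);
```

On a build configured as described above, both values should come out true.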
PCRE supports UTF-8 out of the box; see the documentation for the 'u' modifier.
Illustration (\xC3\xA4 is the UTF-8 encoding of the German letter "ä"):
echo preg_replace('~\w~', '@', "a\xC3\xA4b");
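this echoes "@@¤@" here, because "\xC3" and "\xA4" were treated as two distinct single-byte symbols (the exact output depends on the locale, which decides whether a lone byte such as "\xC3" counts as a word character)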
echo preg_replace('~\w~u', '@', "a\xC3\xA4b");
(note the 'u') prints "@@@" because "\xC3\xA4" was treated as a single letter.
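Leaving out the 'u' can do more than miscount letters: a pattern that consumes a single byte can cut a character in half and leave invalid UTF-8 behind, which functions such as json_encode then reject. A minimal sketch of that effect (the variable names are only for illustration):

```php
<?php
$s = "\xC3\xA4b"; // "äb" in UTF-8

// Without u the dot matches one byte, so only the first half of "ä" is removed.
$broken = preg_replace('/^./', '', $s);  // "\xA4b", no longer valid UTF-8

// With u the dot matches one whole character.
$clean = preg_replace('/^./u', '', $s);  // "b"

var_dump(json_encode($broken)); // false, json_last_error() is JSON_ERROR_UTF8
var_dump(json_encode($clean));  // the JSON string "b" (including the quotes)
```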
json_encode-ing a string after calling preg_replace, but failing because preg_replace converted some UTF-8 characters to the replacement character. The u modifier saved my day!!! Thanks a lot for that. – Fractocumulus
No, they are not. See the question preg_match and UTF-8 in PHP for example.
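One concrete place where this shows: offsets reported with PREG_OFFSET_CAPTURE are counted in bytes, even when the 'u' modifier is set. A small sketch, with one possible way to turn the byte offset into a character offset (counting the characters in front of the match is my own workaround, not something the preg API offers directly):

```php
<?php
$s = "a\xC3\xA4b"; // "aäb": 3 characters, 4 bytes

preg_match('/b/u', $s, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1]; // 3: counted in bytes, although "b" is the third character

// Count the UTF-8 characters that precede the match to get a character offset.
$charOffset = preg_match_all('/./su', substr($s, 0, $byteOffset)); // 2

var_dump($byteOffset, $charOffset);
```

The byte offset is what substr() and the other plain string functions expect, so it only needs converting when you hand it to something that counts characters.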
PREG_OFFSET_CAPTURE produces byte offsets rather than character offsets. It's coherent with string handling in PHP but it can be pretty confusing. – Yes
offset() or byteOffset() methods to get offsets in characters or bytes. – Magdalen
No, you need to use the multibyte string functions like mb_ereg.
ereg functions, though, which aren't exactly the same as the PCRE preg functions. – Aftertaste
preg_match with /u modifier works a treat! thank you @Dibbrun – Reaves
Some of my more complicated preg functions:
(1a) validate username as alphanumeric + underscore:
preg_match('/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/',$username)
(1b) possible UTF alternative:
preg_match('/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/u',$username)
(2a) validate email:
preg_match("/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix",$email)
(2b) possible UTF alternative:
preg_match("/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ixu",$email)
(3a) normalize newlines:
preg_replace("/(\n){2,}/","\n\n",$str);
(3b) possible UTF alternative:
preg_replace("/(\n){2,}/u","\n\n",$str);
Do these changes look alright?
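A note on the (b) variants: adding 'u' only changes how the pattern and subject are read (as UTF-8 characters instead of bytes). It does not make an ASCII class such as [A-Za-z] match non-ASCII letters, so (1b) still rejects a name like "Jürgen_1". If accepting such names is the goal, Unicode properties are one option. The following is a sketch under that assumption; the function names are hypothetical, and using \R for the newline case is my suggestion rather than part of the original patterns.

```php
<?php
// (1) Unicode-aware username: a letter first, then letters/digits,
// with single underscores allowed between groups.
function isUnicodeUsername(string $username): bool
{
    // \p{L} = any letter, \p{N} = any number; u makes PCRE read UTF-8 characters.
    return preg_match('/^\p{L}[\p{L}\p{N}]*(?:_[\p{L}\p{N}]+)*$/u', $username) === 1;
}

// (3) Collapse runs of two or more line breaks. \R matches \r\n, \n, \r and,
// in UTF-8 mode, the Unicode line separators as well.
function normalizeNewlines(string $str): ?string
{
    // preg_replace() returns null on error, e.g. when $str is not valid UTF-8.
    return preg_replace('/\R{2,}/u', "\n\n", $str);
}

$name = "J\xC3\xBCrgen_1"; // "Jürgen_1"

// The ASCII pattern from (1b): u does not make [A-Za-z] match "ü".
var_dump(preg_match('/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/u', $name)); // int(0)
var_dump(isUnicodeUsername($name));                                          // bool(true)
var_dump(normalizeNewlines("a\r\n\r\n\r\nb") === "a\n\nb");                  // bool(true)

// Invalid UTF-8 in the subject makes a u-mode match fail outright.
var_dump(preg_match('/a/u', "a\xFF"));                                       // bool(false)
```

The last line is worth keeping in mind for all of the (b) variants: with 'u', a subject that is not valid UTF-8 gives false (or null from preg_replace) rather than "no match", so the return value needs a strict comparison.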