Are the PHP preg_functions multibyte safe?

5

33

There are no multibyte 'preg' functions available in PHP, so does that mean the default preg_ functions are all multibyte safe? I couldn't find any mention of this in the PHP documentation.

Rundown answered 19/11, 2009 at 20:58 Comment(1)

I'm 90% sure the underlying C functions are, but that doesn't mean the PHP versions are, I suppose... – Lipocaic 19/11, 2009 at 21:0

27

PCRE can support UTF-8 and other Unicode encodings, but it has to be specified at compile time. From the man page for PCRE 8.0:

The current implementation of PCRE corresponds approximately with Perl 5.10, including support for UTF-8 encoded strings and Unicode general category properties. However, UTF-8 and Unicode support has to be explicitly enabled; it is not the default. The Unicode tables correspond to Unicode release 5.1.

PHP currently uses PCRE 7.9; your system might have an older version.

Taking a look at the PCRE lib that comes with PHP 5.2, it appears that it's configured to support Unicode properties and UTF-8. Same for the 5.3 branch.

Monge answered 19/11, 2009 at 21:6 Comment(3)

I'm using PHP 5.3.0 which includes PCRE Version 7.9, I checked the PCRE config.h file which includes the UTF8 definition, so looks like the preg_funcs are safe. Thanks very much for the info! – Rundown 19/11, 2009 at 21:50

Is there a quick way to determine which version of PCRE an existing PHP installation is using? My server for instance is running PHP 5.5, but how can I tell what PCRE library it was compiled with? – Dacoity 27/2, 2017 at 18:2

As a note to anyone using PREG_OFFSET_CAPTURE, the offset is in bytes, as such you'll want to use substr, not mb_substr and the like. – Viniferous 5/12, 2022 at 12:2
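On the question of which PCRE version an installation uses: PHP exposes it through the predefined PCRE_VERSION constant. The sketch below prints it and also probes whether the 'u' modifier is usable. The variable names are only illustrative, and the probe assumes that a build without UTF-8 support makes preg_match() fail when 'u' is used.

```php
<?php
// PCRE_VERSION is a predefined constant of PHP's PCRE extension.
// The exact string (version number plus release date) depends on the build.
$pcreVersion = PCRE_VERSION;
echo "PCRE version: ", $pcreVersion, "\n";

// Probe for UTF-8 support: on a UTF-8-enabled build the two-byte
// sequence for "ä" matches a single dot and preg_match() returns 1.
// If 'u' is rejected, preg_match() returns false (warning suppressed with @).
$hasUtf8 = @preg_match('/^.$/u', "\xC3\xA4") === 1;
echo "UTF-8 mode available: ", $hasUtf8 ? "yes" : "no", "\n";
```

From a shell, the PCRE section of the `php -i` output should show the same version information.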

34

PCRE supports UTF-8 out of the box; see the documentation for the 'u' pattern modifier.

Illustration (\xC3\xA4 is the UTF-8 encoding of the German letter "ä"):

  echo preg_replace('~\w~', '@', "a\xC3\xA4b");

This echoes "@@¤@" because "\xC3" and "\xA4" were treated as distinct symbols (the exact output without 'u' depends on the locale).

  echo preg_replace('~\w~u', '@', "a\xC3\xA4b");

With the 'u' modifier, this prints "@@@" because "\xC3\xA4" was treated as a single letter.
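The byte-versus-character difference can also be seen by counting matches of a single dot. This is a minimal sketch with illustrative variable names. It also shows a consequence of 'u' worth knowing: the subject must be valid UTF-8, otherwise the match fails rather than silently treating the input as bytes.

```php
<?php
$s = "a\xC3\xA4b"; // "aäb": 3 characters, 4 bytes in UTF-8

// Without 'u', the dot consumes one byte per match.
$bytes = preg_match_all('/./s', $s, $m1);

// With 'u', the dot consumes one UTF-8 encoded character per match.
$chars = preg_match_all('/./su', $s, $m2);

echo "$bytes bytes, $chars characters\n";

// With 'u', an invalid UTF-8 subject makes the call fail.
$invalid = "a\xC3"; // truncated two-byte sequence
$result = preg_match('/a/u', $invalid);
$isUtf8Error = (preg_last_error() === PREG_BAD_UTF8_ERROR);

var_dump($result);      // false
var_dump($isUtf8Error); // true
```

Checking the return value with `=== false` (or `=== null` for preg_replace) is therefore worthwhile whenever the input encoding is not guaranteed.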

Antecedents answered 19/11, 2009 at 21:41 Comment(3)

Really? Hmm, I'm not overly proficient with regex strings, if you don't mind I might post some of my preg_ code to see what you think? – Rundown 19/11, 2009 at 22:8

great for u modifier, I didn't know it – Wallasey 29/9, 2015 at 14:4

I was getting an error when json_encode-ing a string after calling preg_replace, because preg_replace had converted some UTF-8 characters to the replacement character. The u modifier saved my day! Thanks a lot for that. – Fractocumulus 5/12, 2019 at 16:2

2

No, they are not. See the question preg_match and UTF-8 in PHP for example.

Capon answered 19/11, 2009 at 21:3 Comment(2)

To clarify, the PREG_OFFSET_CAPTURE produces byte offsets rather than character offsets. It's coherent with string handling in PHP but it can be pretty confusing. – Yes 2/10, 2013 at 16:23

If you use T-Regx tool, you can use offset() or byteOffset() methods to get offsets in characters or bytes. – Magdalen 28/1, 2019 at 18:13
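The byte-offset behaviour of PREG_OFFSET_CAPTURE mentioned in the comments can be reproduced with plain preg_ functions. The sketch below is only one possible approach; the variable names are illustrative, and the character offset is derived by counting the characters in the prefix before the match.

```php
<?php
$s = "a\xC3\xA4b"; // "aäb"

// Even with 'u', the captured offset is a byte offset.
preg_match('/b/u', $s, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1]; // "a" is 1 byte, "ä" is 2 bytes

// substr() works on bytes, so it pairs naturally with that offset.
$tail = substr($s, $byteOffset);

// To get a character offset, count the characters before the match.
$prefix = substr($s, 0, $byteOffset);
$charOffset = preg_match_all('/./su', $prefix, $unused);

echo "byte offset: $byteOffset, character offset: $charOffset, tail: $tail\n";
```

If the mbstring extension is available, `mb_strlen($prefix, 'UTF-8')` should give the same character count.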

1

No, you need to use the multibyte string functions like mb_ereg

Trooper answered 19/11, 2009 at 21:3 Comment(4)

They're the multi-byte version of the POSIX ereg functions, though, which aren't exactly the same as the PCRE preg functions. – Aftertaste 19/11, 2009 at 21:28

Ben S you are my hero :) I just wanted to purify texts and leave äöüß within the text. preg_replace never did this properly, but mb_ereg does! – Demirelief 19/4, 2017 at 16:18

as long as you use the /u modifier, THEY ARE MULTIBYTE SAFE, as long as that multibyte encoding is UTF-8. the /u engine doesn't support any other encodings than UTF-8 – Dibbrun 7/7, 2017 at 14:26

preg_match with /u modifier works a treat! thank you @Dibbrun – Reaves 7/11, 2021 at 14:4

1

Some of my more complicated preg functions:

(1a) validate username as alphanumeric + underscore:

preg_match('/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/',$username)

(1b) possible UTF alternative:

preg_match('/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/u',$username)

(2a) validate email:

preg_match("/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix",$email)

(2b) possible UTF alternative:

preg_match("/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ixu",$email)

(3a) normalize newlines:

preg_replace("/(\n){2,}/","\n\n",$str);

(3b) possible UTF alternative:

preg_replace("/(\n){2,}/u","\n\n",$str);

Do these changes look alright?
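One point worth checking with these patterns: adding 'u' does not widen ASCII-only character classes such as [A-Za-z], so (1b) still rejects non-ASCII letters, and for (3b) the modifier makes no difference to what "\n" matches. If the goal were to accept Unicode letters in usernames, a Unicode property class is one option. The second pattern below is a hypothetical variant, not part of the original post.

```php
<?php
// Pattern (1b) from above: the classes are ASCII-only, with or without 'u'.
$ascii = '/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/u';

// Hypothetical variant: \p{L} matches any letter, \p{N} any number.
$unicode = '/^\p{L}[\p{L}\p{N}]*(?:_[\p{L}\p{N}]+)*$/u';

$name = "j\xC3\xBCrgen_1"; // "jürgen_1"

$asciiOk = preg_match($ascii, $name);     // 0: "ü" is not in [A-Za-z0-9]
$unicodeOk = preg_match($unicode, $name); // 1: "ü" is a letter (\p{L})

var_dump($asciiOk, $unicodeOk);
```

Whether non-ASCII usernames are desirable at all is a separate design decision; the sketch only shows what each pattern accepts.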

Rundown answered 19/11, 2009 at 22:21 Comment(1)

I believe your email regular expression will allow '..' anywhere in the email address, which is something you need assertions to prevent. – Stainless 21/6, 2016 at 15:0
