PHP regex word boundary matching in UTF-8

I have the following PHP code in a UTF-8 encoded PHP file:

var_dump(setlocale(LC_CTYPE, 'de_DE.utf8', 'German_Germany.utf-8', 'de_DE', 'german'));
var_dump(mb_internal_encoding());
var_dump(mb_internal_encoding('utf-8'));
var_dump(mb_internal_encoding());
var_dump(mb_regex_encoding());
var_dump(mb_regex_encoding('utf-8'));
var_dump(mb_regex_encoding());
var_dump(preg_replace('/\bweiß\b/iu', 'weiss', 'weißbier'));

I would like the last regex to replace only full words and not parts of words.

On my Windows computer, it returns:

string 'German_Germany.1252' (length=19)
string 'ISO-8859-1' (length=10)
boolean true
string 'UTF-8' (length=5)
string 'EUC-JP' (length=6)
boolean true
string 'UTF-8' (length=5)
string 'weißbier' (length=9)

On the web server (Linux), I get:

string(10) "de_DE.utf8"
string(10) "ISO-8859-1"
bool(true)
string(5) "UTF-8"
string(10) "ISO-8859-1"
bool(true)
string(5) "UTF-8"
string(9) "weissbier"

Thus, the regex works as I expected on Windows but not on Linux.

So the main question is: how should I write my regex so that it only matches at word boundaries?

A secondary question is how I can tell Windows that I want to use UTF-8 in my PHP application.

Virgina answered 12/3, 2010 at 13:8 Comment(0)

Even in UTF-8 mode, standard class shorthands like \w and \b are not Unicode-aware. You just have to use the Unicode shorthands, as you worked out, but you can make it a little less ugly by using lookarounds instead of alternations:

/(?<!\pL)weiß(?!\pL)/u

Notice also how I left the curly braces out of the Unicode class shorthands; you can do that when the class name consists of a single letter.
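As a rough sketch of how the lookaround pattern could be wrapped for reuse: the function name replace_whole_word is hypothetical, and the code assumes PHP 7 or later and a source file saved as UTF-8. It runs the same test cases as the question.

```php
<?php
// Hypothetical helper (not from the original answer): replace $word only
// where it is neither preceded nor followed by a Unicode letter.
function replace_whole_word(string $word, string $replacement, string $subject): string
{
    // preg_quote() escapes regex metacharacters in $word, including the "/" delimiter.
    $pattern = '/(?<!\pL)' . preg_quote($word, '/') . '(?!\pL)/iu';

    // preg_replace() returns null on error (for example invalid UTF-8 input);
    // in that case this sketch simply returns the subject unchanged.
    return preg_replace($pattern, $replacement, $subject) ?? $subject;
}

var_dump(replace_whole_word('weiß', 'weiss', 'weißbier'));  // "weißbier" (unchanged)
var_dump(replace_whole_word('weiß', 'weiss', 'weiß'));      // "weiss"
var_dump(replace_whole_word('weiß', 'weiss', 'weiß bier')); // "weiss bier"
var_dump(replace_whole_word('weiß', 'weiss', ' weiß'));     // " weiss"
```

Note that \pL only covers letters, so a digit or underscore next to the word still counts as a boundary here, unlike with \b. The replacement string is passed straight to preg_replace, so sequences such as $1 in it would be interpreted as backreferences.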

Frye answered 15/3, 2010 at 17:12 Comment(2)
+1 - \w and \b appear to work as expected in recent PHP versions, but they're definitely not something you can rely on, since they'll probably break when you deploy your app. - Stated
Also note the accepted answer here: #4782398 if you want to use the Unicode shorthands! - Surveillance

I guess this was related to PHP bug #52971:

PCRE-Meta-Characters like \b \w not working with unicode strings.

and fixed in PHP 5.3.4

PCRE extension: Fixed bug #52971 (PCRE-Meta-Characters not working with utf-8).
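If you are unsure whether a given server is affected, one option is to probe the behaviour at runtime instead of relying on the version number alone. The following is only a sketch: the function name word_boundary_is_unicode_aware is made up for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical probe (not part of the bug report): returns true when \b
// treats "ß" as a word character under the /u modifier.
function word_boundary_is_unicode_aware(): bool
{
    // In "weißbier" the "ß" is directly followed by "b". If "ß" counts as a
    // word character there is no boundary between the two, so the pattern
    // must NOT match. preg_match() returns 0 for "no match" and false on
    // error, so an error is conservatively reported as "not Unicode-aware".
    return preg_match('/\bweiß\b/u', 'weißbier') === 0;
}

var_dump(PHP_VERSION, PCRE_VERSION);
var_dump(word_boundary_is_unicode_aware());
```

On a build that contains the fix I would expect the last line to print bool(true); on an affected build it should print bool(false), in which case explicit Unicode property classes such as \pL are the safer choice.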

Libelee answered 10/12, 2016 at 10:32 Comment(0)

Here is what I have found so far. By rewriting the search and replacement patterns like this:

$before = '(^|[^\p{L}])';
$after = '([^\p{L}]|$)';
var_dump(preg_replace('/'.$before.'weiß'.$after.'/iu', '$1weiss$2', 'weißbier'));
// Test some other cases:
var_dump(preg_replace('/'.$before.'weiß'.$after.'/iu', '$1weiss$2', 'weiß'));
var_dump(preg_replace('/'.$before.'weiß'.$after.'/iu', '$1weiss$2', 'weiß bier'));
var_dump(preg_replace('/'.$before.'weiß'.$after.'/iu', '$1weiss$2', ' weiß'));

I get the wanted result:

string 'weißbier' (length=9)
string 'weiss' (length=5)
string 'weiss bier' (length=10)
string ' weiss' (length=6)

on both my Windows computer and the hosted Linux web server, each running Apache.

I assume there is some better way to do this.
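One limitation of the capturing-group approach shows up when two occurrences are separated by a single character: that character is consumed as $2 of the first match and is no longer available as $1 for the second. The comparison below is a sketch under the assumption that the file is saved as UTF-8; the zero-width variant uses lookarounds, which inspect the neighbouring characters without consuming them.

```php
<?php
$subject = 'weiß weiß';

// Consuming version from above: the space becomes $2 of the first match,
// so the second "weiß" has neither "^" nor a non-letter left in front of it.
$consuming = preg_replace('/(^|[^\p{L}])weiß([^\p{L}]|$)/iu', '$1weiss$2', $subject);

// Zero-width version: the lookbehind and lookahead only test the neighbours.
$lookaround = preg_replace('/(?<!\p{L})weiß(?!\p{L})/iu', 'weiss', $subject);

var_dump($consuming);  // "weiss weiß"  (second occurrence is missed)
var_dump($lookaround); // "weiss weiss"
```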

Also, I would still like to set the locale on my Windows computer to UTF-8 with setlocale.

Virgina answered 12/3, 2010 at 14:37 Comment(0)

According to this comment, that is a bug in PHP. Does using \W instead of \b give any benefit?

Imogeneimojean answered 14/3, 2010 at 14:25 Comment(2)
Yes it was, 10 years ago. - Imogeneimojean
Yes they were. Better now? - Imogeneimojean
