Regex pattern using \w.* not matching text starting with foreign characters such as Ä

I have the following regex that I have been using successfully:

preg_match_all('/(\d+)\n(\w.*)\n(\d{3}\.\d{3}\.\d{2})\n(\d.*)\n(\d.*)/', $text, $matches)

However I have just found that if the text that the (\w.*) part matches starts with a foreign character such as Ä, then it doesn't match anything.

Can anyone help me with what the correct pattern should be instead of (\w.*) to match a string that starts with any character?

Many thanks

Cush answered 15/11, 2011 at 13:47 Comment(2)
Thanks, just tried your code - no, it doesn't work. - Cush
Did you make sure you don't have an encoding issue? You could also try the hexadecimal value of these umlauts. Hint: utf8_encode() - Annisannissa

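If you do want to match umlauts, then add the regex /u modifier, or use \pL in place of \w. That will allow the regex to match letters outside of the ASCII range. For UTF-8 input, combining the two (\pL together with /u) is the safer choice, because /u makes PCRE treat both the pattern and the subject as UTF-8.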

Reference: http://www.regular-expressions.info/unicode.html
and http://php.net/manual/en/regexp.reference.unicode.php
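
To make the difference concrete, here is a minimal sketch that runs the question's original pattern and a \pL variant with /u against the same input. The sample text is made up for illustration, and the variable names are not from the question.

```php
<?php
// Hypothetical sample input in the five-line layout the question's pattern expects.
// "\xC3\x84" is the UTF-8 byte sequence for Ä.
$text = "12\n\xC3\x84pfel und Birnen\n123.456.78\n10 pieces\n5 kg";

// Original pattern: without /u, \w covers only ASCII word characters
// (assuming the script has not switched to a single-byte locale).
$asciiPattern = '/(\d+)\n(\w.*)\n(\d{3}\.\d{3}\.\d{2})\n(\d.*)\n(\d.*)/';

// Variant: \pL matches any Unicode letter; /u treats pattern and subject as UTF-8.
$unicodePattern = '/(\d+)\n(\pL.*)\n(\d{3}\.\d{3}\.\d{2})\n(\d.*)\n(\d.*)/u';

$asciiCount = preg_match_all($asciiPattern, $text, $asciiMatches);
$unicodeCount = preg_match_all($unicodePattern, $text, $unicodeMatches);

echo "ASCII \\w pattern: {$asciiCount} match(es)\n";
echo "\\pL with /u: {$unicodeCount} match(es)\n";
if ($unicodeCount > 0) {
    echo "Second group: {$unicodeMatches[2][0]}\n";
}
```

Note that \pL only accepts letters, so a line starting with a digit or an underscore (which \w allowed) would no longer match; a class such as [\pL\pN_] keeps that behaviour.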

Benis answered 15/11, 2011 at 13:55 Comment(1)
Excellent - using \pL in place of \w works perfectly! Thank you. - Cush

Ä is a German umlaut, if I am not mistaken. \w matches (in most flavors) [a-zA-Z0-9_].

You will need to match the Unicode range of characters that you want.

\x{00C4} (PHP syntax) is the code point for the character you want. You will probably need to create a character class that covers the Unicode characters you need to support.
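
As a rough sketch of such a character class, the snippet below lists the German umlauts and ß by code point. The choice of characters, the sample text, and the variable names are assumptions for illustration. The /u modifier is included so that \x{...} is compared against UTF-8 code points rather than single bytes.

```php
<?php
// Explicit first-character class: ASCII word characters plus Ä Ö Ü ä ö ü ß.
// Extend the list with whatever other code points your data needs.
$first = '[A-Za-z0-9_\x{00C4}\x{00D6}\x{00DC}\x{00E4}\x{00F6}\x{00FC}\x{00DF}]';

// Same structure as the question's pattern, with the class in place of \w.
$pattern = '/(\d+)\n(' . $first . '.*)\n(\d{3}\.\d{3}\.\d{2})\n(\d.*)\n(\d.*)/u';

// Hypothetical sample input; "\xC3\x84" is the UTF-8 byte sequence for Ä.
$text = "7\n\xC3\x84nderung\n111.222.33\n4\n5";

$found = preg_match($pattern, $text, $match);

echo $found === 1 ? "Matched: {$match[2]}\n" : "No match\n";
```

An explicit list like this only covers the characters you enumerate, so it suits data with a known, small alphabet.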

Thermistor answered 15/11, 2011 at 13:52 Comment(0)

You may have to switch to matching Unicode characters explicitly.

For ASCII you would use [\u0021-\u007e]. In this case, maybe [\u0021-\u007e\u0192-\u687].

I'm not quite sure which range of characters you want, but I think \w only matches characters in the normal ASCII range.

Mackinaw answered 15/11, 2011 at 13:55 Comment(1)
The \u syntax is not supported in PHP; instead it is \x{0021}-\x{007e}\x{0192}-\x{687} - Typecast

Consider using:

/(\d+)\n((\p{L}|\p{N}|_).*)\n(\d{3}\.\d{3}\.\d{2})\n(\d.*)\n(\d.*)/
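
One thing to watch with this pattern: the inner (\p{L}|\p{N}|_) is itself a capture group, so the later groups shift by one compared with the question's pattern. The sketch below compares it with a character-class form that keeps five groups. The /u modifier, the sample text, and the variable names are assumptions added for illustration, not part of the answer.

```php
<?php
// Hypothetical sample input; "\xC3\x84" is the UTF-8 byte sequence for Ä.
$text = "12\n\xC3\x84rmel\n123.456.78\n10\n5";

// The pattern above with /u added: the alternation is a sixth capture group.
$withAlternation = '/(\d+)\n((\p{L}|\p{N}|_).*)\n(\d{3}\.\d{3}\.\d{2})\n(\d.*)\n(\d.*)/u';

// Character-class form: same first-character test, five groups as in the question.
$withClass = '/(\d+)\n([\p{L}\p{N}_].*)\n(\d{3}\.\d{3}\.\d{2})\n(\d.*)\n(\d.*)/u';

preg_match($withAlternation, $text, $a);
preg_match($withClass, $text, $b);

// Each array holds the full match at index 0 plus one entry per capture group.
echo count($a) - 1, " groups vs ", count($b) - 1, " groups\n";
echo "Date-like field: ", $a[4], " / ", $b[3], "\n";
```

With the character-class form, existing code that reads $matches[3], $matches[4] and $matches[5] should keep working unchanged.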
Jenaejenda answered 15/11, 2011 at 13:59 Comment(0)
