Regular expression \p{L} and \p{N}

I am new to regular expressions and have been given the following regular expression:

(\p{L}|\p{N}|_|-|\.)*

I know what * means, that | means "or", and that \ escapes.

But I don't know what \p{L} and \p{N} mean. I have searched Google for them, without result...

Can someone help me?

Loveridge answered 15/2, 2013 at 9:1 Comment(0)

\p{L} matches a single code point in the category "letter".
\p{N} matches any kind of numeric character in any script.

Source: regular-expressions.info

If you're going to work with regular expressions a lot, I'd suggest bookmarking that site; it's very useful.
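
To make the two categories concrete, here is a minimal sketch in Python. It rests on an assumption: as far as I know, Python's built-in `re` module does not accept `\p{...}`, so instead of a regex engine the sketch looks up each character's Unicode general category with `unicodedata.category`. The helper names (`is_letter`, `is_number`, `matches_question_regex`) are made up for illustration.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    # Rough equivalent of \p{L}: general category starts with "L"
    # (Lu, Ll, Lt, Lm, Lo). Expects a single code point.
    return unicodedata.category(ch).startswith("L")

def is_number(ch: str) -> bool:
    # Rough equivalent of \p{N}: general category starts with "N"
    # (Nd, Nl, No). Expects a single code point.
    return unicodedata.category(ch).startswith("N")

def matches_question_regex(text: str) -> bool:
    # Emulates a whole-string match of (\p{L}|\p{N}|_|-|\.)*
    return all(is_letter(ch) or is_number(ch) or ch in "_-." for ch in text)

print(is_letter("\u00e9"))   # é, Latin small e with acute
print(is_letter("\u0416"))   # Ж, Cyrillic capital zhe
print(is_number("\u0663"))   # ٣, Arabic-Indic digit three
print(is_number("\u2167"))   # Ⅷ, Roman numeral eight
print(matches_question_regex("10"))
print(matches_question_regex("a b"))  # the space is neither letter nor number
```

Every call above prints True except the last one, which prints False because of the space.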

Narcissus answered 15/2, 2013 at 9:3 Comment(9)
Thanks for the fast answer :). But shouldn't the regex then match 10? I have tried an online regex matcher: regexpal.com - Loveridge
@user1093774: I don't think regexpal supports \p{}, but yes, it should match. - Narcissus
This syntax is specific to modern Unicode regex implementations, which not all interpreters recognize. You can safely replace \p{L} by [a-zA-Z] (ASCII notation) or [\w] (Perl/Vim notation); and \p{N} by [0-9] (ASCII) or [\d] (Perl/Vim). If you want to match all of them, just do: [a-zA-Z0-9]+ or [\w\d]+ - Praetorian
Rafael, I don't agree that you can safely replace \p{L} by [a-zA-Z]. [a-zA-Z], for example, will not match any accented character, such as é, which is used all over French. So these are only safely replaceable if you are sure that you will only be processing English, and nothing else. - Stan
Does it match a code point or a code unit? https://mcmap.net/q/48606/-what-39-s-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme - Thimbleweed
Note: if doing a regex like this in a browser, you need to pass the u flag. https://mcmap.net/q/18897/-how-can-i-use-unicode-aware-regular-expressions-in-javascript - Sandbag
@Narcissus do you happen to know of a regex checker that recognizes '\\p{L}'? I've tried several online checkers and they all fail at that. - Scopula
@Scopula why the double escape \\? - Narcissus
@Narcissus sorry, copied it out of QGIS and didn't pay attention. It probably needs to escape the backslash to send to the interpreter. Let's say I only sent one ;) - Scopula
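
The disagreement in the comments above about swapping \p{L} for an ASCII range can be checked with a small Python sketch. It assumes the standard library only: the ASCII range goes through `re`, while the Unicode side is a hypothetical helper (`all_unicode_letters`) built on `unicodedata.category`, because the built-in `re` module does not, to my knowledge, accept `\p{L}`.

```python
import re
import unicodedata

# ASCII-only replacement suggested in the comments.
ASCII_LETTERS = re.compile(r"[a-zA-Z]+")

def all_unicode_letters(text: str) -> bool:
    # Stand-in for a whole-string match of \p{L}+ :
    # non-empty, and every code point has a general category starting with "L".
    return len(text) > 0 and all(
        unicodedata.category(ch).startswith("L") for ch in text
    )

# "cafe", "café" and the Cyrillic word "мир", written with escapes
# so the code points are unambiguous.
for word in ["cafe", "caf\u00e9", "\u043c\u0438\u0440"]:
    ascii_ok = ASCII_LETTERS.fullmatch(word) is not None
    print(word, ascii_ok, all_unicode_letters(word))
```

Only "cafe" passes the ASCII range; all three words pass the Unicode letter check, which is the point made about accented characters.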

These are Unicode property shortcuts (\p{L} for Unicode letters, \p{N} for Unicode numeric characters). They are supported by .NET, Perl, Java, PCRE, XML, XPath, JGSoft, Ruby (1.9 and higher) and PHP (since 5.1.0).

At any rate, that's a very strange regex. You should not be using alternation when a character class would suffice:

[\p{L}\p{N}_.-]*
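
A quick way to convince yourself that the alternation and the character class accept the same strings is to run both over a few samples. This is only a sketch in Python, under one loud assumption: the built-in `re` module does not, as far as I know, accept `\p{...}`, so Python's Unicode-aware `\w` (letters, digits and underscore) is used as a rough stand-in for `[\p{L}\p{N}_]`. The names `ALTERNATION` and `CHAR_CLASS` are made up for the example.

```python
import re

# Assumption: \w stands in for [\p{L}\p{N}_]; it is close, not identical.
ALTERNATION = re.compile(r"(\w|-|\.)*")   # shape of the original regex
CHAR_CLASS = re.compile(r"[\w.-]*")       # shape of the suggested rewrite

samples = ["10", "file_name-1.txt", "na\u00efve", "a b", "x/y", ""]
for s in samples:
    alt_ok = ALTERNATION.fullmatch(s) is not None
    cls_ok = CHAR_CLASS.fullmatch(s) is not None
    print(repr(s), alt_ok, cls_ok)
```

Both patterns agree on every sample: the strings containing a space or a slash are rejected and the rest are accepted. The alternation form additionally creates a capturing group, which the character class avoids.
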
Testerman answered 15/2, 2013 at 9:6 Comment(6)
It's a regex in XML; I have not constructed the regex myself :) - Loveridge
Apart from the fact that capturing parentheses were used, the REs will actually compile to the same thing (well, in any optimizing RE engine that supports the \p{…} escape sequence style in the first place). - Sooth
That looks like the XRegExp Unicode plugin, which, if so, would be any alphanumeric in any language. - Chaisson
Thanks, listing the supporting languages was useful; I was unaware there were limitations there (most regex-y things being "universal"). - Tortuosity
@HoldOffHunger: Far from it, unfortunately. That's why there is a market for tools like RegexBuddy. Take a look at regular-expressions.info/refbasic.html, you'll be amazed at the subtle and not-so-subtle differences between regex flavors... - Testerman
@TimPietzcker According to www.regular-expressions.info, "The PHP preg functions ... support Unicode when the /u option is appended to the regular expression." Therefore /u would need to be at the end, would it not? - Infield
