How do I convert the regular expression \w+ to give me whole words in Unicode, not just ASCII?
I use .NET.
In .NET, \w matches Unicode letters and digits by default (plus the underscore and other connector punctuation). For example, it matches ì and Æ.
To match only ASCII characters, you could use [a-zA-Z0-9] (add _ to the class if you also want the underscore, as \w does).
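A minimal sketch of the difference, for illustration only. The class name WordClassDemo, the helper Words, and the sample text are made up for this example; RegexOptions.ECMAScript is the real .NET option that restricts \w to [a-zA-Z_0-9].

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordClassDemo
{
    // Hypothetical helper: returns every match of the pattern, in order.
    public static string[] Words(string input, string pattern,
                                 RegexOptions options = RegexOptions.None)
    {
        return Regex.Matches(input, pattern, options)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        const string text = "niña Æther x_1";

        // Default behaviour: \w is Unicode-aware.
        Console.WriteLine(string.Join("|", Words(text, @"\w+")));
        // prints niña|Æther|x_1

        // ECMAScript mode: \w only covers ASCII letters, digits and underscore.
        Console.WriteLine(string.Join("|", Words(text, @"\w+", RegexOptions.ECMAScript)));
        // prints ni|a|ther|x_1

        // Explicit ASCII class without the underscore.
        Console.WriteLine(string.Join("|", Words(text, @"[a-zA-Z0-9]+")));
        // prints ni|a|ther|x|1
    }
}
```

Note how the ASCII-only variants split niña at the ñ, which is the symptom to look for if \w seems to ignore Unicode.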
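This works as expected for me:
using System;
using System.Text.RegularExpressions;

string foo = "Hola, la niña está gritando en alemán: Maüschen raus!";
// In .NET, \w is Unicode-aware by default, so accented letters count as word characters.
Regex r = new Regex(@"\w+");
// Matches() returns every match; Match() would return only the first one.
MatchCollection mc = r.Matches(foo);
foreach (Match ma in mc)
{
    Console.WriteLine(ma.Value);
}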
It outputs:
Hola
la
niña
está
gritando
en
alemán
Maüschen
raus
Are you using .Match() instead of .Matches()?
Another possible explanation is that the text you expect to receive contains a non-word character, like a comma.
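To illustrate the .Match() versus .Matches() question, here is a small sketch. The class name MatchVsMatches, its two helper methods, and the sample string are invented for this example; Regex.Match and Regex.Matches are the standard .NET methods.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchVsMatches
{
    // Hypothetical helper: Match() stops after the first word it finds.
    public static string FirstWord(string input)
    {
        return Regex.Match(input, @"\w+").Value;
    }

    // Hypothetical helper: Matches() collects every word in the input.
    public static int WordCount(string input)
    {
        return Regex.Matches(input, @"\w+").Count;
    }

    public static void Main()
    {
        const string text = "la niña, está";

        Console.WriteLine(FirstWord(text));  // prints la
        Console.WriteLine(WordCount(text));  // prints 3
    }
}
```

The comma is not a word character, so it ends the match for niña rather than being included in it.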
You should take a look at http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#ECMAScript
There's also a nice cheat sheet for using regex in .NET: http://regexlib.com/CheatSheet.aspx
The "official" Unicode identifier for letters is \p{L}, for numbers \p{N}. So for completeness' sake, in cases where \w doesn't extend to Unicode letters/numbers, the equivalent for \w+ would be [\p{L}\p{N}\p{Pc}]+. Don't forget that the underscore and other "punctuation connector" characters are also contained in \w (so you can decide yourself whether to keep them or not).
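As a rough sketch, the explicit class can be compared with \w+ on a sample string. The class name ExplicitWordClass, the helper Words, and the sample text are assumptions made for this example. Microsoft's documentation describes \w in .NET as equivalent to [\p{L}\p{Mn}\p{Nd}\p{Pc}], so the two patterns may differ on combining marks and on numbers that are not decimal digits; the sample below avoids those cases.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ExplicitWordClass
{
    // The explicit Unicode class from the answer above.
    public const string Pattern = @"[\p{L}\p{N}\p{Pc}]+";

    // Hypothetical helper: returns every match of the pattern, in order.
    public static string[] Words(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        const string text = "niña foo_bar está42!";

        string[] explicitWords = Words(text, Pattern);
        string[] shorthandWords = Words(text, @"\w+");

        Console.WriteLine(string.Join("|", explicitWords));
        // prints niña|foo_bar|está42

        // On this sample both patterns find the same words.
        Console.WriteLine(explicitWords.SequenceEqual(shorthandWords));
        // prints True
    }
}
```

Dropping \p{Pc} from the class would split foo_bar into foo and bar, which is the choice the answer leaves to you.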
\w includes not just the underscore, but the entire \p{Pc} punctuation connector category :) –
Coulometer