Regex word-breaker in Unicode

Asked 25/11, 2009 at 12:22 Answered 25/11, 2009 at 12:32

.net regex unicode character-properties

How do I change the regular expression \w+ so that it gives me whole words in Unicode, not just ASCII?

I am using .NET.

Acclaim answered 25/11, 2009 at 12:22 Comment(1)

which language? thai? :D – Antiquity 25/11, 2009 at 13:20

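In .NET, \w already matches Unicode letters and decimal digits by default (plus the underscore and other connector punctuation), not just ASCII. For example, it would match ì and Æ.

To match only ASCII characters, you could use [a-zA-Z0-9] (add _ to the class if you also want the underscore, which \w includes).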
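To make the difference concrete, here is a minimal sketch comparing the default \w+ with the ASCII-only class on the same input. The class name `WordClassDemo` and the helper `Words` are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordClassDemo
{
    // Hypothetical helper: returns every match of `pattern` in `input`.
    public static string[] Words(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // "niña Æon abc123", written with escapes so the source file
        // encoding does not affect the result.
        string text = "ni\u00F1a \u00C6on abc123";

        // Default \w is Unicode-aware: the three words stay whole.
        Console.WriteLine(Words(text, @"\w+").Length);          // 3

        // The ASCII-only class breaks at ñ and Æ: "ni", "a", "on", "abc123".
        Console.WriteLine(Words(text, "[a-zA-Z0-9]+").Length);  // 4
    }
}
```

If the output shows words being split at accented letters, the pattern in use is probably an ASCII-only class (or an ASCII-only mode) rather than plain \w.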

Coulometer answered 25/11, 2009 at 12:27 Comment(0)

This works as expected for me:

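        // Requires: using System; using System.Text.RegularExpressions;
        string foo = "Hola, la niña está gritando en alemán: Maüschen raus!";
        Regex r = new Regex(@"\w+");           // \w is Unicode-aware by default in .NET
        MatchCollection mc = r.Matches(foo);   // Matches() returns every match, not only the first
        foreach (Match ma in mc)
        {
            Console.WriteLine(ma.Value);       // one word per line
        }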

It outputs:

Hola
la
niña
está
gritando
en
alemán
Maüschen
raus

Are you using .Match() instead of .Matches()?

Another possible explanation is that the text you expect to get back contains a non-word character, such as a comma, which \w does not match.
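Both suspicions can be checked quickly. The following is a small sketch, not taken from the asker's code; the class name `MatchVsMatchesDemo` and its helper methods are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchVsMatchesDemo
{
    private static readonly Regex Words = new Regex(@"\w+");

    // Match() stops at the first hit.
    public static string FirstWord(string input)
    {
        return Words.Match(input).Value;
    }

    // Matches() collects every hit.
    public static int WordCount(string input)
    {
        return Words.Matches(input).Count;
    }

    public static void Main()
    {
        // "Hola, la niña", with the ñ written as an escape.
        string text = "Hola, la ni\u00F1a";

        Console.WriteLine(FirstWord(text));   // Hola
        Console.WriteLine(WordCount(text));   // 3

        // The comma and spaces are non-word characters, so \w+ can never
        // cover the whole string in a single match.
        Console.WriteLine(Regex.IsMatch(text, @"^\w+$"));   // False
    }
}
```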

Parapodium answered 25/11, 2009 at 12:28 Comment(0)

You should take a look at http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#ECMAScript
There's also a nice cheat sheet for using regex in .NET: http://regexlib.com/CheatSheet.aspx
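The ECMAScript anchor in the first link matters here because `RegexOptions.ECMAScript` restricts \w to ASCII, which would produce exactly the symptom in the question. A minimal sketch of the effect, assuming that option is the cause (the class name `EcmaScriptDemo` and the helper `CountWords` are hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaScriptDemo
{
    // Hypothetical helper: counts \w+ matches under the given options.
    public static int CountWords(string input, RegexOptions options)
    {
        return Regex.Matches(input, @"\w+", options).Count;
    }

    public static void Main()
    {
        // "alemán", with the á written as an escape.
        string text = "alem\u00E1n";

        // Default behavior: á is a word character, so there is one word.
        Console.WriteLine(CountWords(text, RegexOptions.None));        // 1

        // ECMAScript mode: \w means [a-zA-Z_0-9], so á splits the word
        // into "alem" and "n".
        Console.WriteLine(CountWords(text, RegexOptions.ECMAScript));  // 2
    }
}
```

If the regex is being built with this option, removing it should be enough to get Unicode-aware \w back.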

Raker answered 25/11, 2009 at 12:27 Comment(0)

The "official" Unicode identifier for letters is \p{L}, for numbers \p{N}. So for completeness' sake, in cases where \w doesn't extend to Unicode letters/numbers, the equivalent for \w+ would be [\p{L}\p{N}\p{Pc}]+. Don't forget that the underscore and other "punctuation connector" characters are also contained in \w (so you can decide yourself whether to keep them or not).
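As a rough illustration, the explicit class can be tried next to \w in .NET. The class name `ExplicitClassDemo` and the helper `FirstMatchLength` are hypothetical. One caveat: the .NET documentation describes \w as [\p{L}\p{Mn}\p{Nd}\p{Pc}], so the two patterns agree on ordinary words but differ at the edges. \p{N} also covers non-decimal numbers such as ½, and \w also covers non-spacing combining marks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExplicitClassDemo
{
    // The explicit class from the answer above.
    public const string Explicit = @"[\p{L}\p{N}\p{Pc}]+";

    // Hypothetical helper: length of the first match of `pattern` in `input`.
    public static int FirstMatchLength(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value.Length;
    }

    public static void Main()
    {
        // "café_1": letters, an underscore (connector punctuation) and a digit.
        string word = "caf\u00E9_1";
        Console.WriteLine(FirstMatchLength(word, Explicit));   // 6
        Console.WriteLine(FirstMatchLength(word, @"\w+"));     // 6

        // "x½": U+00BD is in category No, which \p{N} includes
        // but \w (decimal digits only) does not.
        string half = "x\u00BD";
        Console.WriteLine(FirstMatchLength(half, Explicit));   // 2
        Console.WriteLine(FirstMatchLength(half, @"\w+"));     // 1
    }
}
```

Dropping \p{Pc} from the class is the way to exclude the underscore and the other connector characters.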

Walrus answered 25/11, 2009 at 12:32 Comment(1)

For further completeness, \w includes not just underscore, but the entire \p{Pc} punctuation connector category :) – Coulometer 25/11, 2009 at 12:39
