Regex ignore underscores

Asked 30/3, 2011 at 23:52 Answered 31/3, 2011 at 1:44

I have a regex ([-@.\/,':\w]*[\w])* that matches all words within a text (including punctuated words like I.B.M), but I want it to exclude underscores and I can't figure out how to do it. I tried adding ^[_] (e.g. (^[_][-@.\/,':\w]*[\w])*), but that just breaks all the words up into letters. I want to preserve the word matching, but I don't want to match words that contain underscores, nor words made up entirely of underscores.

What's the proper way to do this?

P.S.

My app is written in C# (if that makes any difference).
I can't use A-Za-z0-9 because I have to match words regardless of the language (could be Chinese, Russian, Japanese, German, English).

Update
Here is an example:

"I.B.M should be parsed as one word w_o_r_d! Russian should work too: мплекс исторических событий."

The matches should be:

I.B.M  
should  
be  
parsed  
as  
one  
word  
Russian  
should  
work  
too  
мплекс  
исторических  
событий

Note that w_o_r_d should not get matched.

Charlettecharley answered 30/3, 2011 at 23:52 Comment(2)

^[_] should be [^_]. The former will match a _ at the beginning of the string (or line if multiline). – Spline 30/3, 2011 at 23:56

@climbage, that definitely helped ignore underscores, but the underscores in the words still remain. – Charlettecharley 31/3, 2011 at 0:7

Try this instead:

([-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*

In .NET, the \w class is equivalent to [\p{L}\p{Mn}\p{Nd}\p{Pc}] when you're performing Unicode matching. (Or simply [a-zA-Z_0-9] if you're doing non-Unicode matching with RegexOptions.ECMAScript.)

It's the \p{Pc} Unicode category -- punctuation/connector -- that causes the problem by matching underscores, so we explicitly match against the other categories without including that one.

(Further information here, "Character Classes: Word Character", and here, "Character Classes: Supported Unicode General Categories".)
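To see what the pattern above actually does with the sample text from the question, here is a minimal C# sketch. The class and method names (WordMatcher, Tokens, TokensSkippingUnderscoreWords) are made up for illustration. The second method is not part of this answer: it is an assumed alternative that tokenizes with \w and then throws away any token containing an underscore, for the case where w_o_r_d must disappear entirely rather than be split.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class WordMatcher
{
    // The pattern from this answer: letters and decimal digits, no \p{Pc}.
    private static readonly Regex NoConnector =
        new Regex(@"([-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*");

    // The question's character classes, one token per match (\w still allows '_').
    private static readonly Regex WithConnector =
        new Regex(@"[-@.\/,':\w]*\w");

    // Non-empty matches of the answer's pattern. The outer (...)* can match
    // the empty string, so zero-length matches are dropped.
    public static List<string> Tokens(string text)
    {
        return NoConnector.Matches(text).Cast<Match>()
            .Where(m => m.Length > 0)
            .Select(m => m.Value)
            .ToList();
    }

    // Assumption (not from the answer): tokenize with \w first, then discard
    // every token that contains an underscore.
    public static List<string> TokensSkippingUnderscoreWords(string text)
    {
        return WithConnector.Matches(text).Cast<Match>()
            .Select(m => m.Value)
            .Where(t => t.IndexOf('_') < 0)
            .ToList();
    }

    public static void Main()
    {
        string text = "I.B.M should be parsed as one word w_o_r_d! " +
                      "Russian should work too: мплекс исторических событий.";
        Console.WriteLine(string.Join(" | ", Tokens(text)));
        Console.WriteLine(string.Join(" | ", TokensSkippingUnderscoreWords(text)));
    }
}
```

With the answer's pattern, the underscores themselves are never part of a match, but w_o_r_d is still picked up as the separate one-letter tokens w, o, r, d. The post-filtering variant drops that token as a whole, at the cost of a second pass over the matches.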

Tarrel answered 31/3, 2011 at 0:33 Comment(1)

\p{L} is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}] which would significantly reduce your regex. See Supported Unicode General Categories – Voltmer 31/3, 2011 at 1:39

The underscore comes from \w.

Simply use A-Za-z0-9 instead.

Joesphjoete answered 30/3, 2011 at 23:57 Comment(1)

Hey sidyll, thanks for the info, but unfortunately I can't use A-Za-z0-9 because I have to match words regardless of the language (could be Chinese, Russian, Japanese, German, English). – Charlettecharley 31/3, 2011 at 0:0

For a more concise version of LukeH's regex, you can simply use:

([-@.\/,':\p{L}]*\p{L})*

I simply used \p{L} instead of Lu, Ll, Lt, Lo, Lm. See Supported Unicode General Categories.
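One thing worth checking before adopting the shorter pattern: it contains only \p{L}, so decimal digits are no longer part of a word. The sketch below compares it with a variant that also includes \p{Nd}. The class name ConciseDemo, the helper Tokens, and the sample string are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class ConciseDemo
{
    // Non-empty matches of a pattern (the outer (...)* also yields empty matches).
    public static string[] Tokens(string pattern, string text)
    {
        return Regex.Matches(text, pattern).Cast<Match>()
            .Where(m => m.Length > 0)
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        const string lettersOnly = @"([-@.\/,':\p{L}]*\p{L})*";
        const string lettersAndDigits = @"([-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*";
        const string text = "Windows7 and I.B.M";

        // Letters only: the trailing digit is not part of the word.
        Console.WriteLine(string.Join(" | ", Tokens(lettersOnly, text)));
        // prints: Windows | and | I.B.M

        // Letters and decimal digits: the digit stays attached.
        Console.WriteLine(string.Join(" | ", Tokens(lettersAndDigits, text)));
        // prints: Windows7 | and | I.B.M
    }
}
```

If words containing digits matter for your text, adding \p{Nd} to both character classes keeps them whole.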

Voltmer answered 31/3, 2011 at 1:44 Comment(0)
