Regex ignore underscores

I have a regex ([-@.\/,':\w]*[\w])* and it matches all words within a text (including punctuated words like I.B.M), but I want to make it exclude underscores and I can't seem to figure out how to do it... I tried adding ^[_] (e.g. (^[_][-@.\/,':\w]*[\w])*) but it just breaks up all the words into letters. I want to preserve the word matching, but I don't want to have words with underscores in them, nor words that are entirely made up of underscores.

What's the proper way to do this?

P.S.

  • My app is written in C# (if that makes any difference).
  • I can't use A-Za-z0-9 because I have to match words regardless of the language (could be Chinese, Russian, Japanese, German, English).

Update
Here is an example:

"I.B.M should be parsed as one word w_o_r_d! Russian should work too: мплекс исторических событий."

The matches should be:

I.B.M  
should  
be  
parsed  
as  
one  
word  
Russian  
should  
work  
too  
мплекс  
исторических  
событий  

Note that w_o_r_d should not get matched.

Charlettecharley answered 30/3, 2011 at 23:52 Comment(2)
^[_] should be [^_]. The former will match a _ at the beginning of the string (or line if multiline). Spline
@climbage, that definitely helped ignore underscores, but the underscores in the words still remain. Charlettecharley
T
6

Try this instead:

([-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*

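In .NET, the \w class is equivalent to [\p{L}\p{Mn}\p{Nd}\p{Pc}] when you're performing Unicode matching. (Or simply [a-zA-Z_0-9] if you're doing ECMAScript-compliant matching with RegexOptions.ECMAScript.) Note that the pattern above also leaves out \p{Mn} (nonspacing marks); add it to both classes if combining marks must stay attached to their words.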

It's the \p{Pc} Unicode category -- punctuation/connector -- that causes the problem by matching underscores, so we explicitly match against the other categories without including that one.

(For further information, see "Character Classes: Word Character" and "Character Classes: Supported Unicode General Categories" in Microsoft's .NET regular expression documentation.)
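To try this from C#, here is a minimal sketch rather than a definitive implementation. It uses the character classes from the answer but makes two changes that are assumptions of this sketch, not part of the answer. First, the outer ( ... )* group is dropped, because it lets the pattern match the empty string at every position. Second, (?<!\w) and (?!\w) lookarounds are added so that a candidate touching an underscore (such as the letters inside w_o_r_d) is rejected instead of being split into single-letter matches. The names WordPattern and ExtractWords are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Classes from the answer: letters and decimal digits, without \p{Pc} (the underscore).
// Assumption of this sketch: the lookarounds reject any candidate that is directly
// preceded or followed by a \w character, which in .NET includes the underscore.
const string WordPattern = @"(?<!\w)[-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}](?!\w)";

// Hypothetical helper: returns every match as a string, in order of appearance.
string[] ExtractWords(string text) =>
    Regex.Matches(text, WordPattern).Cast<Match>().Select(m => m.Value).ToArray();

string sample = "I.B.M should be parsed as one word w_o_r_d! " +
                "Russian should work too: мплекс исторических событий.";

foreach (string word in ExtractWords(sample))
{
    Console.WriteLine(word);
}
```

On the sample sentence from the question this should list the fourteen expected words and skip w_o_r_d entirely. One known limitation: a token such as foo.bar_baz would still yield foo, because the dot is an allowed separator and the lookahead only inspects the character right after the match.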

Tarrel answered 31/3, 2011 at 0:33 Comment(1)
\p{L} is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}] which would significantly reduce your regex. See Supported Unicode General Categories. Voltmer
J
2

The underscore comes from \w.

Simply use A-Za-z0-9 instead.

Joesphjoete answered 30/3, 2011 at 23:57 Comment(1)
Hey sidyll, thanks for the info, but unfortunately I can't use A-Za-z0-9 because I have to match words regardless of the language (could be Chinese, Russian, Japanese, German, English). Charlettecharley
V
1

For a more concise version of LukeH's regex, you can simply use:

([-@.\/,':\p{L}]*\p{L})*

I simply used \p{L} instead of Lu, Ll, Lt, Lo, Lm. See Supported Unicode General Categories
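One thing to check before adopting the shorter pattern: it contains no \p{Nd}, so decimal digits are no longer part of a word. The sketch below compares the two variants on a made-up input. The helper FindAll and the constant names are hypothetical, and both patterns are written without the outer ( ... )* group so that no empty matches show up.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Variant with letters and decimal digits, and the shorter letters-only variant.
const string LettersAndDigits = @"[-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}]";
const string LettersOnly = @"[-@.\/,':\p{L}]*\p{L}";

// Hypothetical helper: returns every match of the pattern as a string.
string[] FindAll(string text, string pattern) =>
    Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value).ToArray();

string input = "R2D2 room 101";

// With \p{Nd}, tokens that contain digits stay whole.
Console.WriteLine(string.Join(" | ", FindAll(input, LettersAndDigits)));

// Without \p{Nd}, digits act as separators and purely numeric tokens disappear.
Console.WriteLine(string.Join(" | ", FindAll(input, LettersOnly)));
```

If numbers or alphanumeric identifiers count as words in your text, keep \p{Nd} in both character classes.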

Voltmer answered 31/3, 2011 at 1:44 Comment(0)
