C# - Regex Match whole words
J

6

5

I need to match all the whole words containing a given string.

string s = "ABC.MYTESTING
XYZ.YOUTESTED
ANY.TESTING";

Regex r = new Regex("(?<TM>[!\..]*TEST.*)", ...);
MatchCollection mc = r.Matches(s);

I need the result to be:

MYTESTING
YOUTESTED
TESTING

But I get:

TESTING
TESTED
.TESTING

How do I achieve this with regular expressions?

Edit: Extended sample string.

Jovian answered 17/4, 2011 at 6:45 Comment(0)
G
4

If you were looking for all words including 'TEST', you should use

@"(?<TM>\w*TEST\w*)"

\w matches word characters. With RegexOptions.ECMAScript it is equivalent to [A-Za-z0-9_]; by default in .NET it also matches Unicode letters and digits.
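As a rough check of this pattern against the sample from the question, here is a minimal sketch. The class name WholeWordDemo and the helper FindWords are made up for illustration; only Regex.Matches and the named group TM come from the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WholeWordDemo
{
    // Hypothetical helper: returns every run of word characters that contains "TEST".
    public static List<string> FindWords(string input)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(input, @"(?<TM>\w*TEST\w*)"))
        {
            // Read the named group used in the question's pattern.
            words.Add(m.Groups["TM"].Value);
        }
        return words;
    }

    public static void Main()
    {
        // Sample input from the question, with explicit \n line breaks.
        string s = "ABC.MYTESTING\nXYZ.YOUTESTED\nANY.TESTING";
        foreach (string w in FindWords(s))
        {
            Console.WriteLine(w); // MYTESTING, YOUTESTED, TESTING (one per line)
        }
    }
}
```

The dot is not a word character, so \w* stops there and the prefix (ABC, XYZ, ANY) is left out of each match.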

Genie answered 17/4, 2011 at 6:53 Comment(2)
@tvr: be aware \w matches [0-9a-zA-Z_]. If you don't want numbers or underscores, stick with \b. – Damages
@Brad: It matches a lot more than that, but the important thing is that it doesn't match non-word characters. – Frydman
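To illustrate the point about \w matching more than ASCII, here is a small sketch. The class WordClassDemo and the helper IsAllWordChars are hypothetical names; the behaviour shown relies on .NET's default Unicode-aware \w and on RegexOptions.ECMAScript, which restricts \w to [a-zA-Z_0-9].

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordClassDemo
{
    // Hypothetical helper: true when the whole input consists of \w characters.
    public static bool IsAllWordChars(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, @"^\w+$", options);
    }

    public static void Main()
    {
        // "TEST" followed by U+00C9 (Latin capital E with acute).
        string accented = "TEST\u00C9";

        // Default .NET behaviour: \w covers Unicode letters.
        Console.WriteLine(IsAllWordChars(accented, RegexOptions.None));       // True

        // ECMAScript mode: \w is limited to ASCII letters, digits and underscore.
        Console.WriteLine(IsAllWordChars(accented, RegexOptions.ECMAScript)); // False
    }
}
```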
S
2

Keep it simple: why not just try \w*TEST\w* as the match pattern?

Satiated answered 17/4, 2011 at 7:9 Comment(0)
T
2

I get the results you are expecting with the following:

string s = @"ABC.MYTESTING
XYZ.YOUTESTED
ANY.TESTING";

var m = Regex.Matches(s, @"(\w*TEST\w*)", RegexOptions.IgnoreCase);
Thallic answered 17/4, 2011 at 7:10 Comment(3)
+1 for verbatim strings and a (probably) correct regex, but RegexOptions.Multiline serves no purpose here. – Frydman
@alan Right you are, and now removed. That snuck in from my LINQPad script. – Thallic
Yeah, RegexBuddy always sneaks that in, too. Very annoying. – Frydman
D
1

Try using \b. It's the regex anchor for a word boundary, i.e. the position between a word character and a non-word character. If you wanted to match both words you could use:

/\b[a-z]+\b/i

BTW, .NET doesn't need the surrounding /, and the i is just a case-insensitive match flag.

.NET Alternative:

var re = new Regex(@"\b[a-z]+\b", RegexOptions.IgnoreCase);
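A quick sketch of what this pattern does on one line of the question's sample. The class WordBoundaryDemo and the helper AllWords are illustrative names only. Note that it returns every letters-only word, not just those containing TEST, so it would still need a TEST term to answer the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Hypothetical helper: collects every letters-only word delimited by word boundaries.
    public static List<string> AllWords(string input)
    {
        var re = new Regex(@"\b[a-z]+\b", RegexOptions.IgnoreCase);
        var words = new List<string>();
        foreach (Match m in re.Matches(input))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        // The dot is a non-word character, so it separates two words.
        foreach (string w in AllWords("ABC.MYTESTING"))
        {
            Console.WriteLine(w); // ABC, then MYTESTING
        }
    }
}
```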
Damages answered 17/4, 2011 at 6:50 Comment(7)
This matches a 1-letter word, not both words. – Satiated
Hmm. How do I specify that? I tried this, but it doesn't work: Regex r = new Regex("\b(?<TM>[!\..]*TEST.*)\b", ...); – Jovian
@mousino: Indeed, I did miss a quantifier, but it will match both words. – Damages
@tvr: Also, if you want only words starting with "TEST", use \btest[a-z]+\b, e.g. ideone.com/8KNQz – Damages
@Brad Thanks for the sample code. This is a small part of a larger regular expression and I cannot change it now. – Jovian
+1 for pointing me to an online mini IDE and debugging tool – and your first sentence was the best answer to the OP's original question – Satiated
@mousio: Sometimes less is more. ;-) – Damages
R
0

Using Groups I think you can achieve it.

        string s = @"ABC.TESTING
        XYZ.TESTED";
        Regex r = new Regex(@"(?<TM>[!\..]*(?<test>TEST.*))", RegexOptions.Multiline);
        var mc= r.Matches(s);
        foreach (Match match in mc)
        {
            Console.WriteLine(match.Groups["test"]);
        }

Works exactly like you want.

BTW, your regular expression pattern should be a verbatim string (@"").

Reprography answered 17/4, 2011 at 6:50 Comment(3)
The Multiline option isn't needed here, but IgnoreCase might be. And regarding [!\..]*, see my answer. – Frydman
Yes, but I was just going with the pattern provided by the OP. The other patterns provided are better. – Reprography
Take it from me: it's never a good idea to use regexes from a question without validating them. Or any code, for that matter. Or from other answers. I've been burned that way too many times. :-/ – Frydman
F
0
Regex r = new Regex(@"(?<TM>[^.]*TEST.*)", RegexOptions.IgnoreCase);

First, as @manojlds said, you should use verbatim strings for regexes whenever possible. Otherwise you'll have to use two backslashes in most of your regex escape sequences, not just one (e.g. [!\\..]*).

Second, if you want to match anything but a dot, that part of the regex should be [^.]*. ^ is the metacharacter that inverts the character class, not !, and . has no special meaning in that context, so it doesn't need to be escaped. But you should probably use \w* instead, or even [A-Z]*, depending on what exactly you mean by "word". As written, [!\..] matches only a literal ! or a literal dot.
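The difference between the two character classes can be seen with a small sketch on a single line of the sample input. The class CharClassDemo and the helper FirstMatch are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Hypothetical helper: returns the text of the first match, or null if there is none.
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string line = "ABC.MYTESTING";

        // [!\..] only accepts '!' or '.', so the letters "MY" cannot be part of the match.
        Console.WriteLine(FirstMatch(@"[!\..]*TEST.*", line)); // TESTING

        // [^.] accepts anything except '.', so the match extends back to the dot.
        Console.WriteLine(FirstMatch(@"[^.]*TEST.*", line));   // MYTESTING
    }
}
```

One caveat with the negated class: [^.] also matches line breaks, which is another reason a narrower class such as \w or [A-Z] may be preferable.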

Regex r = new Regex(@"(?<TM>[A-Z]*TEST[A-Z]*)", RegexOptions.IgnoreCase);

That way you don't need to bother with word boundaries, though they don't hurt:

Regex r = new Regex(@"(?<TM>\b[A-Z]*TEST[A-Z]*\b)", RegexOptions.IgnoreCase);

Finally, if you're always taking the whole match anyway, you don't need to use a capturing group:

Regex r = new Regex(@"\b[A-Z]*TEST[A-Z]*\b", RegexOptions.IgnoreCase);

The matched text will be available via Match's Value property.
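For completeness, a minimal sketch of reading the matches through Value with the group-free pattern. The class MatchValueDemo and the helper WordsContainingTest are illustrative names, and the input string is made up to show the effect of IgnoreCase.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchValueDemo
{
    // Hypothetical helper: returns each whole-word match via Match.Value (no capturing group needed).
    public static List<string> WordsContainingTest(string input)
    {
        var r = new Regex(@"\b[A-Z]*TEST[A-Z]*\b", RegexOptions.IgnoreCase);
        var words = new List<string>();
        foreach (Match m in r.Matches(input))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        // Mixed-case input to show that IgnoreCase applies to both TEST and [A-Z].
        foreach (string w in WordsContainingTest("ABC.MYTESTING xyz.youtested"))
        {
            Console.WriteLine(w); // MYTESTING, then youtested
        }
    }
}
```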

Frydman answered 17/4, 2011 at 7:39 Comment(0)
