Regex is behaving lazy, should be greedy
Asked Answered
P

3

12

I thought that by default my Regex would exhibit the greedy behavior that I want, but it is not in the following code:

 Regex keywords = new Regex(@"in|int|into|internal|interface");
 var targets = keywords.ToString().Split('|');
 foreach (string t in targets)
    {
    Match match = keywords.Match(t);
    Console.WriteLine("Matched {0,-9} with {1}", t, match.Value);
    }

Output:

Matched in        with in
Matched int       with in
Matched into      with in
Matched internal  with in
Matched interface with in

Now I realize that I could get it to work for this small example if I simply sorted the keywords by length descending, but

  • I want to understand why this isn't working as expected, and
  • the actual project I am working on has many more words in the Regex and it is important to keep them in alphabetical order.

So my question is: Why is this being lazy and how do I fix it?

Procurable answered 7/3, 2010 at 2:23 Comment(4)
I am not sure if your actual usage is more complicated, but if the above example is actually what you're doing I think you would be a thousand times better off looping over your list of words looking for matches with the IndexOf method. If the regex simply contains a bunch of words in an alternation, performance will probably suck.Headmost
@Headmost - No, the example is simplified. The actual app is reading language files to generate lexers and grammar parsers. I am just a bit rusty on my regex's; my problem seems so obvious now!Procurable
@Josh: Regex engines can do a lot of optimizations for such cases, including discarding many checks after failing to match a common prefix. E.g., if the first character is not "i", none of the branches beginning with "i" would be checked. Not sure if the .NET engine does this, but I would be surprised if it didn't.Infinitive
@Max, it does build state transitions to speed up its matching. If .Net compares well to other long established and well refined regex engines is a matter of debate from what I've gathered. But it does indeed perform better than IndexOf. (I've run both in loops at work to prove why coworkers should use regex instead of IndexOf... depending on what's being matched you can get orders of magnitiude speed increase.)Photooffset
I
13

Laziness and greediness applies to quantifiers only (?, *, +, {min,max}). Alternations always match in order and try the first possible match.

Infinitive answered 7/3, 2010 at 2:29 Comment(2)
No options other than re-ordering? Hrmmm...I guess I could re-order it on the fly so I can keep the definition in alphabetical order...Procurable
@Stomp:Yes that can be done. Keep the list alphabetical in the program and just before you actually apply it, you can sort it by length.Glaive
P
6

It looks like you're trying to word break things. To do that you need the entire expression to be correct, your current one is not. Try this one instead..

new Regex(@"\b(in|int|into|internal|interface)\b");

The "\b" says to match word boundaries, and is a zero-width match. This is locale dependent behavior, but in general this means whitespace and punctuation. Being a zero width match it will not contain the character that caused the regex engine to detect the word boundary.

Photooffset answered 7/3, 2010 at 3:28 Comment(3)
Adding \b will elicit the desired behavior, but you're mistaken about how it works. \b is a zero-width assertion like ^, $, and lookarounds; instead of matching a character, it matches the imaginary gap before or after a character. The beginning or end of a string is automatically a word boundary if the first or last character (respectively) is a word character, so your second regex is just a more verbose version of the first.Bud
@Alan, I tried executing the code, and clearly you're right. I'll need to double check the code at work to see what we're doing there... Perhaps we're using \W and not \b. I know we were getting "non-word" characters of some sort in a similar situation where I know we had some funky noon-capturing groups setup. As for it being locale sensitive, that's going to be the case as word boundaries will be defined differently based on the role of punctuation.Photooffset
@Alan, I amended my answer to reflect your feedback.Photooffset
C
3

According to RegularExpressions.info, regular expressions are eager. Therefore, when it goes through your piped expression, it stops on the first solid match.

My recommendation would be to store all of your keywords in an array or list, then generate the sorted, piped expression when you need it. You would only have to do this once too as long as your keyword list doesn't change. Just store the generated expression in a singleton of some sort and return that on regex executions.

Consequence answered 7/3, 2010 at 2:37 Comment(1)
@Consequence - Thanks for the links! I was searching on MSDN and must have missed that it was eagerly looking for the first match.Procurable

© 2022 - 2024 — McMap. All rights reserved.