Backtracking a balancing group in a greedy repetition may cause imbalance?
Asked Answered
R

1

4

As a generically brewed example for the purpose of this question, my intent is to match some number of a's, then an equal number of b's, plus one more b.

Examine the two patterns exhibited in this snippet (also on ideone.com):

var r1 = new Regex(@"(?xn)
   (?<A> a)+   (?<B-A> b)+    (?(A)(?!))   b
");
var r2 = new Regex(@"(?xn)
   (?<A> a)+   (?<B-A> b)+?   (?(A)(?!))   b
");

Console.WriteLine(r1.Match("aaabbb"));
// aaabbb

Console.WriteLine(r2.Match("aaabbb"));
// aabbb

Note that there is a difference in the matches of the two patterns. r1, which uses a greedy repetition on the balancing group construct, matches 3 a's and 3 b's, which is NOT as intended. r2, which uses a reluctant repetition, gives me 2 a's and 3 b's, which IS as intended.

The only way I can explain this is that when (?<B-A> b)+ backtracks to match one less b, it pops from the B stack but DOES NOT push back what was correspondingly popped from the A stack. Thus, even though one less b is now matched due to backtracking, the A stack remains empty. This is the only way I can explain how r1 can match aaabbb.

Note that using reluctant +? in r2 doesn't cause this problem. The way I see it, this is because unlike greedy repetition, a reluctant repetition doesn't have to "undo the damage" to the A stack, so-to-speak. By contrast, the greedy repetition causes as much "damage" as possible, but the backtracking fails to "leave things as they were" to the A stack.

Is this a correct analysis of what happened? And if so, is this behavior by design? Because what it basically looks like to me is that backtracking a balancing group in a greedy repetition may cause imbalance, and thus this could potentially be categorized as a bug (or at the very least a somewhat astonishing behavior that is inadequately documented).

Romanesque answered 17/9, 2010 at 2:40 Comment(10)
I cannot reproduce your observations. I pasted your code directly into Visual Studio and it outputs aabbb both times, as expected.Minuend
Works for me as well. I am using version 4.0 of the .NET framework.Extempore
@Jens, @Timwi: do you know of any other online apps where I can paste my C# snippet and have it run against various versions of the .NET framework? Because obviously the one on ideone.com gives a different output.Romanesque
RegExLib.com (regexlib.com/RETester.aspx)Grindelia
Nope, sorry. Ideone only tells that its using the 2.0 compiler, which means .NET 2.0, 3.0 or 3.5.Extempore
If you print Environment.Version on IdeOne you get 2.0.50727.1433 (runtime version). I cannot reproduce the bug targeting .Net 2.0, but I have version 2.0.50727.4206.Martinemartineau
@Extempore - It is unlikely they use the 2.0 compiler, it supports lambdas and var.Martinemartineau
@Kobi: lambda and var were introduced somewhere in 2.0, 3.0 or 3.5, all of which are using the same compiler and clr(the 2.0 one) (see the wikipeia, e.g. en.wikipedia.org/wiki/.NET_Framework#.NET_Framework_3.0)Extempore
"astonishing behavior" is that regular expressions are being used for what is clearly a non-regular context-free language :-)Kanaka
Update: With Microsoft opening much of the .Net source code, Mono will be using Microsoft's implementation for System.Text.RegularExpressions, so I'm expecting most of these Mono-specific bugs to go away: mono-project.com/docs/about-mono/releases/4.0.0 .Martinemartineau
S
2

This is a bug in Mono.

The reason why people are getting .NET-like Environment.Version on IdeOne is Mono requirement of backward compatibility with .NET, including compatibility with applications that take decisions based on the framework version.

Sobel answered 12/3, 2012 at 22:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.