TL;DR: Using capturing groups (and in particular balancing groups) inside .NET's lookbehinds changes the obtained captures, although it shouldn't make a difference. What is it about .NET's lookbehinds that breaks the expected behavior?
I was trying to come up with an answer to this other question, as an excuse to play around with .NET's balancing groups. However, I cannot get them to work inside a variable-length lookbehind.
First of all, note that I do not intend to use this particular solution in production. It's more for academic reasons, because I feel that there is something going on with the variable-length lookbehind which I am not aware of. Knowing that could come in handy in the future, when I actually need to use something like this to solve a problem.
Consider this input:
~(a b (c) d (e f (g) h) i) j (k (l (m) n) p) q
The goal is to match all letters that are inside parentheses preceded by ~, no matter how deep down (so everything from a to i). My attempt was to check for the correct position in a lookbehind, so that I can get all letters in a single call to Matches. Here is my pattern:
(?<=~[(](?:[^()]*|(?<Depth>[(])|(?<-Depth>[)]))*)[a-z]
In the lookbehind I try to find a ~(, and then I use the named group stack Depth to count extraneous opening parentheses. As long as the parenthesis opened in ~( is never closed, the lookbehind should match. If the closing parenthesis to that is reached, (?<-Depth>...) cannot pop anything from the stack and the lookbehind should fail (that is, for all letters from j). Unfortunately, this does not work. Instead, I match a, b, c, e, f, g and m. So only these:
~(a b (c) _ (e f (g) _) _) _ (_ (_ (m) _) _) _
That seems to mean that the lookbehind cannot match anything once I have closed a single parenthesis, unless I go back down to the highest nesting level I have been to before.
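For reference, the result I am after can be stated without any regex at all. The following is only a minimal Python sketch of the intended depth counting, not .NET code and not a replacement for the pattern; the function name letters_inside_tilde_parens is made up for illustration, and it assumes the input contains only lowercase letters, spaces, parentheses and ~.

```python
def letters_inside_tilde_parens(text):
    """Collect lowercase letters enclosed by a group that was opened with '~('."""
    result = []
    depth = 0  # 0 means we are currently outside any '~(' group
    i = 0
    while i < len(text):
        ch = text[i]
        if depth == 0:
            # Only a literal '~(' can open a group we care about.
            if text.startswith("~(", i):
                depth = 1
                i += 2
                continue
        elif ch == "(":
            depth += 1  # extraneous opening parenthesis (push)
        elif ch == ")":
            depth -= 1  # closing parenthesis (pop); reaching 0 leaves the group
        elif ch.islower():
            result.append(ch)
        i += 1
    return result


sample = "~(a b (c) d (e f (g) h) i) j (k (l (m) n) p) q"
print("".join(letters_inside_tilde_parens(sample)))  # prints "abcdefghi"
```

This is the set of letters I expected the lookbehind version to return, as opposed to the a, b, c, e, f, g and m that I actually get.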
Okay, this could just mean there is something odd with my regular expression, or I did not understand the balancing groups properly. But then I tried this without the lookbehind. I created a string for every letter like this:
~(z b (c) d (e f (x) y) g) h (i (j (k) l) m) n
~(a z (c) d (e f (x) y) g) h (i (j (k) l) m) n
~(a b (z) d (e f (x) y) g) h (i (j (k) l) m) n
....
~(a b (c) d (e f (x) y) g) h (i (j (k) l) z) n
~(a b (c) d (e f (x) y) g) h (i (j (k) l) m) z
And used this pattern on each of those:
~[(](?:[^()]*|(?<Depth>[(])|(?<-Depth>[)]))*z
And as desired, all cases match where z replaces a letter between a and i, and all the cases after that fail.
So what does the (variable-length) lookbehind do that breaks this use of balancing groups? I tried to research this all evening (and found pages like this one), but I could not find a single use of this in a lookbehind.
I would also be glad if someone could link me to some in-depth information about how the .NET regex engine handles .NET-specific features internally. I found this amazing article, but it does not seem to go into (variable-length) lookbehinds, for instance.
a to i in a single match, using something similar to the lookbehind-free pattern. The point is less to find an algorithm that does the correct matching, but rather to find an explanation for the odd behavior when I use balancing groups and lookbehinds in combination. – Vltava
(?<=(?<A>.)(?<-A>.)) never matches. I'd expect it to match any position after the second. (?<A>)(?<=(?<A>.)(?<-A>.))(?<-A>), however, does match these positions (though a similar approach does not work in your case). I'd note that Mono behaves exactly the same: ideone.com/rvmQhr - also for your pattern ideone.com/Hjb3jn - so maybe there's something in the spec explaining this. – Eshman