Balancing groups in variable-length lookbehind [duplicate]
Asked Answered
V

2

14

TL;DR: Using capturing (and in particular balancing groups) inside .NET's lookbehinds changes the obtained captures, although it shouldn't make a difference. What is it with .NET's lookbehinds that breaks the expected behavior?

I was trying to come up with an answer to this other question, as an excuse to play around with .NET's balancing groups. However, I cannot get them to work inside a variable-length lookbehind.

First of all, note that I do not intend to use this particular solution productively. It's more for academic reasons, because I feel that there is something going on with the variable-length lookbehind which I am not aware of. And knowing that could come in handy in the future, when I actually need to use something like this to solve a problem.

Consider this input:

~(a b (c) d (e f (g) h) i) j (k (l (m) n) p) q

The goal is to match all letters, that are inside parentheses that are preceded by ~, not matter how deep down (so everything from a to i). My attempt was to check for the correct position in a lookbehind, so that I can get all letters in a single call to Matches. Here is my pattern:

(?<=~[(](?:[^()]*|(?<Depth>[(])|(?<-Depth>[)]))*)[a-z]

In the lookbehind I try to find a ~(, and then I use the named group stack Depth to count extraneous opening parentheses. As long as the parenthesis opened in ~( is never closed, the lookbehind should match. If the closing parenthesis to that is reached, (?<-Depth>...) cannot pop anything from the stack and the lookbehind should fail (that is, for all letters from j). Unfortunately, this does not work. Instead, I match a, b, c, e, f, g and m. So only these:

~(a b (c) _ (e f (g) _) _) _ (_ (_ (m) _) _) _

That seems to mean that the lookbehind cannot match anything once I have closed a single parenthesis, unless I go back down to the highest nesting level I have been to before.

Okay, this could just mean there is something odd with my regular expression, or I did not understand the balancing groups properly. But then I tried this without the lookbehind. I created a string for every letter like this:

~(z b (c) d (e f (x) y) g) h (i (j (k) l) m) n
~(a z (c) d (e f (x) y) g) h (i (j (k) l) m) n
~(a b (z) d (e f (x) y) g) h (i (j (k) l) m) n
....
~(a b (c) d (e f (x) y) g) h (i (j (k) l) z) n
~(a b (c) d (e f (x) y) g) h (i (j (k) l) m) z

And used this pattern on each of those:

~[(](?:[^()]*|(?<Depth>[(])|(?<-Depth>[)]))*z

And as desired, all cases match, where z replaces a letter between a and i and all the cases after that fail.

So what does the (variable-length) lookbehind do that breaks this use of balancing groups? I tried to research this all evening (and found pages like this one), but I could not find a single use of this in a lookbehind.

I would also be glad, if someone could link me to some in-depth information about how the .NET regex engine handles .NET-specific features internally. I found this amazing article, but it does not seem to go into (variable-length) lookbehinds, for instance.

Vltava answered 15/11, 2012 at 0:4 Comment(9)
See Fun With .NET Regex Balancing Groups as well.Pod
@OmegaMan, thanks, but I found that too (actually that was the page I based my code on). Unfortunately, this one too does not use any balanced groups in a lookbehind.Vltava
so, would a single result like "a b (c) d (e f (g) h) i" be acceptable or must it be 9 matches a, b, c, d, e, f, g, h, and i?Evergreen
@Evergreen I can easily capture all letters from a to i in a single match, using something similar to the lookbehind-free pattern. The point is less to find an algorithm that does the correct matching, but rather to find an explanation for the odd behavior when I use balancing groups and lookbehinds in combination.Vltava
I looked at it and was able to reduce the phenomenon: Something like (?<=(?<A>.)(?<-A>.)) never matches. I'd expect it to match any position after the second. (?<A>)(?<=(?<A>.)(?<-A>.))(?<-A>), however, does match these positions (though a similar approach does not work in your case). I'd note that Mono behaves exactly the same: ideone.com/rvmQhr - also for your pattern ideone.com/Hjb3jn - so maybe there's something in the spec explaining this.Eshman
@Eshman nice work! I hope that narrows it down. Although I don't quite get why your second regex works. All you do is put a dummy element at the bottom of your stack during the execution of the lookbehind, right?Vltava
Also, I have also done some useless things with .net regular expressions: kobikobi.wordpress.com/tag/regex . Mono had/has some bugs with these. A similar problem/feature I saw in .net is inability to repeat a zero-length group, but that is another issue.Eshman
@m.buettner - Yes, that's all it does. Another empty capture.Eshman
@Eshman ah cool, you are that Kobi. I had already found your blog when researching the issue myself ;). Although it didn't contain an answer tp my problem, I really liked those purely academic regex solutions you have there :DVltava
E
13

I think I got it.
First, as I mentioned in one of the comments, (?<=(?<A>.)(?<-A>.)) never matches.
But then I thought, what about (?<=(?<-A>.)(?<A>.))? It does match!
And how about (?<=(?<A>.)(?<A>.))? Matched against "12", A is captures "1", and if we look at the Captures collection, it is {"2", "1"} - first two, then one - it is reversed.
So, while inside a lookbehind, .net matches and captures from the right to the left.

Now, how can we make it capture from left to right? This is quite simple, really - we can trick the engine using a lookahead:

(?<=(?=(?<A>.)(?<A>.))..)

Applied to your original pattern, the simplest option I came up with was:

(?<=
    ~[(]
    (?=
        (?:
            [^()]
            |
            (?<Depth>[(])
            |
            (?<-Depth>[)])
        )*
        (?<=(\k<Prefix>))   # Make sure we matched until the current position
    )
    (?<Prefix>.*)           # This is captured BEFORE getting to the lookahead
)
[a-z]

The challenge here was that now the balanced part may end anywhere, so we make it reach all the way to the current position (Something like \G or \Z would be useful here, but I don't think .net has that)

It is very possible this behavior is documented somewhere, I'll try to look it up.

Here's another approach. The idea is simple - .net wants to match from right to left? Fine! Take that:
(tip: start reading from the bottom - that is how .net does it)

(?<=
    (?(Depth)(?!))  # 4. Finally, make sure there are no extra closed parentheses.
    ~\(
    (?>                     # (non backtracking)
        [^()]               # 3. Allow any other character
        |
        \( (?<-Depth>)?     # 2. When seeing an open paren, decreace depth.
                            #    Also allow excess parentheses: '~((((((a' is OK.
        |
        (?<Depth>  \) )     # 1. When seeing a closed paren, add to depth.
    )*
)
\w                          # Match your letter
Eshman answered 16/11, 2012 at 23:11 Comment(7)
I can't really find anything on msdn, not even on Right-to-Left Mode. Very minimal. Maybe it's on the BCL.Eshman
Wow, this is great. See, I noticed the reverse order of captures, but I assumed that was somehow due to the stack-like nature of the groups (which is, of course, utter nonsense; but right-to-left matching just did not occur to me at all). I will play around with this tomorrow to see how reliable this is and whether I can find other workarounds. Thanks a lot for taking the time to investigate this!Vltava
Just found this. Containing (amongst other things) "The Right-to-Left Mode that powers .NET's lookbehind...". And this book also confirms it.Vltava
Naturally, this fails on Mono: ideone.com/GuCNVM . Most of my patterns fail on Mono.Eshman
Ah that last pattern of yours is great. I was trying to do something similar, but I didn't come up with the (?<-Depth)? trick. That's really neat!Vltava
@m.buettner - Thanks! \( (?<-Depth>)? is really the same as (?<-Depth>\()|\( - nothing too special :). I looked semi-hard (~20 minutes), but couldn't find exact documentation for the order of matching (and Regex in general). Thanks for an interesting question!Eshman
+1, I was trying to figure out why my balanced-parentheses lookbehind wasn't working and this was exactly why. Just switched the <x> and <-x> groups around.Laxity
T
2

I think the problem is with the data and not the pattern. The data has 'Post' items which need to be matched such as

(a b ( c ) d e f )

where d e and f are needed to be matched. A more balanced data would be

(a b (c)(d)(e)(f))


So the tack I took on this example data required a post match situation after braces:

~(a b (c) d (e f (g) h) i) j k

where j & k should be ignored...my pattern failed and captured them.

The interesting thing is that I named the captures groups to find out where they came in and j and k came in in capture three. I leave you with, not an answer, but an attempt to see if you can improve on it.

(~                         # Anchor to a Tilde
 (                         # Note that \x28 is ( and \x29 is )      
  (                          # --- PRE ---
     (?<Paren>\x28)+          # Push on a match into Paren
     ((?<Char1>[^\x28\x29])(?:\s?))*
   )+                         # Represents Sub Group 1
  (                           #---- Closing
   ((?<Char2>[^\x28\x29])(?:\s?))*
   (?<-Paren>\x29)+           # Pop off a match from Paren

  )+  
  (
     ((?<Char3>[^\x28\x29])(?:\s?))*   # Post match possibilities
  )+

 )+
(?(Paren)(?!))    # Stop after there are not parenthesis    
)

Here is the match broken out with a tool I have created on my own (maybe one day I will publish). Note that the ˽ shows where a space was matched.

Match #0
               [0]:  ~(a˽b˽(c)˽d˽(e˽f˽(g)˽h)˽i)˽j˽k
       ["1"] → [1]:  ~(a˽b˽(c)˽d˽(e˽f˽(g)˽h)˽i)˽j˽k
       →1 Captures:  ~(a˽b˽(c)˽d˽(e˽f˽(g)˽h)˽i)˽j˽k
       ["2"] → [2]:  (e˽f˽(g)˽h)˽i)˽j˽k
       →2 Captures:  (a˽b˽(c)˽d˽, (e˽f˽(g)˽h)˽i)˽j˽k
       ["3"] → [3]:  (g
       →3 Captures:  (a˽b˽, (c, (e˽f˽, (g
       ["4"] → [4]:  g
       →4 Captures:  a˽, b˽, c, e˽, f˽, g
       ["5"] → [5]:  ˽i)
       →5 Captures:  ), ), ˽h), ˽i)
       ["6"] → [6]:  i
       →6 Captures:  ˽, h, ˽, i
       ["7"] → [7]:  
       →7 Captures:  ˽d˽, , ˽j˽k, 
       ["8"] → [8]:  k
       →8 Captures:  ˽, d˽, ˽, j˽, k
   ["Paren"] → [9]:  
  ["Char1"] → [10]:  g
      →10 Captures:  a, b, c, e, f, g
  ["Char2"] → [11]:  i
      →11 Captures:  ˽, h, ˽, i
  ["Char3"] → [12]:  k
      →12 Captures:  ˽, d, ˽, j, k
Truss answered 15/11, 2012 at 4:37 Comment(4)
That is a really useful tool! However, I think the reason you get a match for j and k is simply that you count the first opening ( on the Parens stack, so it is perfectly valid to remove the last one again. Whereas, in my pattern, I don't count the ( in ~( so I should never be able to close it. (And if I don't use a lookbehind, but use it just as a plain pattern as you did, my balancing seems to work perfectly fine). In any case, thanks for the idea of using named capturing groups for everything to find out more about the match! I'll have a look into that!Vltava
@m.buettner It was the last thing I did last night, so the post wasn't as concise as I could make it, but your problem intriguied me enough to post.Pod
And I'm genuinely thankful for your contribution ;). I'll let you know when anything comes from analyzing more captures with the lookbehind attempt.Vltava
@m.buettner For more exposure on this issue consider posting it to the MSDN Regex ForumPod

© 2022 - 2024 — McMap. All rights reserved.