Regex: capturing groups within capture groups

Intro

(you can skip to "What if..." if you get bored with intros)

This question is not aimed at VBScript in particular (I just happened to use it here): I want a solution for regular expression usage in general (editors included).

This started when I wanted to create an adaptation of Example 4 where 3 capture groups are used to split data across 3 cells in MS Excel. I needed to capture one entire pattern and then, within it, capture 3 other patterns. However, in the same expression, I also needed to capture another kind of pattern and again capture 3 other patterns within it (yeah I know... but before pointing the nutjob finger, please finish reading).

I first thought of Named Capturing Groups, but then I realized that I should not «mix named and numbered capturing groups» since it «is not recommended because flavors are inconsistent in how the groups are numbered».

Then I looked into VBScript SubMatches and «non-capturing» groups and I got a working solution for a specific case:

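' Late-bound regex object (the original snippet assumed regEx already existed).
Set regEx = CreateObject("VBScript.RegExp")

' Two "parent" alternatives inside one non-capturing group:
' SubMatches 0-2 belong to the first, SubMatches 3-5 to the second.
strPattern = "(?:^([0-9]+);([0-9]+);([0-9]+)$|^.*:([0-9]+)\s.*:([0-9]+).*:([a-zA-Z0-9]+)$)"

With regEx
    .Global = True
    .MultiLine = True
    .IgnoreCase = False
    .Pattern = strPattern
End With

For Each C In Myrange
    strInput = C.Value
    Set rgxMatches = regEx.Execute(strInput)

    ' With no match the loop below never runs, so report it here.
    If rgxMatches.Count = 0 Then
        C.Offset(0, 1) = "(Not matched)"
    End If

    For Each mtx In rgxMatches
        If mtx.SubMatches(0) <> "" Then
            ' First parent pattern matched.
            C.Offset(0, 1) = mtx.SubMatches(0)
            C.Offset(0, 2) = mtx.SubMatches(1)
            C.Offset(0, 3) = mtx.SubMatches(2)
        ElseIf mtx.SubMatches(3) <> "" Then
            ' Second parent pattern matched.
            C.Offset(0, 1) = mtx.SubMatches(3)
            C.Offset(0, 2) = mtx.SubMatches(4)
            C.Offset(0, 3) = mtx.SubMatches(5)
        End If
    Next
Next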

Here's a demo of the regex in Rubular. For these inputs:

124;12;3
my id1:213 my id2:232 my word:ins4yanrgx
:8587459 :18254182540215 :dcpt
0;1;2

It fills the first 2 cells with numbers and the 3rd with a number or a word. Basically I used a non-capturing group with 2 "parent" patterns ("parents" = broad patterns in which I want to detect other sub-patterns). If the 1st parent pattern has a matching sub-pattern (1st capture group), I place its value and the remaining captured groups of this pattern in the 3 cells. If not, I check whether the 4th capture group (belonging to the 2nd parent pattern) was matched and place it and the remaining sub-patterns of that parent in the same 3 cells.
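For readers outside VBScript, the same selection logic can be sketched in Python with the standard `re` module. This is only an illustration of the idea described above, not part of the original macro; the function name `split_cells` is made up for the example.

```python
import re

# Same pattern as the VBScript snippet: groups 1-3 belong to the first
# "parent" alternative, groups 4-6 to the second.
PATTERN = re.compile(
    r"(?:^([0-9]+);([0-9]+);([0-9]+)$"
    r"|^.*:([0-9]+)\s.*:([0-9]+).*:([a-zA-Z0-9]+)$)"
)

def split_cells(text):
    """Return the three captured values, or None when nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    if m.group(1) is not None:   # first parent matched
        return m.group(1, 2, 3)
    return m.group(4, 5, 6)      # otherwise the second parent matched

samples = [
    "124;12;3",
    "my id1:213 my id2:232 my word:ins4yanrgx",
    ":8587459 :18254182540215 :dcpt",
    "0;1;2",
]
for s in samples:
    print(split_cells(s))
```

Whichever parent matches, the caller always receives a 3-tuple, which is the behaviour the three target cells need.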

What if...

Instead of having something like this:

(?:^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever))

Something like this could be possible:

(#:^(\d+);(\d+);(\d+)$)|(#:^.*:(\d+)\s.*:(\d+).*:(\w+)$)|(#:what(ever))

Here (#:, instead of creating a non-capturing group, would create a "parent" numbered capture group. In this way I could do something similar to Example 4:

C.Offset(0, 1) = regEx.Replace(strInput, "#$1")
C.Offset(0, 2) = regEx.Replace(strInput, "#$2")
C.Offset(0, 3) = regEx.Replace(strInput, "#$3")

It would search parent patterns until it finds a match in a child pattern (the first match would be returned and, ideally, wouldn't search the remaining ones).
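Since the `(#:` syntax is hypothetical, the behaviour I am after can at least be approximated in code. A minimal Python sketch, assuming each parent is kept as a separate ordinary regex whose own groups are its children (the names `PARENTS` and `child` are made up for this example):

```python
import re

# Hypothetical stand-ins for the "(#:...)" parents: each parent is an
# ordinary regex and its capture groups are the "children".
PARENTS = [
    re.compile(r"^(\d+);(\d+);(\d+)$"),
    re.compile(r"^.*:(\d+)\s.*:(\d+).*:(\w+)$"),
    re.compile(r"what(ever)"),
]

def child(text, n, parents=PARENTS):
    """Emulate "#$n": child n of the first parent that matches text.

    Returns None if no parent matches, or if the matching parent has
    fewer than n children.
    """
    for parent in parents:
        m = parent.search(text)
        if m is not None:
            # Stop at the first matching parent; later ones are not tried.
            return m.group(n) if n <= parent.groups else None
    return None

print(child("124;12;3", 2))
print(child("whatever", 1))
```

This is of course a loop in code, which is exactly what a pure search-and-replace in an editor cannot do; it only shows the lookup order I have in mind.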

Does something like this already exist? Or am I missing a regex feature that makes this possible?

Other possible variations:

  • refer to the parent and child pattern directly, e.g.: #2$3 (this would be equivalent of $6 in my example);
  • create as many capturing groups as necessary within others (I guess it would be more complex but also the most interesting part as well), e.g.: with regex (same syntax) like (#:^_(?:(#:(\d+):\w+-(\d))|(#:\w+:(\d+)-(\d+)))_$)|(#:^\w+:\s+(#:(\w+);\d-(\d+))$) and fetching ##$1 in patterns like:

    _123:smt-4_ it would match in: 123
    _ott:432-10_ it would match in: 432
    yant: special;3-45235 it would match in: special
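With existing syntax, these three samples can be handled by flattening the hierarchy into ordinary groups and taking the first group that took part in the match. A rough Python sketch, assuming that shortcut is acceptable (the name `first_child` is made up, and it only coincides with `##$1` here because the first child of each innermost parent is also the first group in its branch):

```python
import re

# The nested "(#:...)" example rewritten with plain groups only.
FLAT = re.compile(
    r"^_(?:(\d+):\w+-(\d)|\w+:(\d+)-(\d+))_$"
    r"|^\w+:\s+(\w+);\d-(\d+)$"
)

def first_child(text):
    """Return the first group that participated in the match, or None."""
    m = FLAT.search(text)
    if m is None:
        return None
    # Groups from branches that did not match are None in Python.
    return next((g for g in m.groups() if g is not None), None)

for s in ["_123:smt-4_", "_ott:432-10_", "yant: special;3-45235"]:
    print(s, "->", first_child(s))
```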

Please tell me if you noticed any mistakes or flaws in this logic, I will edit asap.

Sacramental answered 13/5, 2015 at 14:50 Comment(3)
It looks as if you are trying to make things seem more difficult than they are. You can use capture groups inside capture groups; they are numbered or named, and you can always access them that way. IMO, there is no practical need to create such a hierarchy of capture groups. Maybe the .NET Captures property ("Gets a collection of all the captures matched by the capturing group, in innermost-leftmost-first order") is close to your requirements. Still, you cannot access them the way you describe. - Necessary
@stribizhev thanks, I actually saw those .NET captures before (the example I wrote uses SubMatches to get the capture groups inside the non-capturing group; by calling Execute again on these I could go down the hierarchy indefinitely, I suppose). The practical need is that anyone could do a search and replace in any IDE/editor with regex support, instead of coding loops for the same purpose... In this rubular, wouldn't it make more sense for each match to show only 3 results instead of 6, with 3 of them empty every time? - Sacramental
That is no problem for .NET, where you can use multiple named capturing groups and they are rewritten in case the match is non-empty. Try using named groups in .NET for corresponding capture groups on both sides of the alternation operator. - Necessary

This is the usual situation where mostly the same data is to be captured
and the only difference is its form.

There is a regex construct for that, called Branch Reset.
It's offered by most Perl-compatible engines, but not by Java or .NET.
It mostly just saves regex resources and makes the matches easier to handle.

The alternative you mention will not help in any way; it actually just uses
more resources. You still have to see what matched to know where you are,
although you only have to check one group within a cluster to tell which other
groups are valid (this check is unnecessary when using branch reset).

(below was constructed using RegexFormat 6)

Here is the branch reset version:

 # (?|^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever)()())

 (?|
      ^ 
      ( \d+ )                       # (1)
      ;
      ( \d+ )                       # (2)
      ;
      ( \d+ )                       # (3)
      $ 
   |  
      ^ .* :
      ( \d+ )                       # (1)
      \s .* :
      ( \d+ )                       # (2)
      .* :
      ( \w+ )                       # (3)
      $ 
   |  
      what
      ( ever )                      # (1)
      ( )                           # (2)
      ( )                           # (3)
 )
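For engines without branch reset, the same numbering can be emulated after the fact. The sketch below uses Python's standard `re` module, which does not support `(?|...)`; it checks one group per cluster and pads short clusters with empty strings, mirroring the empty `( )` groups above. The names `CLUSTER_STARTS`, `WIDTH` and `reset_groups` are made up for this illustration.

```python
import re

# The non-branch-reset pattern: clusters start at groups 1, 4 and 7.
PATTERN = re.compile(
    r"(?:^(\d+);(\d+);(\d+)$"
    r"|^.*:(\d+)\s.*:(\d+).*:(\w+)$"
    r"|what(ever))"
)
CLUSTER_STARTS = (1, 4, 7)  # first group of each alternative
WIDTH = 3                   # slots the branch reset version exposes

def reset_groups(text):
    """Return (g1, g2, g3) numbered as the branch reset pattern would."""
    m = PATTERN.search(text)
    if m is None:
        return None
    for start in CLUSTER_STARTS:
        # One check per cluster tells which alternative matched.
        if m.group(start) is not None:
            stop = min(start + WIDTH, PATTERN.groups + 1)
            found = [m.group(i) for i in range(start, stop)]
            # Pad short clusters, like the empty () () groups do.
            return tuple(found + [""] * (WIDTH - len(found)))
    return None

print(reset_groups("0;1;2"))
print(reset_groups("whatever"))
```

With a real branch reset group none of this bookkeeping is needed, which is the point of the construct.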

Here are your two regexes. Notice that the 'parent' capturing actually increases the number of groups (which slows down the engine):

 # (?:^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever))

 (?:
      ^ 
      ( \d+ )                       # (1)
      ;
      ( \d+ )                       # (2)
      ;
      ( \d+ )                       # (3)
      $ 
   |  
      ^ .* :
      ( \d+ )                       # (4)
      \s .* :
      ( \d+ )                       # (5)
      .* :
      ( \w+ )                       # (6)
      $ 
   |  
      what
      ( ever )                      # (7)
 )

and

    # (#:^(\d+);(\d+);(\d+)$)|(#:^.*:(\d+)\s.*:(\d+).*:(\w+)$)|(#:what(ever))

    (                             # (1 start)
         \#: ^ 
         ( \d+ )                       # (2)
         ;
         ( \d+ )                       # (3)
         ;
         ( \d+ )                       # (4)
         $ 
    )                             # (1 end)
 |  
    (                             # (5 start)
         \#: ^ .* :
         ( \d+ )                       # (6)
         \s .* :
         ( \d+ )                       # (7)
         .* :
         ( \w+ )                       # (8)
         $ 
    )                             # (5 end)
 |  
    (                             # (9 start)
         \#:what
         ( ever )                      # (10)
    )                             # (9 end)
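
The group counts in the two listings can be checked mechanically. A small Python sketch (Python is used only as a convenient engine here; in it, as in the listing above, `(#:` is read as a plain capturing group followed by the literal characters `#:`):

```python
import re

# Version with a single non-capturing wrapper.
flat = re.compile(
    r"(?:^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever))"
)

# Version with the proposed "(#:" parents, which an existing engine
# treats as ordinary capturing groups containing a literal "#:".
nested = re.compile(
    r"(#:^(\d+);(\d+);(\d+)$)|(#:^.*:(\d+)\s.*:(\d+).*:(\w+)$)|(#:what(ever))"
)

print(flat.groups)    # 7
print(nested.groups)  # 10
```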
Robynroc answered 13/5, 2015 at 16:35 Comment(2)
Wow! +1 for this! Almost a bullseye shot. In regular-expressions.info they actually explain that a «branch reset group» groups the alternatives and merges their capturing groups. I tried RegexFormat 6 but it crashes when run on my Win 7. However, I was eager for more: direct access to child patterns identified by their parents... That's why I was using that weird syntax with (#: for the regex and #$1 for matched patterns (it doesn't actually exist in regex syntax... not that I know of, anyway :). - Sacramental
@Sacramental - RegexFormat 6 installs and runs perfectly on my Win 7 machine, both 32 and 64 bit. What problem did you have and what sub-version did you install? 6.02 is the current one. I know the folks who wrote the software, and I think they give out free registration keys for legitimate software bugs. Send a detailed report on how to reproduce the bug to [email protected] and you might get a free key. Back to the subject: if I were a regex engine designer I would not buffer nested capture groups, just maintain an array/linked-list of iterators into the source string. - Robynroc