Capturing <thisPartOnly> and (thisPartOnly) with the same group
Asked Answered
C

6

4

Let's say we have the following input:

<amy>
(bob)
<carol)
(dean>

We also have the following regex:

<(\w+)>|\((\w+)\)

Now we get two matches (as seen on rubular.com):

  • <amy> is a match, \1 captures amy, \2 fails
  • (bob) is a match, \2 captures bob, \1 fails

This regex does most of what we want, which are:

  • It matches the open and close brackets properly (i.e. no mixing)
  • It captures the part we're interested in

However, it does have a few drawbacks:

  • The capturing pattern (i.e. the "main" part) is repeated
    • It's only \w+ in this case, but generally speaking this can be quite complex,
      • If it involves backreferences, then they must be renumbered for each alternate!
      • Repetition makes maintenance a nightmare! (what if it changes?)
  • The groups are essentially duplicated
    • Depending on which alternate matches, we must query different groups
      • It's only \1 or \2 in this case, but generally the "main" part can have capturing groups of their own!
    • Not only is this inconvenient, but there may be situations where this is not feasible (e.g. when we're using a custom regex framework that is limited to querying only one group)
  • The situation quickly worsens if we also want to match {...}, [...], etc.

So the question is obvious: how can we do this without repeating the "main" pattern?

Note: for the most part I'm interested in java.util.regex flavor, but other flavors are welcomed.


Appendix

There's nothing new in this section; it only illustrates the problem mentioned above with an example.

Let's take the above example to the next step: we now want to match these:

<amy=amy>
(bob=bob)
[carol=carol]

But not these:

<amy=amy)   # non-matching bracket
<amy=bob>   # left hand side not equal to right hand side

Using the alternate technique, we have the following that works (as seen on rubular.com):

<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\]

As explained above:

  • The main pattern can't simply be repeated; backreferences must be renumbered
  • Repetition also means maintenance nightmare if it ever changes
  • Depending on which alternate matches, we must query either \1 \2, \3 \4, or \5 \6
Covered answered 2/7, 2010 at 12:46 Comment(2)
My thought has always been that if the problem requires an appendix, maybe it's a problem best not solved through regex.Flyman
@Mark: for the most part this is a self-study, not an on-the-job scenario. For the main part of the question I used the simplest example; the appendix is more complex, but emphasizes the points stronger, which is why I thought it's worth including.Covered
F
5

You can use a lookahead to "lock in" the group number before doing the real match.

String s = "<amy=amy>(bob=bob)[carol=carol]";
Pattern p = Pattern.compile(
  "(?=[<(\\[]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\])");
Matcher m = p.matcher(s);

while(m.find())
{
  System.out.printf("found %s in %s%n", m.group(2), m.group());
}

output:

found amy in <amy=amy>
found bob in (bob=bob)
found carol in [carol=carol]

It's still ugly as hell, but you don't have to recalculate all the group numbers every time you make a change. For example, to add support for curly brackets, it's just:

"(?=[<(\\[{]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\]|\\{\\1\\})"
Fructuous answered 2/7, 2010 at 14:7 Comment(1)
+1. Oh.My.God. Genius. Doing it this way also makes the bracket pairing explicit since it's closer together instead of separated by the "main" part. I applaud you, sir.Covered
L
3

In preg (Perl Regex library), this will match your example, and \3 will catch the insides:

((<)|\()(\w+)(?(2)>|\))

It will not work in JS, though - you did not specify the dialect...

It depends on the conditional operator (?(2)...|...) which basically says if 2 is a non-null capture, then match before the pipe, else match after the pipe. In this form, pipe is not alternation ("or").

UPDATE Sorry, I completely missed the Java bit :) Anyway, apparently Java does not support the conditional construct; and I have no idea how else I'd go about it :(

Also, for your Appendix (even though it's the wrong dialect):

(?:(<)|(\()|\[)(\w+)=\3(?(1)>|(?(2)\)|]))

The name is in again in \3 (I got rid of the first capturing paren, but I had to add another one for one extra opening paren check)

Leptosome answered 2/7, 2010 at 12:54 Comment(3)
He did specify dialect - java.util.regex - I've just updated tags to reflect this.Rompers
Oops, sorry, I did not notice. I have no idea whether it works in Java... :/Leptosome
It doesn't work in Java, but it works in .NET. Java regex engine doesn't support if-else construct in regex.Squamation
C
3

The only solution that I was able to come up with is inspired by technique of capturing an empty string on different alternates; backreferencing to these groups later can serve as pseudo-conditionals.

Thus, this pattern works for the second example (as seen on rubular.com):

                  __main__
                 /        \
(?:<()|\(()|\[())((\w+)=\5)(\1>|\2\)|\3\])
\_______________/          \_____________/
    \1   \2   \3

So essentially for each opening bracket, we assign a group that captures an empty string. Then when we try to match the closing bracket, we see which group was succesful, and match the corresponding closing bracket.

The "main" part does not have to be repeated, but in Java, backreferences may have to be renumbered. This won't be a problem in flavors that support named groups.

Covered answered 2/7, 2010 at 13:14 Comment(2)
That's kind of ingenious. Ugly, but damn ingenious! +1Leptosome
@amadan: This was inspired by Alan Moore's recent answer. Looking for it right now... (found it! #3101866) -- this is essentially "the same" as your answer (+1 from me), except that it doesn't rely on direct support for conditionals.Covered
R
0

When you get things like this, using a single regex is a silly restriction, and I simply don't agree with your "maintenance nightmare" to using more than one - repeating a similar-but-different expression several times is likely to be more maintainable (well, less unmaintainable), and maybe even better performance too, than a single overly-complex regex.

But anyway, there's no repetition if you just use variables to compose your regex.

Here's some pseudo-code:

Brackets = "<>,(),[]"
CoreRegex = "(\w+)=\1"

loop CurBracket in Brackets.split(',')
{
    Input.match( Regex.quote(CurBracket.left(1)) & CoreRegex & Regex.quote(CurBracket.right(1)) )
}


(p.s.that's just to give the general idea - I'd probably use already-escaped arrays for the bracket sets in actual implementation).

Rompers answered 2/7, 2010 at 13:2 Comment(0)
I
0

May be this example in Perl will interest you :

$str = q/<amy=amy> (bob=bob) [carol=carol] <amy=amy) <amy=bob>/;
$re = qr/(?:<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\])+/;
@list = ($str =~ /$re/g);
for(@list) {
    say $i++," = ",$_;
}

I just surround your regex by (?:regex)+

Imagery answered 2/7, 2010 at 13:11 Comment(0)
C
0

Assuming there is no easy way to manually write this regular expression, why not leave it to the computer? You could have a function, maybe like below (I am using C# syntax here, as I am a bit more familiar with regexes here than in Java, but it should not be too difficult to adapt it to Java).

Note that I left the function AdaptBackreferences() more or less unimplemented as an exercise to the reader. It should just adapt the backreference numbering.

    struct BracketPair {public string Open; public string Close;};

    static string[] MatchTextInBrackets(string text, string innerPattern, BracketPair[] bracketPairs) {
        StringBuilder sb  = new StringBuilder();

        // count number of catching parentheses of innerPattern here:
        int numberOfInnerCapturingParentheses = Regex.Match("", innerPattern).Groups.Count - 1;

        bool firstTime = true;
        foreach (BracketPair pair in bracketPairs) {
            // apply logic to change backreference numbering:
            string adaptedInnerPattern = AdaptBackreferences(innerPattern);
            if (firstTime) { firstTime = false; } else { sb.Append('|'); }
            sb.Append(pair.Open).Append("(").Append(adaptedInnerPattern).Append(")").Append(pair.Close);
        }
        string myPattern = sb.ToString();
        MatchCollection matches = Regex.Matches(text, myPattern);
        string[] result = new string[matches.Count];
        for(int i=0; i < matches.Count; i++) {
            StringBuilder mb = new StringBuilder();
            for(int j=0; j < bracketPairs.Length; j++) {
                mb.Append(matches[i].Groups[1 + j * (numberOfInnerCapturingParentheses + 1)]); // append them all together, assuming all exept one are empty
            }
            result[i] = mb.ToString();
        }
        return result;
    }

    static string AdaptBackreferences(string pattern) { return pattern; } // to be written
Cutlery answered 2/7, 2010 at 14:11 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.