Is it possible to match nested brackets with a regex without using recursion or balancing groups?
Asked Answered
L

2

36

The problem: Match an arbitrarily nested group of brackets in a flavour of regex such as Java's java.util.regex that supports neither recursion nor balancing groups. I.e., match the three outer groups in:

(F(i(r(s)t))) ((S)(e)((c)(o))(n)d) (((((((Third)))))))

This exercise is purely academic, since we all know that regular expressions are not supposed to be used to match these things, just as Q-tips are not supposed to be used to clean ears.

Stack Overflow encourages self-answered questions, so I decided to create this post to share something I recently discovered.

Legion answered 7/11, 2017 at 15:49 Comment(10)
Why do you ask this question if you know Java doesn't support recursion? This statement since we all know that regular expressions are not supposed to be used to match these things is utter bologny.Kylix
Here is the regex \((?:[^()]++|(?R))*\)Kylix
@sln, I asked the question and answered it - please scroll down. That statement was sarcasm used to pre-emptively deflect those who would, as they always do, reply saying that regex is not the correct tool for this task. Unfortunately, my post here failed to prompt positive, insightful discussion, as demonstrated by your and other responses.Legion
I'm going over your regex right now. It works in Perl. I'm trying to do an equivalent in boost regex but it doesn't do undefined backreference before the fact. It could be because of this it is using a pseudo stack, and I'm gonna find out. So far I do this (?(DEFINE)(?<G2>.*)(?<G1>.*\)(?!.*\k<G2>).*))(?=\()(?:(?=.*?\((?!.*?\k<G1>)(?&G1))(?=.*?\)(?!.*?\k<G2>)(?&G2)).)+?.*?(?=\k<G1>)[^(]*(?=\k<G2>$) but it does not find any.Kylix
Yes, unfortunately I think it's as you said: boost doesn't support forward references. Did you find out anything?Legion
My bad. Of course, groups defined in and called by functions don't retain their captures for usage elsewhere. As to boost doesn't support forward references, I just changed my version so it does. You can try your regex's using it www.regexformat.com. It's a heavily modified boost engine which is now almost Perl. After some testing, it seems your regex exhibits strange behavior when the otherwise normal matches are put on different lines. Example (F(i(r(s)t))) ((S)(e)((c)(o))(n)d) (((((((Third)))))))\n matches the entire line.Kylix
(con't) where (F(i(r(s)t)))\n ((S)(e)((c)(o))(n)d) (((((((Third))))))) matches ((S)(e)((c)(o))(n)d) and (((((((Third))))))) independently but matches ((S)(e)((c)(o))(n)d) (((((((Third))))))) if there are any newlines after it, and always the last line. If you change to use Dot-All it matches any independently and works as it should. To me, this behavior is typical of unbounded assertions and is proof of flawed recursive simulation. Or, maybe not, but something you should check.Kylix
@sln, I would hesitate to call this a "simulation" of regex recursion, since the extent to which this and similar tricks can be used to replicate recursive patterns is not fully clear to me. While this post shows it can be used to replicate it for the purpose of matching balanced structures (which is its most common application), I highly doubt it can be used to replicate it in general and didn't mean to imply that in any of my writing.Legion
As for the behaviour you're observing with new lines, it's to be expected. We need all of those .* parts to potentially match line terminators in order for the method to work correctly with multiple lines. I could force this by changing every . to [\s\S] in the example (and \2$ to \2\z to supersede the 'm' modifier), but I chose to keep it as simple as possible since it is already quite difficult to understand.Legion
Related: the question for which an answer starts with "You can't parse [X]HTML with regex.".Maximin
L
52

Indeed! It's possible using forward references:

(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)

Proof

Et voila; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.

No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.

That's great and all, but I want to match inner groups too!

OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:

(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$))) 

Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.

So... how the hell does this actually work?

I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:

Note Component Description
(?=\() Make sure '(' follows before doing any hard work.
(?: Start of group used to iterate through the string, so the following lookaheads match repeatedly.
Handle '(' (?= This lookahead deals with finding the next '('.
.*?\((?!.*?\1) Match up until the next '(' that is not followed by \1. Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So (?!.*?\1) ensures we don't match the same '(' again
(.*\)(?!.*\2).*) Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid to overcome a bug with capturing groups in lookaheads.
)
Handle ')' (?= This lookahead deals with finding the next ')'
.*?\)(?!.*?\2) Match up until the next ')' that is not followed by \2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before.
(.*) Fill \2 with the rest of the string. The above.mentioned bug is not applicable here, so a simple expression is sufficient.
)
. Consume a single character so that the group can continue matching. It is safe to consume a character because neither occurrence of the next '(' or ')' could possibly exist before the new matching point.
)+? Match as few times as possible until a balanced group has been found. This is validated by the following check
Final validation .*?(?=\1) Match up to and including the last '(' found.
[^(]*(?=\2$) Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group).

Conclusion

So, there you have it. A way to match balanced nested structures using forward references coupled with standard (extended) regular expression features - no recursion or balanced groups. It's not efficient, and it certainly isn't pretty, but it is possible. And it's never been done before. That, to me, is quite exciting.

I know a lot of you use regular expressions to accomplish and help other users accomplish simpler and more practical tasks, but if there is anyone out there who shares my excitement for pushing the limits of possibility with regular expressions then I'd love to hear from you. If there is interest, I have other similar material to post.

Legion answered 7/11, 2017 at 15:49 Comment(11)
(But it is oh so fragile)(Why?)(Because: "a) you could have quoted brackets! :)")("That's not fair :(")Noe
@Noe Haha, this is just for you :) regex101.com/r/Dfao1h/1Legion
@Legion Close, but no cigar! The 4th match comes out as ("), instead of ("That's not fair :(")Noe
@Noe Small oversight, here: regex101.com/r/Dfao1h/4 . I'm a little saddened that people seem to have missed the point of this post :( Don't tell me there's a "b)" too? hahaLegion
Of course there is a "b)"! (Single quoted characters, like: '(', '"', and ')'), not to mention ("c) Escaped \"quotes\" within quotes (\")"). Then we have ("d) actual backslashes, which need escaping, like \\"), so you can't just look for \", because the \ before the " could itself be escaped! I don't think people have missed the point that the post was "academical". I think the issue is this RegEx is unmaintainable: 90 characters of magic codes for the simple version? And it keeps getting longer with more special cases, like "e) parse a regex with your regex": (\([^)]+\))Noe
@Noe I humoured you before, but now I think you're beating a dead horse. Handling character escapes and quoted strings is a very basic extension to my example and has been done many many times before. This is why I say you are missing the point of this post. The point was to introduce something that hasn't been done before. Not to introduce something and claim it will work for every conceivable use case. Many others before me have posted similar demonstrations and have received much more positive feedback. That's why I figured my post fit would it in here.Legion
@Legion Without the $ sign inside the final validation, it would work multiline too... (?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2) regex101.com/r/rKytow/1Anneal
@Denis Wrong, I'm afraid. That '$' is a necessary part of the validation. If \2 = "" (which happens when the last ')' matched is at the end of the string), (?=\2) matches at every position. It will thus match "(" in "(()", which is not desirable. Please see: regex101.com/r/rKytow/2 My regex already has the capability of matching over multiple lines; you just need to use the 's' modifier to force '.' to match line terminators. Please see: regex101.com/r/rKytow/3Legion
@Anneal I promise you don't need to change my regex in any way to get it to match over multiple lines. Your amendment only serves to break it in this case: regex101.com/r/rKytow/5 As I said: all you need is the 's' (DOTALL) modifier.Legion
I'm deeply impressed - hats off! :-) A suggestion to improve this: you could turn this in to a regex matching something containing balanced parentheses by adding [^()]* at both sides. And using named-capturing groups allows easier embedding of this into ones own regexes (where you might want to add stuff and groups outside, e.g. to parse a function call), and it becomes possibly easier to understand.Subauricular
Hi @Legion excellent post...how would one evolve the regex so to escape brackets ( and ) via escape sequence \( and \)?Sniff
D
6

Brief

Input Corrections

First of all, your input is incorrect as there's an extra parenthesis (as shown below)

(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
                                ^

Making appropriate modifications to either include or exclude the additional parenthesis, one might end up with one of the following strings:

Extra parenthesis removed

(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
                                ^

Additional parenthesis added to match extra closing parenthesis

((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
^

Regex Capabilities

Second of all, this is really only truly possible in regex flavours that include the recursion capability since any other method will not properly match opening/closing brackets (as seen in the OP's solution, it matches the extra parenthesis from the incorrect input as noted above).

This means that for regex flavours that do not currently support recursion (Java, Python, JavaScript, etc.), recursion (or attempts at mimicking recursion) in regular expressions is not possible.


Input

Considering the original input is actually invalid, we'll use the following inputs to test against.

(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))

Testing against these inputs should yield the following results:

  1. INVALID (no match)
  2. VALID (match)
  3. VALID (match)

Code

There are multiple ways of matching nested groups. The solutions provided below all depend on regex flavours that include recursion capabilities (e.g. PCRE).

See regex in use here

Using DEFINE block

(?(DEFINE)
  (?<value>[^()\r\n]+)
  (?<groupVal>(?&group)|(?&value))
  (?<group>(?&value)*\((?&groupVal)\)(?&groupVal)*)
)
^(?&group)$

Note: This regex uses the flags gmx

Without DEFINE block

See regex in use here

^(?<group>
  (?<value>[^()\r\n]+)*
  \((?<groupVal>(?&group)|(?&value))\)
  (?&groupVal)*
)$

Note: This regex uses the flags gmx

Without x modifier (one-liner)

See regex in use here

^(?<group>(?<value>[^()\r\n]+)*\((?<groupVal>(?&group)|(?&value))\)(?&groupVal)*)$

Without named (groups & references)

See regex in use here

^(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*)$

Note: This is the shortest possible method that I could come up with.


Explanation

I'll explain the last regex as it's a simplified and minimal example of all the other regular expressions above it.

  • ^ Assert position at the start of the line
  • (([^()\r\n]+)*\(((?1)|(?2))\)(?3)*) Capture the following into capture group 1
    • ([^()\r\n]+)* Capture the following into capture group 2 any number of times
      • [^()\r\n]+ Match any character not present in the set ()\r\n one or more times
    • \( Match a left/opening parenthesis character ( literally
    • ((?1)|(?2)) Capture either of the following into capture group 3
      • (?1) Recurse the first subpattern (1)
      • (?2) Recurse the second subpattern (2)
    • \) Match a right/closing parenthesis character ) literally
    • (?3)* Recurse the third subpattern (3) any number of times
  • $ Assert position at the end of the line
Deitz answered 7/11, 2017 at 17:20 Comment(4)
thanks for the post and pointing out the error in the question! I didn't downvote you, but I suspect whomever did probably did so because your suggestions uses recursion, which is not novel and not the point of this discussion. Incidentally, the solution with forward references is OK, it's just the input that was not OK. The expression still correctly matches complete, properly balanced groups of parentheses (leaving out the additional ')'), as it's supposed to. To validate a line for properly balanced parenthetical groups, you can use this: regex101.com/r/Dfao1h/2Legion
yes, that was hastily made I'm afraid! This is corrected: regex101.com/r/Dfao1h/3 - I promise it's possible, and I urge you to go through the breakdown of the expression in my post and understand the method. I'm sure you'll understand how and why it works. The expression in the answer matches a full group, and the one I supplied just now validates (which is admittedly trickier).Legion
@Legion the last one does work. You should update your answer to include that regex instead of the one your answer currently includes.Deitz
I just wanted to demonstrate the simplest variation of this concept, ie. "match a group" rather than "validate a group" or "match a group with possibly quoted contents" (like the other comment), etc.Legion

© 2022 - 2024 — McMap. All rights reserved.