How to write a recursive regex that matches nested parentheses?
Asked Answered
B

3

11

I am trying to write a regexp which matches nested parentheses, e.g.:

"(((text(text))))(text()()text)(casual(characters(#$%^^&&#^%#@!&**&#^*!@#^**_)))"

A string like this should be matched, cause all the nested parentheses are closed, instead:

"(((text)))(text)(casualChars*#(!&#*(!))"

Should not, or better, should match at least the first "(((text)))(text)" part.

Actually, my regexp is:

 $regex = '/( (  (\() ([^[]*?)  (?R)?  (\))  ){0,}) /x';

But it does not work properly as I am expecting. How to fix that? Where am I wrong? Thanks!

Balaklava answered 13/12, 2013 at 14:49 Comment(6)
I wrote a parser for SQL that needed to recursively do this. It is far easier to have recursive functions with a regex, than try to do this recursively with regex alone.Wedlock
You're barking up the wrong tree, a purely regex solution will probably be overly complicated and difficult to maintain. You'd be better recursively parsing the string.Zantos
Don't... Ok, in theory it can be done, but when you manage to do it, it will probably look like gliberish. Oh, look we found a bug in the regex! Ern... how do you fix that? Oh, we need to add support for brakets too! Ern... how do you add that? I tell you, you better use a more human readable parser. The fact that you are asking this, shows that you will probably be UNABLE to maintain it anyway.Berkie
Thank you for your advices, but I still want to do that anyway, could you help me? Why isn't my regexp work as I am expecting?Balaklava
@user3019105 What do you want to do with the matches exactly? (are you just validating, do you want to replace the content on the parenthesis, or just run a callback on each one) Also, do you want only the deepest parenthesis or you want them all?Berkie
I am trying to trim off the comments of an email address in respecting of the RFC 5321/5322, which describe the standard of an email address. I want to catch in a $matches[1] all the couples of nested parentheses, but I want the regexp to match only until the last ")" correctly closed bracket, so if there are other mismatched parentheses ahead, the regexp stop matching.Balaklava
B
4

When I found this answer I wasn't able to figure out how to modify the pattern to work with my own delimiters which where { and }. So my approach was to make it more generic.

Here is a script to generate the regex pattern with your own variable left and right delimiters.

$delimiter_wrap  = '~';
$delimiter_left  = '{';/* put YOUR left delimiter here.  */
$delimiter_right = '}';/* put YOUR right delimiter here. */

$delimiter_left  = preg_quote( $delimiter_left,  $delimiter_wrap );
$delimiter_right = preg_quote( $delimiter_right, $delimiter_wrap );
$pattern         = $delimiter_wrap . $delimiter_left
                 . '((?:[^' . $delimiter_left . $delimiter_right . ']++|(?R))*)'
                 . $delimiter_right . $delimiter_wrap;

/* Now you can use the generated pattern. */
preg_match_all( $pattern, $subject, $matches );
Bickering answered 13/7, 2016 at 20:7 Comment(1)
Nice, you match the whole string provided that there's always a $delimiter_right closing an opened $delimiter_leftBalaklava
H
13

This pattern works:

$pattern = '~ \( (?: [^()]+ | (?R) )*+ \) ~x';

The content inside parenthesis is simply describe:

"all that is not parenthesis OR recursion (= other parenthesis)" x 0 or more times

If you want to catch all substrings inside parenthesis, you must put this pattern inside a lookahead to obtain all overlapping results:

$pattern = '~(?= ( \( (?: [^()]+ | (?1) )*+ \) ) )~x';
preg_match_all($pattern, $subject, $matches);
print_r($matches[1]);

Note that I have added a capturing group and I have replaced (?R) by (?1):

(?R) -> refers to the whole pattern (You can write (?0) too)
(?1) -> refers to the first capturing group

What is this lookahead trick?

A subpattern inside a lookahead (or a lookbehind) doesn't match anything, it's only an assertion (a test). Thus, it allows to check the same substring several times.

If you display the whole pattern results (print_r($matches[0]);), you will see that all results are empty strings. The only way to obtain the substrings found by the subpattern inside the lookahead is to enclose the subpattern in a capturing group.

Note: the recursive subpattern can be improved like this:

\( [^()]*+ (?: (?R) [^()]* )*+ \)
Heroine answered 13/12, 2013 at 14:53 Comment(10)
I have tried it but it does not work, and also I do not have the sub-pattern catched...Is there another way?Balaklava
Thanks for the explanation. Need to play around with it in order to understand.Hayley
Sorry for my ignorance,"(?1)" is the same as "\1"? Is it a backreference? And another question: what does double "++" mean after the [^()] class?Balaklava
@user3019105: No it's different. \1 refers to the matched content by the capturing group 1. (?1) refers to the subpattern inside the capturing group 1. You only repeat the subpattern.Heroine
@user3019105: When you put (?1) inside the first capturing group itself, you obtain a recursive subpattern.Heroine
Thank you! So it's the first sub-pattern inside the $matches[1] variable in this specific case, am I right? And about "++" and "*+" what are they? I see them for the first time...is there a reference or something or could you please explain them to me? Thank you!Balaklava
@user3019105: $matches[1] contains the first capturing group results. About ++, *+, ?+, {1,3}+, these are possessive quantifiers. You can find more informations here: regular-expressions.info/possessive.htmlHeroine
So are possessive quantifiers as "Once-only subpatterns" described in the PHP Manual php.net/manual/en/regexp.reference.onlyonce.php?Balaklava
Big +1 - Thanks for the lookahead trick - Very Clever! (learn something new every day.)Spoony
This answer has been added to the Stack Overflow Regular Expression FAQ, under "Control Verbs and Recursion".Agincourt
B
4

When I found this answer I wasn't able to figure out how to modify the pattern to work with my own delimiters which where { and }. So my approach was to make it more generic.

Here is a script to generate the regex pattern with your own variable left and right delimiters.

$delimiter_wrap  = '~';
$delimiter_left  = '{';/* put YOUR left delimiter here.  */
$delimiter_right = '}';/* put YOUR right delimiter here. */

$delimiter_left  = preg_quote( $delimiter_left,  $delimiter_wrap );
$delimiter_right = preg_quote( $delimiter_right, $delimiter_wrap );
$pattern         = $delimiter_wrap . $delimiter_left
                 . '((?:[^' . $delimiter_left . $delimiter_right . ']++|(?R))*)'
                 . $delimiter_right . $delimiter_wrap;

/* Now you can use the generated pattern. */
preg_match_all( $pattern, $subject, $matches );
Bickering answered 13/7, 2016 at 20:7 Comment(1)
Nice, you match the whole string provided that there's always a $delimiter_right closing an opened $delimiter_leftBalaklava
B
0

The following code uses my Parser class (it's under CC-BY 3.0), it works on UTF-8 (thanks to my UTF8 class).

The way it works is by using a recursive function to iterate over the string. It will call itself each time it finds a (. It will also detect missmatched pairs when it reaches the end of the string without finding the corresponding ).

Also, this code takes a $callback parameter you can use to process each piece it finds. The callback recieves two parameters: 1) the string, and 2) the level (0 = deepest). Whatever the callback returns will be replaced in the contents of the string (this changes are visible at callback of higher level).

Note: the code does not includes type checks.

Non-recursive part:

function ParseParenthesis(/*string*/ $string, /*function*/ $callback)
{
    //Create a new parser object
    $parser = new Parser($string);
    //Call the recursive part
    $result = ParseParenthesisFragment($parser, $callback);
    if ($result['close'])
    {
        return $result['contents'];
    }
    else
    {
        //UNEXPECTED END OF STRING
        // throw new Exception('UNEXPECTED END OF STRING');
        return false;
    }
}

Recursive part:

function ParseParenthesisFragment(/*parser*/ $parser, /*function*/ $callback)
{
    $contents = '';
    $level = 0;
    while(true)
    {
        $parenthesis = array('(', ')');
        // Jump to the first/next "(" or ")"
        $new = $parser->ConsumeUntil($parenthesis);
        $parser->Flush(); //<- Flush is just an optimization
        // Append what we got so far
        $contents .= $new;
        // Read the "(" or ")"
        $element = $parser->Consume($parenthesis);
        if ($element === '(') //If we found "("
        {
            //OPEN
            $result = ParseParenthesisFragment($parser, $callback);
            if ($result['close'])
            {
                // It was closed, all ok
                // Update the level of this iteration
                $newLevel = $result['level'] + 1;
                if ($newLevel > $level)
                {
                    $level = $newLevel;
                }
                // Call the callback
                $new = call_user_func
                (
                    $callback,
                    $result['contents'],
                    $level
                );
                // Append what we got
                $contents .= $new;
            }
            else
            {
                //UNEXPECTED END OF STRING
                // Don't call the callback for missmatched parenthesis
                // just append and return
                return array
                (
                    'close' => false,
                    'contents' => $contents.$result['contents']
                );
            }
        }
        else if ($element == ')') //If we found a ")"
        {
            //CLOSE
            return array
            (
                'close' => true,
                'contents' => $contents,
                'level' => $level
            );
        }
        else if ($result['status'] === null)
        {
            //END OF STRING
            return array
            (
                'close' => false,
                'contents' => $contents
            );
        }
    }
}
Berkie answered 13/12, 2013 at 15:50 Comment(3)
Light years behind posting function like thisAmaty
@MartinZvarík thanks for bringing my attention to this answer, I have fixed the links. This is meant to be easier to maintain than regex.Berkie
I also love how you mentioned that it is under Creative Common license :DD ... Bro... just erase the whole thing and we will forget it was ever posted here... 5 years later you must know betterAmaty

© 2022 - 2024 — McMap. All rights reserved.