How to write a recursive regex that matches nested parentheses?

Asked 13/12, 2013 at 14:49 Answered 13/7, 2016 at 20:7

I am trying to write a regexp which matches nested parentheses, e.g.:

"(((text(text))))(text()()text)(casual(characters(#$%^^&&#^%#@!&**&#^*!@#^**_)))"

A string like this should be matched, cause all the nested parentheses are closed, instead:

"(((text)))(text)(casualChars*#(!&#*(!))"

Should not, or better, should match at least the first "(((text)))(text)" part.

Actually, my regexp is:

 $regex = '/( (  (\() ([^[]*?)  (?R)?  (\))  ){0,}) /x';

But it does not work properly as I am expecting. How to fix that? Where am I wrong? Thanks!

Balaklava answered 13/12, 2013 at 14:49 Comment(6)

I wrote a parser for SQL that needed to recursively do this. It is far easier to have recursive functions with a regex, than try to do this recursively with regex alone. – Wedlock 13/12, 2013 at 14:52

You're barking up the wrong tree, a purely regex solution will probably be overly complicated and difficult to maintain. You'd be better recursively parsing the string. – Zantos 13/12, 2013 at 14:52

Don't... Ok, in theory it can be done, but when you manage to do it, it will probably look like gliberish. Oh, look we found a bug in the regex! Ern... how do you fix that? Oh, we need to add support for brakets too! Ern... how do you add that? I tell you, you better use a more human readable parser. The fact that you are asking this, shows that you will probably be UNABLE to maintain it anyway. – Berkie 13/12, 2013 at 14:53

Thank you for your advices, but I still want to do that anyway, could you help me? Why isn't my regexp work as I am expecting? – Balaklava 13/12, 2013 at 15:3

@user3019105 What do you want to do with the matches exactly? (are you just validating, do you want to replace the content on the parenthesis, or just run a callback on each one) Also, do you want only the deepest parenthesis or you want them all? – Berkie 13/12, 2013 at 15:14

I am trying to trim off the comments of an email address in respecting of the RFC 5321/5322, which describe the standard of an email address. I want to catch in a $matches[1] all the couples of nested parentheses, but I want the regexp to match only until the last ")" correctly closed bracket, so if there are other mismatched parentheses ahead, the regexp stop matching. – Balaklava 13/12, 2013 at 15:29

When I found this answer I wasn't able to figure out how to modify the pattern to work with my own delimiters which where { and }. So my approach was to make it more generic.

Here is a script to generate the regex pattern with your own variable left and right delimiters.

$delimiter_wrap  = '~';
$delimiter_left  = '{';/* put YOUR left delimiter here.  */
$delimiter_right = '}';/* put YOUR right delimiter here. */

$delimiter_left  = preg_quote( $delimiter_left,  $delimiter_wrap );
$delimiter_right = preg_quote( $delimiter_right, $delimiter_wrap );
$pattern         = $delimiter_wrap . $delimiter_left
                 . '((?:[^' . $delimiter_left . $delimiter_right . ']++|(?R))*)'
                 . $delimiter_right . $delimiter_wrap;

/* Now you can use the generated pattern. */
preg_match_all( $pattern, $subject, $matches );

Bickering answered 13/7, 2016 at 20:7 Comment(1)

Nice, you match the whole string provided that there's always a $delimiter_right closing an opened $delimiter_left – Balaklava 14/7, 2016 at 7:30

This pattern works:

$pattern = '~ \( (?: [^()]+ | (?R) )*+ \) ~x';

The content inside parenthesis is simply describe:

"all that is not parenthesis OR recursion (= other parenthesis)" x 0 or more times

If you want to catch all substrings inside parenthesis, you must put this pattern inside a lookahead to obtain all overlapping results:

$pattern = '~(?= ( \( (?: [^()]+ | (?1) )*+ \) ) )~x';
preg_match_all($pattern, $subject, $matches);
print_r($matches[1]);

Note that I have added a capturing group and I have replaced (?R) by (?1):

(?R) -> refers to the whole pattern (You can write (?0) too)
(?1) -> refers to the first capturing group

What is this lookahead trick?

A subpattern inside a lookahead (or a lookbehind) doesn't match anything, it's only an assertion (a test). Thus, it allows to check the same substring several times.

If you display the whole pattern results (print_r($matches[0]);), you will see that all results are empty strings. The only way to obtain the substrings found by the subpattern inside the lookahead is to enclose the subpattern in a capturing group.

Note: the recursive subpattern can be improved like this:

\( [^()]*+ (?: (?R) [^()]* )*+ \)

Heroine answered 13/12, 2013 at 14:53 Comment(10)

I have tried it but it does not work, and also I do not have the sub-pattern catched...Is there another way? – Balaklava 13/12, 2013 at 15:1

Thanks for the explanation. Need to play around with it in order to understand. – Hayley 13/12, 2013 at 15:8

Sorry for my ignorance,"(?1)" is the same as "\1"? Is it a backreference? And another question: what does double "++" mean after the [^()] class? – Balaklava 13/12, 2013 at 15:34

@user3019105: No it's different. \1 refers to the matched content by the capturing group 1. (?1) refers to the subpattern inside the capturing group 1. You only repeat the subpattern. – Heroine 13/12, 2013 at 15:50

@user3019105: When you put (?1) inside the first capturing group itself, you obtain a recursive subpattern. – Heroine 13/12, 2013 at 15:56

Thank you! So it's the first sub-pattern inside the $matches[1] variable in this specific case, am I right? And about "++" and "*+" what are they? I see them for the first time...is there a reference or something or could you please explain them to me? Thank you! – Balaklava 13/12, 2013 at 15:59

@user3019105: $matches[1] contains the first capturing group results. About ++, *+, ?+, {1,3}+, these are possessive quantifiers. You can find more informations here: regular-expressions.info/possessive.html – Heroine 13/12, 2013 at 16:5

So are possessive quantifiers as "Once-only subpatterns" described in the PHP Manual php.net/manual/en/regexp.reference.onlyonce.php? – Balaklava 13/12, 2013 at 16:57

Big +1 - Thanks for the lookahead trick - Very Clever! (learn something new every day.) – Spoony 13/12, 2013 at 18:25

This answer has been added to the Stack Overflow Regular Expression FAQ, under "Control Verbs and Recursion". – Agincourt 10/4, 2014 at 1:10

When I found this answer I wasn't able to figure out how to modify the pattern to work with my own delimiters which where { and }. So my approach was to make it more generic.

Here is a script to generate the regex pattern with your own variable left and right delimiters.

$delimiter_wrap  = '~';
$delimiter_left  = '{';/* put YOUR left delimiter here.  */
$delimiter_right = '}';/* put YOUR right delimiter here. */

$delimiter_left  = preg_quote( $delimiter_left,  $delimiter_wrap );
$delimiter_right = preg_quote( $delimiter_right, $delimiter_wrap );
$pattern         = $delimiter_wrap . $delimiter_left
                 . '((?:[^' . $delimiter_left . $delimiter_right . ']++|(?R))*)'
                 . $delimiter_right . $delimiter_wrap;

/* Now you can use the generated pattern. */
preg_match_all( $pattern, $subject, $matches );

Bickering answered 13/7, 2016 at 20:7 Comment(1)

Nice, you match the whole string provided that there's always a $delimiter_right closing an opened $delimiter_left – Balaklava 14/7, 2016 at 7:30

The following code uses my Parser class (it's under CC-BY 3.0), it works on UTF-8 (thanks to my UTF8 class).

The way it works is by using a recursive function to iterate over the string. It will call itself each time it finds a (. It will also detect missmatched pairs when it reaches the end of the string without finding the corresponding ).

Also, this code takes a $callback parameter you can use to process each piece it finds. The callback recieves two parameters: 1) the string, and 2) the level (0 = deepest). Whatever the callback returns will be replaced in the contents of the string (this changes are visible at callback of higher level).

Note: the code does not includes type checks.

Non-recursive part:

function ParseParenthesis(/*string*/ $string, /*function*/ $callback)
{
    //Create a new parser object
    $parser = new Parser($string);
    //Call the recursive part
    $result = ParseParenthesisFragment($parser, $callback);
    if ($result['close'])
    {
        return $result['contents'];
    }
    else
    {
        //UNEXPECTED END OF STRING
        // throw new Exception('UNEXPECTED END OF STRING');
        return false;
    }
}

Recursive part:

function ParseParenthesisFragment(/*parser*/ $parser, /*function*/ $callback)
{
    $contents = '';
    $level = 0;
    while(true)
    {
        $parenthesis = array('(', ')');
        // Jump to the first/next "(" or ")"
        $new = $parser->ConsumeUntil($parenthesis);
        $parser->Flush(); //<- Flush is just an optimization
        // Append what we got so far
        $contents .= $new;
        // Read the "(" or ")"
        $element = $parser->Consume($parenthesis);
        if ($element === '(') //If we found "("
        {
            //OPEN
            $result = ParseParenthesisFragment($parser, $callback);
            if ($result['close'])
            {
                // It was closed, all ok
                // Update the level of this iteration
                $newLevel = $result['level'] + 1;
                if ($newLevel > $level)
                {
                    $level = $newLevel;
                }
                // Call the callback
                $new = call_user_func
                (
                    $callback,
                    $result['contents'],
                    $level
                );
                // Append what we got
                $contents .= $new;
            }
            else
            {
                //UNEXPECTED END OF STRING
                // Don't call the callback for missmatched parenthesis
                // just append and return
                return array
                (
                    'close' => false,
                    'contents' => $contents.$result['contents']
                );
            }
        }
        else if ($element == ')') //If we found a ")"
        {
            //CLOSE
            return array
            (
                'close' => true,
                'contents' => $contents,
                'level' => $level
            );
        }
        else if ($result['status'] === null)
        {
            //END OF STRING
            return array
            (
                'close' => false,
                'contents' => $contents
            );
        }
    }
}

Berkie answered 13/12, 2013 at 15:50 Comment(3)

Light years behind posting function like this – Amaty 28/11, 2018 at 16:55

@MartinZvarík thanks for bringing my attention to this answer, I have fixed the links. This is meant to be easier to maintain than regex. – Berkie 28/11, 2018 at 23:17

I also love how you mentioned that it is under Creative Common license :DD ... Bro... just erase the whole thing and we will forget it was ever posted here... 5 years later you must know better – Amaty 29/11, 2018 at 1:53

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Here is a script to generate the regex pattern with your own variable left and right delimiters.

Here is a script to generate the regex pattern with your own variable left and right delimiters.

Recommended topics

Hot tags