Split a string by commas which are not inside potentially nested parentheses

10

7

Two days ago I started working on a code parser and I'm stuck.

How can I split a string by commas that are not inside brackets? Let me show you what I mean.

I have this string to parse:

one, two, three, (four, (five, six), (ten)), seven

I would like to get this result:

array(
 "one"; 
 "two"; 
 "three"; 
 "(four, (five, six), (ten))"; 
 "seven"
)

but instead I get:

array(
  "one"; 
  "two"; 
  "three"; 
  "(four"; 
  "(five"; 
  "six)"; 
  "(ten))";
  "seven"
)

How can I do this with a regular expression in PHP?

Wirework answered 5/7, 2009 at 20:27 Comment(0)
13

You can do that more easily:

preg_match_all('/[^(,\s]+|\([^)]+\)/', $str, $matches);

But it would be better if you use a real parser. Maybe something like this:

$str = 'one, two, three, (four, (five, six), (ten)), seven';
$buffer = '';
$stack = array();
$depth = 0;
$len = strlen($str);
for ($i=0; $i<$len; $i++) {
    $char = $str[$i];
    switch ($char) {
    case '(':
        $depth++;
        break;
    case ',':
        if (!$depth) {
            if ($buffer !== '') {
                $stack[] = $buffer;
                $buffer = '';
            }
            continue 2;
        }
        break;
    case ' ':
        if (!$depth) {
            continue 2;
        }
        break;
    case ')':
        if ($depth) {
            $depth--;
        } else {
            $stack[] = $buffer.$char;
            $buffer = '';
            continue 2;
        }
        break;
    }
    $buffer .= $char;
}
if ($buffer !== '') {
    $stack[] = $buffer;
}
var_dump($stack);
Leftward answered 5/7, 2009 at 22:7 Comment(5)
Yes, it's easier, but doesn't work in case of nested brackets, like so: one, two, three, (four, (five, six), (ten)), seven – Wirework
That’s the point where you have to use a real parser. Regular expressions cannot count or handle states. – Leftward
I have to use regular expressions. Regular expressions are recursive and greedy, you can accomplish this using them. – Wirework
No you can’t. Sure, there are features in modern implementations that can accomplish that, such as .NET’s balancing groups (?<name1-name2> … ): msdn.microsoft.com/bs2twtah.aspx. But they use a state machine, and that’s no longer a regular expression in the classical sense. – Leftward
This one is more correct, but still not working for nested parentheses: /[^(,]*(?:\([^)]+\))?[^),]*/ – Kenley
6

Hm... OK already marked as answered, but since you asked for an easy solution I will try nevertheless:

$test = "one, two, three, , , ,(four, five, six), seven, (eight, nine)";
$split = "/([(].*?[)])|(\w)+/";
preg_match_all($split, $test, $out);
print_r($out[0]);              

Output

Array
(
    [0] => one
    [1] => two
    [2] => three
    [3] => (four, five, six)
    [4] => seven
    [5] => (eight, nine)
)
Seriate answered 5/7, 2009 at 21:54 Comment(1)
Thank you very much, your help is much appreciated. But now I realize that I will also encounter nested brackets and your solution doesn't apply. – Wirework
4

You can't, directly. You'd need, at minimum, variable-width lookbehind, and last I knew PHP's PCRE only has fixed-width lookbehind.

My first recommendation would be to first extract parenthesized expressions from the string. I don't know anything about your actual problem, though, so I don't know if that will be feasible.

Shedevil answered 5/7, 2009 at 20:36 Comment(2)
Yes, that was the hack I was planning to use. Replace the brackets with $1, $2 or something similar, split the string and then restore the brackets in the result. Thank you! – Wirework
The point is that what you describe is not a regular language, so regular expressions are an ill fit. So, parsing out all the nested parts first is not a "hack" but the most sensible thing to do. – Fishwife
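The approach discussed in the comments above (hide the parenthesized parts, split, then restore them) can be sketched roughly as follows. This is only an illustration, not code from any of the answers: the function name split_with_placeholders and the NUL-delimited tokens are hypothetical, and the sketch assumes the input never contains NUL bytes. It uses PCRE's (?R) recursion so that a nested group is hidden as a single unit.

```php
<?php
// Hypothetical helper: hide balanced groups, split on commas, restore groups.
function split_with_placeholders(string $input): array
{
    $groups = [];
    // (?R) recurses into the whole pattern, so "(a, (b, c))" is one match.
    $masked = preg_replace_callback(
        '/\((?:[^()]++|(?R))*\)/',
        function (array $m) use (&$groups): string {
            // Assumption: NUL bytes never occur in the input itself.
            $token = "\x00" . count($groups) . "\x00";
            $groups[$token] = $m[0];
            return $token;
        },
        $input
    );

    // Balanced groups are hidden now, so their commas cannot split anything.
    // Restore before trimming, because trim() would strip the NUL delimiters.
    return array_map(
        fn(string $part): string => trim(strtr($part, $groups)),
        explode(',', $masked)
    );
}

print_r(split_with_placeholders('one, two, three, (four, (five, six), (ten)), seven'));
```

An unbalanced opening parenthesis is simply not hidden, so the commas after it split as usual.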
2

I can't think of a way to do it using a single regex, but it's quite easy to hack together something that works:

function process($data)
{
        $entries = array();
        $filteredData = $data;
        if (preg_match_all("/\(([^)]*)\)/", $data, $matches)) {
                $entries = $matches[0];
                $filteredData = preg_replace("/\(([^)]*)\)/", "-placeholder-", $data);
        }

        $arr = array_map("trim", explode(",", $filteredData));

        if (!$entries) {
                return $arr;
        }

        $j = 0;
        foreach ($arr as $i => $entry) {
                if ($entry != "-placeholder-") {
                        continue;
                }

                $arr[$i] = $entries[$j];
                $j++;
        }

        return $arr;
}

If you invoke it like this:

$data = "one, two, three, (four, five, six), seven, (eight, nine)";
print_r(process($data));

It outputs:

Array
(
    [0] => one
    [1] => two
    [2] => three
    [3] => (four, five, six)
    [4] => seven
    [5] => (eight, nine)
)
Cautery answered 5/7, 2009 at 20:53 Comment(3)
Thank you very much, this should work. This was how I planned to do it first, but I thought that an easier way exists. – Wirework
Your method cannot parse "one, two, three, ((five), (four(six))), seven, eight, nine". I think the correct RegEx would be a recursive one: /\(([^()]+|(?R))*\)/. – Wirework
You didn't mention that it had to be able to parse recursive expressions back when I first wrote this answer, though. Still, others have definitely suggested better solutions after I wrote this. – Cautery
2

Maybe a bit late, but I've made a solution without regex which also supports nesting inside brackets. Let me know what you think:

$str = "Some text, Some other text with ((95,3%) MSC)";
$arr = explode(",",$str);

$parts = [];
$currentPart = "";
$bracketsOpened = 0;
foreach ($arr as $part){
    $currentPart .= ($bracketsOpened > 0 ? ',' : '').$part;
    if (stristr($part,"(")){
        $bracketsOpened ++;
    }
    if (stristr($part,")")){
        $bracketsOpened --;                 
    }
    if (!$bracketsOpened){
        $parts[] = $currentPart;
        $currentPart = '';
    }
}
print_r($parts);

Gives me the output:

Array
(
    [0] => Some text
    [1] =>  Some other text with ((95,3%) MSC)
)
Hurry answered 15/4, 2022 at 13:9 Comment(1)
Even if I explode on comma space, I get a few broken results with this snippet when I offer some challenging parenthetical nesting. 3v4l.org/Cvmi8 – Kampong
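One likely reason for the broken results is that stristr() only reports whether a chunk contains a parenthesis, not how many. A minimal sketch of the same explode-and-rejoin idea that counts every parenthesis with substr_count() could look like this. The function name split_outside_parens is hypothetical, and the sketch assumes reasonably balanced input; a chunk that closes a stray parenthesis before opening a new one would still confuse it.

```php
<?php
// Hypothetical variant: track the net parenthesis depth per comma-separated chunk.
function split_outside_parens(string $str): array
{
    $parts = [];
    $current = '';
    $depth = 0;
    foreach (explode(',', $str) as $chunk) {
        // Re-insert the comma that explode() removed while still inside a group.
        $current .= ($depth > 0 ? ',' : '') . $chunk;
        // Count all opening and closing parentheses, not just their presence.
        $depth += substr_count($chunk, '(') - substr_count($chunk, ')');
        if ($depth <= 0) {
            $parts[] = trim($current);
            $current = '';
            $depth = 0; // ignore stray closing parentheses
        }
    }
    if ($current !== '') {
        // Unbalanced "(": keep whatever is left as a single item.
        $parts[] = trim($current);
    }
    return $parts;
}

print_r(split_outside_parens('Some text, Some other text with ((95,3%) MSC)'));
```

Unlike the snippet above, this version also trims surrounding whitespace from each item.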
1

Clumsy, but it does the job...

<?php

function split_by_commas($string) {
  preg_match_all("/\(.+?\)/", $string, $result); 
  $problem_children = $result[0];
  $i = 0;
  $temp = array();
  foreach ($problem_children as $submatch) { 
    $marker = '__'.$i++.'__';
    $temp[$marker] = $submatch;
    $string   = str_replace($submatch, $marker, $string);  
  }
  $result = explode(",", $string);
  foreach ($result as $key => $item) {
    $item = trim($item);
    $result[$key] = isset($temp[$item])?$temp[$item]:$item;
  }
  return $result;
}


$test = "one, two, three, (four, five, six), seven, (eight, nine), ten";

print_r(split_by_commas($test));

?>
Huebner answered 5/7, 2009 at 21:7 Comment(0)
1

I feel that it's worth noting that you should avoid regular expressions whenever you possibly can. To that end, you should know that in PHP 5.3+ you can use str_getcsv(). However, if you're working with files (or file streams), such as CSV files, then fgetcsv() might be what you need, and it's been available since PHP 4.

Lastly, I'm surprised nobody used preg_split(), or did it not work as needed?

Outfield answered 6/7, 2009 at 6:6 Comment(3)
Yes ken, I want to use preg_split(), but what would be the RegEx that ignores commas in brackets? – Wirework
Ah yes, good point. After trying for a minute or two I can see that it's challenging with the conditions set forth. – Outfield
Yeah, you are right. I also tried your solution and it doesn't work. Thank you still. – Wirework
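For the preg_split() question raised in these comments, one possible pattern relies on PCRE's backtracking control verbs: match a balanced group, throw it away with (*SKIP)(*FAIL), and split only on the commas that remain. This is a sketch rather than anything proposed in the thread. It assumes a PCRE build that supports those verbs and the (?1) subroutine call, which PHP's bundled PCRE does.

```php
<?php
// First alternative: a balanced, possibly nested group. (*SKIP)(*FAIL) rejects
// it and resumes scanning after it, so commas inside are never split on.
// Second alternative: a comma with optional surrounding whitespace.
$pattern = '/(\((?:[^()]++|(?1))*\))(*SKIP)(*FAIL)|\s*,\s*/';

$str = 'one, two, three, (four, (five, six), (ten)), seven';
$result = preg_split($pattern, $str);

print_r($result);
```

An unbalanced parenthesis does not match the first alternative, so commas after it are treated as ordinary separators.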
0

I am afraid that it could be very difficult to parse nested brackets like one, two, (three, (four, five)) only with RegExp.

Instanter answered 5/7, 2009 at 20:32 Comment(1)
This looks more like a comment than an answer. Where is the resolving advice? – Kampong
0

This one is more correct, but still not working for nested parentheses: /[^(,]*(?:\([^)]+\))?[^),]*/
– DarkSide Mar 24, 2013 at 23:09

Your method cannot parse "one, two, three, ((five), (four(six))), seven, eight, nine". I think the correct RegEx would be a recursive one: /\(([^()]+|(?R))*\)/.
– Cristian Toma Jul 6, 2009 at 7:26

Yes, it's easier, but doesn't work in case of nested brackets, like so: one, two, three, (four, (five, six), (ten)), seven
– Cristian Toma Jul 6, 2009 at 7:41

Thank you very much, your help is much appreciated. But now I realize that I will also encounter nested brackets and your solution doesn't apply.
– Cristian Toma Jul 6, 2009 at 7:43

Sounds to me that we need to have a string splitting algorithm that respects balanced parenthetical grouping. I'll give that a crack using a recursive regex pattern! The behavior will be to respect the lowest balanced parentheticals and let any higher level un-balanced parentheticals be treated as non-grouping characters. Please leave a comment with any input strings that are not correctly split so that I can try to make improvements (test driven development).

Code: (Demo)

$tests = [
    'one, two, three, (four, five, six), seven, (eight, nine)',
    '()',
    'one and a ),',
    '(one, two, three)',
    'one, (, two',
    'one, two, ), three',
    'one, (unbalanced, (nested, whoops ) two',
    'one, two, three and a half, ((five), (four(six))), seven, eight, nine',
    'one, (two, (three and a half, (four, (five, (six, seven), eight)))), nine, (ten, twenty twen twen)',
    'ten, four, (,), good buddy',
];

foreach ($tests as $test) {
    var_export(
        preg_split(
            '/(?>(\((?:(?>[^()]+)|(?1))*\))|[^,]+)\K,?\s*/',
            $test,
            0,
            PREG_SPLIT_NO_EMPTY
        )
    );
    echo "\n";
}

Output:

array (
  0 => 'one',
  1 => 'two',
  2 => 'three',
  3 => '(four, five, six)',
  4 => 'seven',
  5 => '(eight, nine)',
)
array (
  0 => '()',
)
array (
  0 => 'one and a )',
)
array (
  0 => '(one, two, three)',
)
array (
  0 => 'one',
  1 => '(',
  2 => 'two',
)
array (
  0 => 'one',
  1 => 'two',
  2 => ')',
  3 => 'three',
)
array (
  0 => 'one',
  1 => '(unbalanced',
  2 => '(nested, whoops )',
  3 => 'two',
)
array (
  0 => 'one',
  1 => 'two',
  2 => 'three and a half',
  3 => '((five), (four(six)))',
  4 => 'seven',
  5 => 'eight',
  6 => 'nine',
)
array (
  0 => 'one',
  1 => '(two, (three and a half, (four, (five, (six, seven), eight))))',
  2 => 'nine',
  3 => '(ten, twenty twen twen)',
)
array (
  0 => 'ten',
  1 => 'four',
  2 => '(,)',
  3 => 'good buddy',
)

Here's a related answer which recursively traverses parenthetical groups and reverses the order of comma separated values on each level: Reverse the order of parenthetically grouped text and reverse the order of parenthetical groups

Kampong answered 5/6 at 2:9 Comment(0)
0

This can be done with a recursive pattern (?R):

$regex = "/[(](?:[^()]*(?R))*[^()]*[)]|(?=\\S)[^,()]+(?<=\\S)/";

// Example
$s = "one, two, three, (four, (five, six), (ten)), seven";
preg_match_all($regex, $s, $matches);

$matches[0] will be:

[
    "one",
    "two",
    "three",
    "(four, (five, six), (ten))",
    "seven"
]

Explanation:

The second part of the pattern defines the basic case:

(?=\\S)[^,()]+(?<=\\S)
  • [^,()]+: a sequence of characters that exclude ,, (, and ).
  • (?=\\S): asserts that the above mentioned sequence starts with a non-whitespace character
  • (?<=\\S): asserts that the above mentioned sequence ends with a non-whitespace character

The first part of the regex deals with the parentheses:

[(](?:[^()]*(?R))*[^()]*[)]
  • [(]: the match must start with an opening parenthesis
  • [)]: the match must end with a closing parenthesis
  • [^()]*: a possibly empty sequence of characters that are not parentheses. Note that commas are matched here.
  • (?R): applying the complete regex pattern recursively
  • (?:[^()]*(?R))*: zero or more repetitions of a sequence of characters that are not parentheses, followed by a nested expression.

So it matches something between parentheses, where that something can have any number of nested matches, alternated by non-parentheses sequences.

Carcass answered 30/6 at 11:50 Comment(1)
I'm noticing that in scenarios where there are unbalanced parentheses, the unmatched parentheses are lost. I'm not saying that this is a failure in your pattern (because this goes beyond the scope of the question); I just want researchers to understand the behavior. 3v4l.org/IHQFf – Kampong
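To make both behaviors concrete, here is a small sketch that runs the pattern from this answer on a more deeply nested input and on an input with an unbalanced parenthesis. The two test strings and the variable names $nested and $unbalanced are chosen here for illustration only.

```php
<?php
$regex = "/[(](?:[^()]*(?R))*[^()]*[)]|(?=\\S)[^,()]+(?<=\\S)/";

// Deeper nesting is still returned as one item, because (?R) re-applies
// the whole pattern inside each pair of parentheses.
preg_match_all($regex, 'one, ((five), (four(six))), seven', $nested);
print_r($nested[0]);

// A lone "(" matches neither alternative, so it is silently dropped
// from the result, as the comment above points out.
preg_match_all($regex, 'one, (, two', $unbalanced);
print_r($unbalanced[0]);
```

If unbalanced parentheses need to survive, a splitting approach (preg_split() or a character scan) may be a better fit than collecting matches.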
