PHP: split string on comma, but NOT when between braces or quotes?

In PHP I have the following string:

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO"; 

I need to split this string into the following parts:

AAA
BBB
(CCC,DDD)
'EEE'
'FFF,GGG'
('HHH','III')
(('JJJ','KKK'), LLL, (MMM,NNN))
OOO

I tried several regexes, but couldn't find a solution. Any ideas?

UPDATE

I've decided that regex is not really the best solution when dealing with malformed data, escaped quotes, etc.

Thanks to the suggestions made here, I found a function that parses the string character by character, which I rewrote to suit my needs. It can handle different kinds of brackets, and the separator and quote characters are parameters as well.

 function explode_brackets($str, $separator=",", $leftbracket="(", $rightbracket=")", $quote="'", $ignore_escaped_quotes=true ) {

    $buffer = '';
    $stack = array();
    $depth = 0;
    $betweenquotes = false;
    $len = strlen($str);
    $char = ''; // initialise so $previouschar is defined on the first pass
    for ($i=0; $i<$len; $i++) {
      $previouschar = $char;
      $char = $str[$i];
      switch ($char) {
        case $separator:
          if (!$betweenquotes) {
            if (!$depth) {
              if ($buffer !== '') {
                $stack[] = $buffer;
                $buffer = '';
              }
              continue 2;
            }
          }
          break;
        case $quote:
          if ($ignore_escaped_quotes) {
            if ($previouschar!="\\") {
              $betweenquotes = !$betweenquotes;
            }
          } else {
            $betweenquotes = !$betweenquotes;
          }
          break;
        case $leftbracket:
          if (!$betweenquotes) {
            $depth++;
          }
          break;
        case $rightbracket:
          if (!$betweenquotes) {
            if ($depth) {
              $depth--;
            } else {
              $stack[] = $buffer.$char;
              $buffer = '';
              continue 2;
            }
          }
          break;
        }
        $buffer .= $char;
    }
    if ($buffer !== '') {
      $stack[] = $buffer;
    }

    return $stack;
  }
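For comparison, here is a more compact sketch of the same character-scanning idea. It is not the function above: the name `split_top_level` is made up for illustration, it supports a single bracket pair only, it trims whitespace around each piece (so the result matches the list in the question), and it assumes that a backslash inside a quoted string escapes the next character.

```php
<?php
// Hypothetical helper, not part of the original post: split on separators
// that are outside brackets and outside quoted strings.
function split_top_level(
    string $str,
    string $separator = ',',
    string $open = '(',
    string $close = ')',
    string $quote = "'"
): array {
    $parts = [];
    $buffer = '';
    $depth = 0;
    $inQuotes = false;
    $len = strlen($str);

    for ($i = 0; $i < $len; $i++) {
        $char = $str[$i];

        if ($inQuotes) {
            // Assumption: a backslash escapes the following character.
            if ($char === '\\' && $i + 1 < $len) {
                $buffer .= $char . $str[++$i];
                continue;
            }
            if ($char === $quote) {
                $inQuotes = false;
            }
        } elseif ($char === $quote) {
            $inQuotes = true;
        } elseif ($char === $open) {
            $depth++;
        } elseif ($char === $close) {
            if ($depth > 0) {
                $depth--;
            }
        } elseif ($char === $separator && $depth === 0) {
            // Top-level separator: emit the current piece and start a new one.
            $piece = trim($buffer);
            if ($piece !== '') {
                $parts[] = $piece;
            }
            $buffer = '';
            continue;
        }

        $buffer .= $char;
    }

    // Emit whatever is left after the last separator.
    $piece = trim($buffer);
    if ($piece !== '') {
        $parts[] = $piece;
    }

    return $parts;
}

print_r(split_top_level("AAA, (CCC,DDD), 'FFF,GGG'"));
```

Unbalanced brackets or an unterminated quote are not reported here; the remaining text simply ends up in the last piece, so a real implementation would probably want to check `$depth` and `$inQuotes` after the loop.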
Incursion answered 5/3, 2013 at 20:56 Comment(2)
How about this one: #1085264 - Philina
What if I have to do this in the MySQL layer, not PHP? - Fretted

Instead of a preg_split, do a preg_match_all:

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO"; 

preg_match_all("/\((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+/", $str, $matches);

print_r($matches);

will print:

Array
(
    [0] => Array
        (
            [0] => AAA
            [1] => BBB
            [2] => (CCC,DDD)
            [3] => 'EEE'
            [4] => 'FFF,GGG'
            [5] => ('HHH','III')
            [6] => (('JJJ','KKK'), LLL, (MMM,NNN))
            [7] => OOO
        )

)

The regex \((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+ can be divided into three parts:

  1. \((?:[^()]|(?R))+\), which matches balanced pairs of parentheses
  2. '[^']*' matching a quoted string
  3. [^(),\s]+ which matches any char-sequence not consisting of '(', ')', ',' or white-space chars
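As a rough sketch of how the three alternatives behave, the pattern can be wrapped in a small helper. The function name `match_top_level_tokens` is hypothetical and not part of the answer. Because the pattern extracts tokens rather than splitting on commas, any whitespace or comma outside brackets and quotes is simply dropped, whatever actually separates the tokens.

```php
<?php
// Hypothetical wrapper around the pattern from the answer above.
function match_top_level_tokens(string $str): array
{
    $pattern = "/\((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+/";
    preg_match_all($pattern, $str, $matches);

    // $matches[0] holds the full matches, one per token.
    return $matches[0] ?? [];
}

// Comma-separated input: one bare token, one bracket group, one quoted string.
print_r(match_top_level_tokens("AAA, (B,C), 'D,E'"));
// tokens: AAA / (B,C) / 'D,E'

// Space-separated input is tokenized the same way; spaces inside the
// brackets are kept, spaces between tokens are not.
print_r(match_top_level_tokens("x y (a b)"));
// tokens: x / y / (a b)
```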
Honshu answered 5/3, 2013 at 21:11 Comment(7)
While you can match, it generally provides no guarantee when it is run against a bad input string. - Chausses
Hi Bart, thanks a lot. Could you think of any way to make 'FFF,GGG' appear as 1 match? - Incursion
Thanks again, it works great now, so I'll accept your answer as the right one. But I still decided to use parsing in my project instead, because of the possibility of malformed input data and escaped quotes, see my update of the question. - Incursion
@Dylan: My solution is resistant against malformed input data, and can be modified to work with escaped quotes. But then again, it is not easily maintainable without deep regex knowledge, and cannot point out where exactly the syntax error is (it knows that an error is somewhere ahead, but not exactly where). Manual parsing is better in such cases. - Chausses
@BartKiers This answer looks like a good fit for my use case, but it doesn't work for me. Can you please help me out with this at #37184410? - Curtiscurtiss
Doesn't work on this input string as it's omitting the spaces: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 - Halakah
@CardinalSystem in the original string, it also omits chars: the commas and spaces are not included in that case. - Honshu

Crazy solution

A spartan regex that tokenizes and also validates all the tokens that it extracts:

\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()',\s]++\s*+(?(?!\)),)|\s*+'[^'\r\n]*+'\s*+(?(?!\)),))++\))|[^()',\s]++|'[^'\r\n]*+')\s*+(?:,|$)

Regex101

Put it in a string literal, with delimiters:

'/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/'

ideone

The result is in capturing group 1. In the example on ideone, I specify the PREG_OFFSET_CAPTURE flag, so that you can take the last match in group 0 (the entire match) and compare where it ends against the length of the source string, to check whether the entire string has been consumed or not.
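A minimal sketch of that check, assuming a hypothetical wrapper named `tokenize_strict` (not taken from the ideone example). It returns the tokens from group 1, or null when the chain of matches stops before the end of the input.

```php
<?php
// Hypothetical wrapper: tokenize and validate in one pass.
function tokenize_strict(string $str): ?array
{
    // The pattern from above, as a single-quoted PHP string literal.
    $pattern = '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/';

    $count = preg_match_all($pattern, $str, $matches, PREG_OFFSET_CAPTURE);
    if (!$count) {
        return null; // regex error, or no valid token at the very start
    }

    // With PREG_OFFSET_CAPTURE every entry is [matched text, byte offset].
    [$text, $offset] = $matches[0][$count - 1];
    if ($offset + strlen($text) !== strlen($str)) {
        return null; // matching stopped early: the rest of the input is invalid
    }

    // Keep only the text of capturing group 1 for each match.
    return array_column($matches[1], 0);
}

var_dump(tokenize_strict("AAA, (BBB, 'C,D')") !== null); // bool(true)
var_dump(tokenize_strict("AAA, (BBB, CCC") !== null);    // bool(false)
```

Because of the leading \G, each match has to start exactly where the previous one ended, so the first invalid token ends the chain and the offset comparison detects it.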

Assumptions

  • Non-quoted text may not contain any whitespace character, as defined by \s. Consequently, it may not span multiple lines.
  • Non-quoted text may not contain (, ), ' or ,.
  • Non-quoted text must contain at least 1 character.
  • Single quoted text may not span multiple lines.
  • Single quoted text may not contain a quote character. Consequently, there is no way to specify a literal ' inside quoted text.
  • Single quoted text may be empty.
  • Bracket token contains one or more of the following as sub-tokens: non-quoted text token, single quoted text token, or another bracket token.
  • In bracket token, 2 adjacent sub-tokens are separated by exactly one ,
  • Bracket token starts with ( and ends with ).
  • Consequently, a bracket token must have balanced brackets, and empty bracket () is not allowed.
  • Input will contain one or more of: non-quoted text, single quoted text or bracket token. The tokens in the input are separated with a comma ,. A single trailing comma , is considered valid.
  • Whitespace characters (as defined by \s, which includes the new line character) are arbitrarily allowed between token(s), the comma(s) , separating tokens, and the bracket(s) (, ) of the bracket tokens.

Breakdown

\G\s*+
(
  (
    \(
    (?:
        \s*+
        (?2)
        \s*+
        (?(?!\)),)
      |
        \s*+
        [^()',\s]++
        \s*+
        (?(?!\)),)
      |
        \s*+
        '[^'\r\n]*+'
        \s*+
        (?(?!\)),)
    )++
    \)
  )
  |
  [^()',\s]++
  |
  '[^'\r\n]*+'
)
\s*+(?:,|$)
Chausses answered 5/3, 2013 at 22:46 Comment(0)
