split string by spaces and colon but not if inside quotes

Asked 9/10, 2015 at 14:52 Answered 9/10, 2015 at 16:53

Solved php regex preg-match preg-match-all preg-split

Given a string like this:

$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";

the desired result is:

[0] => Array (
    [0] => dateto:'2015-10-07 15:05'
    [1] => xxxx
    [2] => datefrom:'2015-10-09 15:05'
    [3] => yyyy
    [4] => asdf
)

what I get with:

preg_match_all("/\'(?:[^()]|(?R))+\'|'[^']*'|[^(),\s]+/", $str, $m);

is:

[0] => Array (
    [0] => dateto:'2015-10-07
    [1] => 15:05'
    [2] => xxxx
    [3] => datefrom:'2015-10-09
    [4] => 15:05'
    [5] => yyyy
    [6] => asdf
)

I also tried preg_split("/[\s]+/", $str), but I have no clue how to avoid splitting when the space is between quotes. Can anyone show me how, and please also explain the regex? Thank you!

Varistor answered 9/10, 2015 at 14:52 Comment(0)

I would use the PCRE verbs (*SKIP)(*F):

preg_split("~'[^']*'(*SKIP)(*F)|\s+~", $str);

Rigney answered 9/10, 2015 at 14:55 Comment(2)

Thank you! Would you mind explaining "~'[^']*'(*SKIP)(*F)|\s+~"? I only understand parts of it, and I would like to understand all of it. – Varistor 9/10, 2015 at 15:4

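'[^']*' matches a whole single-quoted block, and the (*SKIP)(*F) that follows makes that match fail and tells the engine not to retry inside the block. The alternative |\s+ then matches the remaining whitespace, i.e. only the whitespace outside the quotes. – Rigney 9/10, 2015 at 15:8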
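
For reference, a minimal runnable sketch of the one-liner above, applied to the sample string from the question. The variable name $parts is only illustrative.

```php
<?php
// Sample string from the question.
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";

// First branch: consume a whole '...' block, then (*SKIP)(*F) discards it and
// resumes scanning after the block.
// Second branch: whitespace outside quotes, the only thing left to split on.
$parts = preg_split("~'[^']*'(*SKIP)(*F)|\s+~", $str);

print_r($parts);
```

This should print the five items listed as the desired result in the question.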

Often, when you want to split a string, preg_split isn't the best approach (that seems a little counterintuitive, but it's true most of the time). A more efficient way is to find all the items (with preg_match_all) using a pattern that describes everything that is not the delimiter (whitespace here):

$pattern = <<<'EOD'
~(?=\S)[^'"\s]*(?:'[^']*'[^'"\s]*|"[^"]*"[^'"\s]*)*~
EOD;

if (preg_match_all($pattern, $str, $m))
    $result = $m[0];

Pattern details (laid out with spaces and comments for readability, as under the x modifier):

~                    # pattern delimiter

(?=\S)               # the lookahead assertion only succeeds if there is a
                     # non-white-space character at the current position.
                     # (This lookahead is useful for two reasons:
                     #    - it lets the regex engine quickly find the start of
                     #      the next item without having to test each branch of
                     #      the following alternation at each position in the
                     #      string until one succeeds.
                     #    - it ensures that there is at least one
                     #      non-white-space character. Without it, the pattern
                     #      could match an empty string.
                     # )

[^'"\s]*             # everything that is not a quote or a white-space

(?:                  # optional quoted parts
    '[^']*' [^'"\s]*   # single quotes
  |
    "[^"]*" [^'"\s]*   # double quotes
)*
~

Note that with this slightly longer pattern, the five items of your example string are found in only 60 steps. You can also use this shorter, simpler pattern:

~(?:[^'"\s]+|'[^']*'|"[^"]*")+~

but it's a little less efficient.
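
As a quick check, the following sketch runs both patterns and compares their results. The input is a variant of the question's string in which one value uses double quotes; that variant is an assumption added here to exercise both quote branches, and the variable names are illustrative.

```php
<?php
// Variant of the question's string with one double-quoted value.
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:\"2015-10-09 15:05\" yyyy asdf";

// Longer pattern from the answer (nowdoc avoids any quote escaping).
$long = <<<'EOD'
~(?=\S)[^'"\s]*(?:'[^']*'[^'"\s]*|"[^"]*"[^'"\s]*)*~
EOD;

// Shorter pattern from the answer.
$short = <<<'EOD'
~(?:[^'"\s]+|'[^']*'|"[^"]*")+~
EOD;

preg_match_all($long, $str, $m1);
preg_match_all($short, $str, $m2);

// Both patterns should return the same five items for this input.
var_dump($m1[0] === $m2[0]);
print_r($m1[0]);
```

The sketch only compares results; the step counts mentioned above come from a regex debugger and are not measured here.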

Woke answered 9/10, 2015 at 16:53 Comment(3)

Thank you for this detailed answer! A few more things I would like to know: regarding "but that's true most of the time", is there a rule of thumb or a link I can read about when and why to use which one? How did you write the regex? Do you have a tool for that, or do you know the regex rules and just write it down? If you just write it down, how did you learn the regex rules? – Varistor 10/10, 2015 at 7:20

@caramba: It's more of a rule of thumb, but the ideas behind it are relatively simple: 1) when the delimiter must take its surroundings into account, the pattern quickly becomes complicated and inefficient (in particular if you need to check the preceding characters, or if you need to check the string until the end with a lookahead). 2) Sometimes it is easier to define something by negation. – Woke 10/10, 2015 at 10:58

@caramba: How I write a pattern in general comes from knowledge, practice and testing. For example, a pattern like (?:[^'\s]+|'[^']*')*+ is more efficient if you "unroll" it, like this: [^'\s]*(?:'[^']*'[^'\s]*)*+. You can find this information in the Friedl book, but you can also see it with regex101 or RegexBuddy, which display the number of steps needed. But even with knowledge and recipes, you always need to experiment; in particular, you must know your enemy well: the string. – Woke 10/10, 2015 at 11:10
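
To illustrate the "unrolled" form, here is a small sketch that checks that both variants accept and reject the same tokens. The ^ and $ anchors and the test tokens are additions made for this check, not part of the comment above, and the sketch does not measure the number of steps.

```php
<?php
// Alternation form, anchored so it has to accept or reject a whole token.
$alternation = <<<'EOD'
~^(?:[^'\s]+|'[^']*')*+$~
EOD;

// Unrolled form of the same pattern, anchored the same way.
$unrolled = <<<'EOD'
~^[^'\s]*(?:'[^']*'[^'\s]*)*+$~
EOD;

// Hypothetical tokens: two valid ones, one containing an unquoted space,
// and one with an unbalanced quote.
$tokens = ["dateto:'2015-10-07 15:05'", "xxxx", "two words", "it's"];

$results = [];
foreach ($tokens as $token) {
    $a = preg_match($alternation, $token);
    $b = preg_match($unrolled, $token);
    $results[] = [$a, $b];
    printf("%-28s alternation=%d unrolled=%d\n", $token, $a, $b);
}
```

Both forms should accept the first two tokens and reject the last two.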

For your example, you can use preg_split with a negative lookbehind (?<!\d), e.g.:

<?php
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
$matches = preg_split('/(?<!\d)(\s)/', $str);
print_r($matches);

Output:

    Array
    (
        [0] => dateto:'2015-10-07 15:05'
        [1] => xxxx
        [2] => datefrom:'2015-10-09 15:05'
        [3] => yyyy
        [4] => asdf
    )

Demo:

http://ideone.com/EP06Nt

Regex Explanation:

(?<!\d)(\s)

Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\d)»
   Match a single character that is a “digit” «\d»
Match the regex below and capture its match into backreference number 1 «(\s)»
   Match a single character that is a “whitespace character” «\s»
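
As the explanation shows, this pattern looks only at the character before each whitespace character, not at the quotes. The following sketch illustrates this with the question's string and with a second input that is purely hypothetical, chosen so that a digit precedes the spaces outside the quotes and a letter precedes the space inside them.

```php
<?php
$pattern = '/(?<!\d)(\s)/';

// The question's string: the spaces inside the quotes follow a digit,
// so the lookbehind prevents a split there.
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
$parts = preg_split($pattern, $str);
print_r($parts);

// Hypothetical input: the spaces after "item1" and "item2" follow a digit
// (no split), while the space inside the quotes follows a letter (split).
$other = "item1 item2 'a b'";
$otherParts = preg_split($pattern, $other);
print_r($otherParts);
```

So the approach fits inputs shaped like the example, where every space to keep follows a digit and every space to split on does not.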

Darill answered 9/10, 2015 at 15:9 Comment(1)

Thank you! OK, "negative lookbehind", but where the heck is the ' defined? How would I change it if I had dateto:"has-double-quotes"? – Varistor 9/10, 2015 at 15:17
