Parse kebab-case sentence with predictable components

Example strings:

accuracy-is-5

accuracy-is-5-or-15

accuracy-is-5-or-15-or-20

package-is-dip-8-or-dip-4-or-dip-16

My current regexp:

/^([a-z0-9\-]+)\-is\-([a-z0-9\.\-]*[a-z0-9])(?:\-or\-([a-z0-9\.\-]*[a-z0-9]))*$/U

The number of values is not fixed; the part:

\-or\-[a-z0-9\.\-]

can be repeated.

But for the string accuracy-is-5-or-15-or-20, I get:

Array (
    [0] => accuracy-is-5-or-15-or-20
    [1] => accuracy
    [2] => 5
    [3] => 20
)

Where is 15?

Orchid answered 27/5, 2015 at 12:38 Comment(2)
When a capture group is repeated, the last value overwrites the previous ones. Karly
Sorry, I don't understand the answer below from vks. How can I capture all the values in my example? Orchid

When a capture group is repeated in a pattern, each new value overwrites the previous one, so only the last is kept. It is therefore not possible to collect every value with a pattern designed like this and preg_match.
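
As a minimal sketch of that overwriting behaviour, the snippet below uses a simplified pattern (an assumption for illustration: values are plain digits, which is narrower than the pattern in the question). The repeated group 3 matches "15" and then "20", and only the final iteration is reported.

```php
<?php
// Simplified stand-in for the question's pattern (assumption: values are digits only).
$pattern = '/^([a-z]+)-is-(\d+)(?:-or-(\d+))*$/';

preg_match($pattern, 'accuracy-is-5-or-15-or-20', $m);

// Group 3 matched twice ("15", then "20"); only the last iteration is kept,
// so "15" does not appear anywhere in $m.
print_r($m);
```

This is why the approaches in this thread switch to preg_match_all and match one value per iteration.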

A possible workaround is to use preg_match_all, which searches for all occurrences of a pattern, together with the \G anchor, which matches at the position where the previous match ended. The pattern must be written to find one value at a time.

The \G ensures that all matches are contiguous. To be sure that the end of the string has been reached (in other words, that the string is correctly formatted from start to end), a convenient way is to put an empty capture group at the end of the pattern. If this capture group appears in the last match, the format is correct.

define('PARSE_SENTENCE_PATTERN', '~
(?:                                       # two possible beginnings:
    \G(?!\A)                              # - immediately after a previous match
  |                                       # OR
    \A                                    # - at the start of the string
    (?<subject> \w+ (?>[-.]\w+)*? ) -is-  #  (in this case the subject is captured)
)
(?<value> \w+ (?>[-.]\w+)*? )  # capture the value
(?: -or- | \z (?<check>) )     # must be followed by "-or-" OR the end of the string \z
                               # (then the empty capture group "check" is created)
~x');

function parseSentence ($sentence) {

    // the "check" group only exists in the last match if the end of the string was reached
    if (preg_match_all(PARSE_SENTENCE_PATTERN, $sentence, $matches, PREG_SET_ORDER) &&
        isset(end($matches)['check']) ) {
        return [ 'subject' => $matches[0]['subject'],
                 'values'  => array_column($matches, 'value') ]; // one value per match
    }

    return false; // wrong format

}

// tests
$test_strings = ['accuracy-is-5', 'accuracy-is-5-or-15', 'accuracy-is-5-or-15-or-20',
                 'package-is-dip-8-or-dip-4-or-dip-16',
                 'bad-format', 'bad-format-is-', 'bad-format-is-5-or-'];

foreach ($test_strings as $test_string) {
    var_dump(parseSentence($test_string));
}
Karly answered 27/5, 2015 at 14:46 Comment(0)
^\w+(?:-[a-zA-Z]+)+\K|\G(?!^)-(\d+)(?:(?:-[a-zA-Z]+)+|$)

You can use \G here to capture all the values. Whenever a capture group is repeated, the last value overwrites the previous ones. See demo.

https://regex101.com/r/tS1hW2/3

\G asserts the position at the end of the previous match, or at the start of the string for the first match.
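
To make the effect of \G concrete, here is a small sketch on a made-up input string (the string and both patterns are assumptions chosen for illustration, not taken from the question). Without \G the engine is free to skip over text to find the next match; with \G every match has to begin exactly where the previous one ended.

```php
<?php
// Without \G: the "x-" in the middle is skipped and "3" is still found.
preg_match_all('/\d+-?/', '1-2-x-3', $loose);

// With \G: each match must start where the previous one ended,
// so matching stops at the first character that does not fit ("x").
preg_match_all('/\G\d+-?/', '1-2-x-3', $contiguous);

print_r($loose[0]);
print_r($contiguous[0]);
```

That contiguity is what lets a \G-based pattern walk through the "-or-" separated values one at a time without jumping over malformed parts.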

EDIT:

^\w+-is(?:-dip)?\K|\G(?!^)-(\d+)(?:-or(?:-dip)?|$)

You can use this if you are sure of "is", "or" and "dip". See demo.

https://regex101.com/r/tS1hW2/4

$re = "/^\\w+-is(?:-dip)?\\K|\\G(?!^)-(\\d+)(?:-or(?:-dip)?|$)/m"; 
$str = "accuracy-is-5\naccuracy-is-5-or-15\naccuracy-is-5-or-15-or-20\npackage-is-dip-8-or-dip-4-or-dip-16"; 

preg_match_all($re, $str, $matches);
Laconism answered 27/5, 2015 at 12:43 Comment(7)
This example is too difficult to understand. Can you show how to use "\G" with my regexp? Orchid
Example: "package-is-dip-8-or-dip-4-or-dip-16". "package" is an attribute name and is variable (like size, length, etc.). "-is-" is always present in the string (exactly once). "dip-8" is an option for the attribute and is variable (package may be dip-8, dip-4, etc., or black, white ... window, door). "-or-" is present only if the attribute has more than one option. Orchid
preg_match('/^\w+-is(?:-dip)?\K|\G(?!^)-(\d+)(?:-or(?:-dip)?|$)/m', 'accuracy-is-5-or-15-or-20', $matches); print_r($matches); Result: Array ( [0] => ) Orchid
The regexp should not contain "dip". It's variable, an option name. "accuracy-is-5-or-15", "package-is-dip-8" .... "window-color-is-white-or-black" Orchid
pastebin.com/cruWk8MT No :( From the string "package-is-dip-8-or-dip-4-or-dip-16" I need to get an array containing the substrings "package" (the attribute name) and "dip-8", "dip-4", "dip-16" (the attribute options). Orchid
It's not correct. The array contains only numbers (8, 4, 16). I need them in the format "dip-8", "dip-4" and "dip-16". And the attribute name (package) is missing. Orchid
So close now :) pastebin.com/zA0zGeTu But where is the substring "package" in the array? And the first element is an empty value. Orchid

Perhaps it will be easier to understand if the subpatterns are declared as individual self-describing variables and the pattern is built via interpolation. Also, to avoid any messy clean-up of the $matches array, my pattern only populates fullstring matches (no capture groups), which means you only need to access the first element of the matches array.

\K means "forget the previously matched characters"; in other words, "restart the fullstring match from here".

\G means "match from the start of the input string or from the point where the previous match left off".

The lookahead that follows the match of the "subject" of the sentence ensures that only a fully valid "sentence" will qualify.
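
Before the full solution, a stripped-down sketch of how \K and \G cooperate may help. It assumes digit-only values and deliberately omits the validating lookahead, so it is not a replacement for the code below; the variable name $values is hypothetical.

```php
<?php
// First alternative: start of string, subject, "-is-".
// Second alternative: continue right after the previous match, then "-or-".
// In both cases \K discards what was matched so far, so only the digits are returned.
preg_match_all(
    '/(?:^[a-z]+-is-|\G(?!^)-or-)\K\d+/',
    'accuracy-is-5-or-15-or-20',
    $values
);

print_r($values[0]);
```

Because no capture groups are used, every value sits in the fullstring matches at index 0.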

Code:

$tests = [
    'package',
    'accuracy-is-5',
    'accuracy-is-5-or-15',
    'accuracy-is-5-or-15-or-20',
    'package-is-dip-8-or-dip-4-or-dip-16',
    'bad-format',
    'bad-format-is-',
    'bad-format-is-5-or-',
];

$noun = '(?:dip-)?\d+'; // valid value subpattern
$verb = '-is-';  // literal -is- subpattern
$conjunction = '-or-';  // literal -or- subpattern
$subject = "^[a-z\d-]+"; // match leading word(s)
$predicate = "$verb$noun(?:$conjunction$noun)*$"; // lookahead for the valid remainder of string 
$continue = '\G(?!^)';  // continue from point of last match, but not the start of the string

foreach ($tests as $test) {
    if (preg_match_all("/(?:$subject(?=$predicate)|$continue(?:$verb|$conjunction)\K$noun)/", $test, $m)) {
        echo json_encode($m[0]) . "\n";
    }
}

Output:

["accuracy","5"]
["accuracy","5","15"]
["accuracy","5","15","20"]
["package","dip-8","dip-4","dip-16"]
Vincenzovincible answered 10/6 at 7:28 Comment(0)
