How to get all captures of subgroup matches with preg_match_all()? [duplicate]

Update/Note:

I think what I'm probably looking for is to get the captures of a group in PHP.

Referenced: PCRE regular expressions using named pattern subroutines.

(Read carefully:)


I have a string that contains a variable number of segments (simplified):

$subject = 'AA BB DD '; // could be 'AA BB DD CC EE ' as well

I would like now to match the segments and return them via the matches array:

$pattern = '/^(([a-z]+) )+$/i';
$result = preg_match_all($pattern, $subject, $matches);

This will only return the last match for the capture group 2: DD.
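The behaviour can be reproduced with a few lines. This is only a minimal sketch using the `$subject` and `$pattern` from above; the comments show what `$matches` holds after the call.

```php
<?php
$subject = 'AA BB DD ';
$pattern = '/^(([a-z]+) )+$/i';

// The anchored pattern matches the whole subject exactly once.
$result = preg_match_all($pattern, $subject, $matches);

// $result     === 1
// $matches[0] === ['AA BB DD ']  (the full match)
// $matches[1] === ['DD ']        (group 1: last iteration only)
// $matches[2] === ['DD']         (group 2: last iteration only)
print_r($matches);
```

AA and BB are consumed by the match, but their captures are overwritten on each repetition of the outer group.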

Is there a way that I can retrieve all subpattern captures (AA, BB, DD) with one regex execution? Isn't preg_match_all suitable for this?

This question is a generalization.

Both the $subject and $pattern are simplified. Naturally, with such simple input, the list of AA, BB, ... is much easier to extract with other functions (e.g. explode) or with a variation of the $pattern.

But I'm specifically asking how to return all of the subgroup matches with the preg_...-family of functions.

For a real life case imagine you have multiple (nested) level of a variant amount of subpattern matches.

Example

This is an example in pseudo code to describe a bit of the background. Imagine the following:

Regular definitions of tokens:

   CHARS := [a-z]+
   PUNCT := [.,!?]
   WS := [ ]

$subject gets tokenized based on these. The tokenization is stored inside an array of tokens (type, offset, ...).

That array is then transformed into a string, containing one character per token:

   CHARS -> "c"
   PUNCT -> "p"
   WS -> "s"

So that it's now possible to run regular expressions based on tokens (and not character classes etc.) on the token stream string index. E.g.

   regex: (cs)?cp

to express one or two groups of chars (separated by whitespace) followed by a punctuation token.
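The tokenization step described above could look roughly like the following. This is a sketch under assumptions: the function name `tokenize`, the token array layout (`type`, `offset`, `text`) and the sample input are hypothetical and not taken from the real code.

```php
<?php
// Hypothetical helper: splits $subject into tokens of type c, p or s.
function tokenize(string $subject): array
{
    // Regular definitions of the tokens, keyed by their one-character code.
    $types = ['c' => '[a-z]+', 'p' => '[.,!?]', 's' => '[ ]'];
    $tokens = [];
    $offset = 0;
    $length = strlen($subject);

    while ($offset < $length) {
        foreach ($types as $type => $regex) {
            // \G pins the match to the current offset.
            if (preg_match('/\G' . $regex . '/i', $subject, $m, 0, $offset)) {
                $tokens[] = ['type' => $type, 'offset' => $offset, 'text' => $m[0]];
                $offset += strlen($m[0]);
                continue 2; // next iteration of the while loop
            }
        }
        throw new InvalidArgumentException("Unexpected character at offset $offset");
    }

    return $tokens;
}

$tokens = tokenize('Hello world!');

// One character per token: CHARS WS CHARS PUNCT
$stream = implode('', array_column($tokens, 'type'));

// The token regex now runs on the token stream, not on the text.
$isSentence = preg_match('/^(cs)?cp$/', $stream) === 1;
```

Because every token becomes exactly one character, an offset in `$stream` is also an index into `$tokens`.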

As I now can express self-defined tokens as regex, the next step was to build the grammar. This is only an example, this is sort of ABNF style:

   words = word | (word space)+ word
   word = CHARS+
   space = WS
   punctuation = PUNCT

If I now compile the grammar for words into a (token) regex I would like to have naturally all subgroup matches of each word.

  words = (CHARS+) | ( (CHARS+) WS )+ (CHARS+)    # words resolved to tokens
  words = (c+)|((c+)s)+(c+)                       # words resolved to regex

I could code until this point. Then I ran into the problem that the sub-group matches did only contain their last match.
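To make the problem concrete on a token stream, here is a sketch. The `$tokens` array and its layout are hypothetical, and the compiled regex is wrapped in a non-capturing group with `^` and `$` (an assumption) so that both alternatives must consume the whole stream. `PREG_OFFSET_CAPTURE` is used because an offset in the token string is also an index into the token array.

```php
<?php
// Hypothetical token array for the text "foo bar baz".
$tokens = [
    ['type' => 'c', 'text' => 'foo'],
    ['type' => 's', 'text' => ' '],
    ['type' => 'c', 'text' => 'bar'],
    ['type' => 's', 'text' => ' '],
    ['type' => 'c', 'text' => 'baz'],
];
$stream = implode('', array_column($tokens, 'type')); // "cscsc"

// words = (c+) | ((c+)s)+(c+), anchored to the whole stream.
preg_match('/^(?:(c+)|((c+)s)+(c+))$/', $stream, $m, PREG_OFFSET_CAPTURE);

// Group 3 is inside the repeated group: only its last iteration survives.
$lastRepeatedWord = $tokens[$m[3][1]]['text']; // "bar"
// Group 4 is the final word after the repetition.
$finalWord = $tokens[$m[4][1]]['text'];        // "baz"
```

The first word ("foo") was matched by an earlier iteration of group 3, but its capture is gone after the match.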

So I have the option to either write an automaton for the grammar on my own (which I would like to avoid, to keep the grammar expressions generic) or to somehow make preg_match work for me so I can spare myself that.

That's basically all. Probably now it's understandable why I simplified the question.



Lacerate answered 16/6, 2011 at 11:41 Comment(8)
If you're generalising your question so much that alternative though correct answers can be given, your question isn't that valuable. Don't simplify if you don't want the simplified answers. -1.Teeming
I'm looking for an answer on a specific topic. I don't see why simplification should be bad to make this visible, albeit I see that a certain level of abstractness can be a burden.Lacerate
Well, obviously, because you want an answer on a subgroup, while your example doesn't include the need for a subgroup. The example is flawed.Teeming
@Berry Langerak: There is always some loss in simplification. You find a more detailed example added now.Lacerate
Just stumbled over: J (PCRE_INFO_JCHANGED) - The (?J) internal option setting changes the local PCRE_DUPNAMES option. Allow duplicate names for subpatterns which might not solve this here but is generally interesting: php.net/manual/en/reference.pcre.pattern.modifiers.phpLacerate
Could preg_split be extrapolated? Split string by delimiter, but not if it is escaped.Lacerate
a https://mcmap.net/q/664551/-how-to-find-leaf-arrays-in-nested-arrays of q https://mcmap.net/q/664551/-how-to-find-leaf-arrays-in-nested-arrays/367456Lacerate
Another related question is: Collapse and Capture a Repeating Pattern in a Single Regex Expression - It got some attention lately.Lacerate
D
4

Similar thread: Get repeated matches with preg_match_all()

Check the chosen answer there, plus mine, which might be useful. I will duplicate it here:

From http://www.php.net/manual/en/regexp.reference.repetition.php :

When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration.

I personally gave up and am going to do this in 2 steps.
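A two-step version for the simplified subject from the question might look like this. It is only a sketch; the function name `captureSegments` is hypothetical.

```php
<?php
// Hypothetical helper: returns all segments, or null if the subject is invalid.
function captureSegments(string $subject): ?array
{
    // Step 1: validate that the whole subject matches the grammar.
    if (preg_match('/^(?:[a-z]+ )+$/i', $subject) !== 1) {
        return null;
    }

    // Step 2: extract every segment with an unanchored pattern.
    preg_match_all('/([a-z]+) /i', $subject, $matches);

    return $matches[1];
}

$segments = captureSegments('AA BB DD ');  // ['AA', 'BB', 'DD']
$invalid  = captureSegments('AA BB DD #'); // null
```

The cost is that the grammar is effectively written twice: once for validation and once for extraction.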

EDIT:

I see that in that other thread someone claimed that a lookbehind method is able to do it.
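The lookbehind pattern from that thread is not reproduced here. A related single-call technique, shown only as a sketch and not as that thread's method, uses the PCRE `\G` anchor so that every match must start exactly where the previous one ended. Whether the whole subject was consumed is then checked by comparing lengths.

```php
<?php
$subject = 'AA BB DD ';

// \G matches at the start of the subject and then at the end of each
// previous match, so the matches are contiguous from offset 0.
$count = preg_match_all('/\G([a-z]+) /i', $subject, $matches);

// The subject is fully consumed only if the matches add up to its length.
$consumed = strlen(implode('', $matches[0]));
$isFullMatch = $count > 0 && $consumed === strlen($subject);

// $matches[1] === ['AA', 'BB', 'DD'], $isFullMatch === true
```

With `'AA BB DD #'` the same three segments are returned, but `$isFullMatch` is false because the trailing `#` is never consumed.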

Dollhouse answered 17/6, 2014 at 17:23 Comment(0)
D
3

Try this:

preg_match_all("'[^ ]+'i",$text,$n);

$n[0] will contain an array of all non-space character groups in the text.

Edit: with subgroups:

preg_match_all("'([^ ]+)'i",$text,$n);

Now $n[1] will contain the subgroup matches, that are exactly the same as $n[0]. This is pointless actually.

Edit2: nested subgroups example:

$test = "Hello I'm Joe! Hi I'm Jane!";
preg_match_all("/(H(ello|i)) I'm (.*?)!/i",$test,$n);

And the result:

Array
(
    [0] => Array
        (
            [0] => Hello I'm Joe!
            [1] => Hi I'm Jane!
        )

    [1] => Array
        (
            [0] => Hello
            [1] => Hi
        )

    [2] => Array
        (
            [0] => ello
            [1] => i
        )

    [3] => Array
        (
            [0] => Joe
            [1] => Jane
        )

)
Dior answered 16/6, 2011 at 11:51 Comment(4)
I'm interested in the matches of a variant number of subgroup matches. Your regex does not have any subgroups.Lacerate
Well then I don't understand your question. There is no need for subgroups for the matching you asked for.Dior
It's not only you who doesn't understand the question. The question itself is completely wrong because Hakre can't explain himself. -1 for the question.Xylem
I've added a little more info to make visible that it has a certain level of abstraction / generalization.Lacerate
H
2

Is there a way that I can retrieve all matches (AA, BB, DD) with one regex execution? Isn't preg_match_all suitable for this?

Your current regex seems to be for a preg_match() call. Try this instead:

$pattern = '/[a-z]+/i';
$result = preg_match_all($pattern, $subject, $matches);

Per comments, the ruby regex I mentioned:

sentence = %r{
(?<subject>   cat   | dog        ){0}
(?<verb>      eats  | drinks     ){0}
(?<object>    water | bones      ){0}
(?<adjective> big   | smelly     ){0}
(?<opt_adj>   (\g<adjective>\s)? ){0}
The\s\g<opt_adj>\g<subject>\s\g<verb>\s\g<opt_adj>\g<object>
}x

md = sentence.match("The cat drinks water");
md = sentence.match("The big dog eats smelly bones");

But I think you'll need a lexer/parser/tokenizer to do the same kind of thing in PHP. :-|
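For comparison, PCRE (and therefore PHP's preg functions) has a similar feature: a `(?(DEFINE)...)` block with named subroutine calls written as `(?&name)`. The following is a sketch of the same sentence grammar under that assumption; the `^` and `$` anchors are an addition here.

```php
<?php
// The DEFINE block only declares the named groups; it never matches itself.
$sentence = '/
(?(DEFINE)
  (?<subject>   cat   | dog    )
  (?<verb>      eats  | drinks )
  (?<object>    water | bones  )
  (?<adjective> big   | smelly )
  (?<opt_adj>   (?:(?&adjective)\s)? )
)
^The\s(?&opt_adj)(?&subject)\s(?&verb)\s(?&opt_adj)(?&object)$
/x';

$first  = preg_match($sentence, 'The cat drinks water');          // 1
$second = preg_match($sentence, 'The big dog eats smelly bones'); // 1
$third  = preg_match($sentence, 'The cat barks water');           // 0
```

This validates the sentence, but it does not solve the capture problem: values captured inside a subroutine call are not kept after the call returns, so the named groups stay empty in the match array.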

Honeysuckle answered 16/6, 2011 at 18:25 Comment(16)
Please read the longer example at the end. I'm really looking into subgroup pattern matching over a full match that spares me to write a parser for groups and repetition of the BNF grammar. Therefore I need all of the (sub) matches while consuming the whole subject. preg_match_all will from its subpatterns always return the last match when those can have a repetition.Lacerate
I think what you're trying to do is achievable with named groups and a recursive regex, but I'm not sure that PHP supports the latter. You might be able to manage it in ruby, though.Honeysuckle
I'll chew on it a bit this evening.Honeysuckle
Btw, what's wrong with the idea of doing: $pattern = '/regex1|regex2/' in my above suggestion? You'd arguably need to test each one for punctuation, but at least they'll be split properly and the individual word/punct groups will be extracted, no?Honeysuckle
No because it's grammar: There is at least one group per word and there is the semantics of the words together to form the next word of the grammar. So it's stacked. And it's with optional repetition inside these stacks. So if I only could grab the data of the matches, this would be perfect. However it's returning only the last backreference. would be cool to have a stack of backreference even after regex execution.Lacerate
Last question... Have you looked into PHP-based lexers and tokenizers? I ask, because it may be that what you're trying to parse won't necessarily be achievable using regular expressions.Honeysuckle
Yes I did but I'm always open to suggestions. I experimented with the pear, the lemon and the java one. As for chomsky: I have the code to validate already the whole value and it works great. My problem is the slicing, so that actually I come one step ahead from tokens into the elements of the grammar.Lacerate
Yeah, the thing is, I'm doubtful that you'll be able to manage this using regular expressions. I could arguably post the regex from p.135 of "Programming Ruby 1.9", but a) I doubt it works in PHP (in fact, I'm nearly certain it doesn't, due to the recursive regex flavor) and b) it still suffers from not matching all of the individual tokens. (The syntax is, basically, /(?<subject>cat|dog)(?<verb>meows|barks)The\s\g<subject>\s\g<verb>/ with a recursive twist to it.)Honeysuckle
(I've added the above-mentioned regex, for information.)Honeysuckle
The issue is, the catching problem is still around I think. I'm pretty sure that replacing the (\g<adjective>\s)? with (\g<adjective>\s)+ would yield an issue similar to that which you're getting with preg_match_all().Honeysuckle
That said, my previous comment prompted a thought. Why not match and capture ([a-z]+ )+ and explode() the result?Honeysuckle
I do not understand what you mean. What should explode do? if this gets somewhere deeper, explode is linear.Lacerate
It was a silly comment. I considered it further and it would be equivalent to calling explode(' ', $str) directly. :-(Honeysuckle
php also allows definitions: https://mcmap.net/q/21104/-regex-to-validate-jsonBerna
@useless: since a very recent version, yes.Honeysuckle
I would not know what you consider as recent, but php 5.2 has been around for around 8 years (since 2006), I am sure 5.2 supports it and am almost sure that any php 5.0 also does.Berna
M
1

You can't extract the subpatterns because the way you wrote your regex returns only one match (using ^ and $ at the same time, and + on the main pattern).

If you write it this way, you'll see that your subgroups are correctly there:

$pattern = '/(([a-z]+) )/i';

(this still has an unnecessary set of parentheses, I just left it there for illustration)

Moffitt answered 16/6, 2011 at 12:1 Comment(9)
Is it possible to make the expression always consume the whole subject?Lacerate
@Lacerate My regex? Yes, it will. It will return all the patterns that match the rule. Actually '/([a-z]+) /i' should be enough.Moffitt
When I add a # to the end of subject, it does return matches albeit it does not consume the whole $subject. I had added start and end marker to my pattern because I wanted to stretch it over the full contents of $subject.Lacerate
@Lacerate What do you want to happen when a # is added at the end of string exactly? Your pattern consumes the whole string, the # will just not be matched. If you need it to be matched, you need a different regex. Please explain what do you exactly want.Moffitt
Hmm, so you do not see a way to use ^ and $ within the pattern? I was building a parser that transforms an ABNF into regex and I want to preserve the matching of subgroups but the grammar needs to always match all words in sentences and groups - as a whole.Lacerate
@Lacerate Nope. Then you will match the whole string (which is not your goal). I could help if you clarified what you exactly want to happen.Moffitt
I want to match the whole string, but I want to get all subpattern matches as well - perhaps it's not possible with preg_match_all. That's just what I would like to know.Lacerate
@Lacerate Possibly you could match the whole string with preg_match(), and if it is fine, run the preg_match_all() to extract the values.Moffitt
@bazmegakapa: I added an example for some background info.Lacerate
I
0

Edit

I didn't realize what you had originally asked for. Here is the new solution:

$result = preg_match_all('/[a-z]+/i', $subject, $matches);
$resultArr = ($result) ? $matches[0] : array();
Iliac answered 16/6, 2011 at 11:47 Comment(1)
That regex does not have any subgroups. I was looking for matches of subgroups specifically.Lacerate
G
0

How about:

$str = 'AA BB CC';
$arr = preg_split('/\s+/', $str);
print_r($arr);

output:

Array
(
    [0] => AA
    [1] => BB
    [2] => CC
)
Gnomon answered 16/6, 2011 at 12:25 Comment(0)
N
0

I may have misunderstood what you're describing. Are you just looking for a pattern for groups of letters with whitespace between?

// any subject containing words:
$subject = 'AfdfdfdA BdfdfdB DdD'; 
$subject = 'AA BB CC';
$subject = 'Af df dfdA Bdf dfdB DdD';

$pattern = '/(([a-z]+)\s)+[a-z]+/i';

$result = preg_match_all($pattern, $subject, $matches);
print_r($matches);
echo "<br/>";
print_r($matches[0]);  // this matches $subject
echo "<br/>".$result;
Nippon answered 16/6, 2011 at 13:3 Comment(0)
R
0

Yes, you're right: the solution is to use preg_match_all. preg_match_all repeats the match across the whole subject, so don't use the start anchor ^ and the end anchor $; that way preg_match_all puts all found patterns into an array.

Each new pair of parentheses adds a new array holding that group's matches.

Use ? for optional matches.

You can separate different groups of patterns with parentheses () to ask for a group to be found and added to a new array (this lets you count matches, or categorize each match from the returned array).
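How the returned array is laid out depends on the flags passed to preg_match_all. A short sketch with an illustrative subject and pattern: the default `PREG_PATTERN_ORDER` gives one array per group, while `PREG_SET_ORDER` gives one array per match, and `PREG_OFFSET_CAPTURE` adds the byte offset to each capture.

```php
<?php
$subject = 'AA BB DD CD';
$pattern = '/([a-z])([a-z]+)/i';

// Default order: one array per pair of parentheses.
preg_match_all($pattern, $subject, $byGroup);
// $byGroup[1] === ['A', 'B', 'D', 'C']
// $byGroup[2] === ['A', 'B', 'D', 'D']

// Set order with offsets: one array per match, each entry is [text, offset].
preg_match_all($pattern, $subject, $byMatch, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
// $byMatch[3] === [['CD', 9], ['C', 9], ['D', 10]]

$matchCount = count($byMatch); // 4
```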

Clarification required

Let me try to understand your question, so that my answer matches what you ask.

  1. Your $subject is not a good example of what you're looking for?

  2. You would like the preg_match search to split what you provided in $subject into 4 categories: words, characters, punctuation and white space? And what about numbers?

  3. You would also like the returned matches to have their offsets specified?

Does $subject = 'aa.bb cc.dd EE FFF,GG'; better fit a real-life example?

I will take the basic example in your $subject and make it work to give you exactly what you asked for.

So can you edit your $subject so that it better fits all the cases that you want to match?

Original '/^(([a-z]+) )+$/i';

Keep me posted, you can test your regexes here http://www.spaweditor.com/scripts/regex/index.php

Partial answer

/([a-z])([a-z]+)/i

AA BB DD CD

Array
(
    [0] => Array
        (
            [0] => AA
            [1] => BB
            [2] => DD
            [3] => CD
        )

    [1] => Array
        (
            [0] => A
            [1] => B
            [2] => D
            [3] => C
        )

    [2] => Array
        (
            [0] => A
            [1] => B
            [2] => D
            [3] => D
        )

)
Ricci answered 7/10, 2012 at 3:34 Comment(6)
No that is not the solution. Your example can not even validate that the whole string matches the regex, you've just shifted the problem onto a subset of the string instead of the whole string. Also where are the subgroups and all their matches/captures?Lacerate
I want to run preg_match_all and want to get all subgroup captures, not only the last ones.Lacerate
@Lacerate there are 2 1/2 types of subgroups, because your regex is flawed. All proper answers will be wrong; we don't know what kind of results you want. Give us an example of the result array you want.Ricci
((a)(b)){2} => return the two outer group matches, return the two inner group matches which then exist two times for example. This example could be a subgroup as well, not only the whole pattern. AFAIK this is not possible with PHP's regex engine in one go.Lacerate
I should put the example I give in the question into code so that its abstract character gets some more "hands-on-like" representation. That should help maybe.Lacerate
preg_match_all is recursive, so don't use start-with ^ and end-with $. As your regex stands, it will only give you a submatch on something that matches everything, which is the last DD_Ricci
