How to capture an arbitrary number of groups in JavaScript Regexp?
R

6

98

I would expect this line of JavaScript:

"foo bar baz".match(/^(\s*\w+)+$/)

to return something like:

["foo bar baz", "foo", " bar", " baz"]

but instead it returns only the last captured match:

["foo bar baz", " baz"]

Is there a way to get all the captured matches?

Romeyn answered 21/8, 2010 at 14:8 Comment(1)
This question didn't come up in my searches. The phrase "arbitrary number of groups" is ambiguous. The phrase "repeating group" is clearer, IMO.Raisin
T
106

When you repeat a capturing group, in most flavors, only the last capture is kept; any previous capture is overwritten. In some flavors, e.g. .NET, you can get all intermediate captures, but this is not the case with JavaScript.

That is, in JavaScript, if you have a pattern with N capturing groups, you can capture at most N strings per match, even if some of those groups were repeated.

So generally speaking, depending on what you need to do:

  • If it's an option, split on delimiters instead
  • Instead of matching /(pattern)+/, maybe match /pattern/g, perhaps in an exec loop
    • Do note that these two aren't exactly equivalent, but it may be an option
  • Do multilevel matching:
    • Capture the repeated group in one match
    • Then run another regex to break that match apart
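As a rough sketch of the multilevel approach applied to the string from the question: first check the whole string against the anchored pattern, then run a second, global regex for the repeated piece. The helper name `captureRepeats` and its two-pattern signature are hypothetical, not a built-in API.

```javascript
// Hypothetical helper: validate the whole string with one pattern,
// then pull out every repetition with a second, global pattern.
function captureRepeats(input, wholePattern, piecePattern) {
  if (!wholePattern.test(input)) {
    return null; // the overall shape does not match
  }
  // With the g flag, match() returns every match of the piece.
  const pieces = input.match(piecePattern) || [];
  // Assumes wholePattern is anchored, so the full match is the input itself.
  return [input, ...pieces];
}

const captured = captureRepeats("foo bar baz", /^(\s*\w+)+$/, /\s*\w+/g);
console.log(captured); // [ 'foo bar baz', 'foo', ' bar', ' baz' ]
```

Note that the two patterns have to be kept in sync by hand; that is the price of this workaround.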

Example

Here's an example of matching <some;words;here> in a text, using an exec loop, and then splitting on ; to get individual words (see also on ideone.com):

var text = "a;b;<c;d;e;f>;g;h;i;<no no no>;j;k;<xx;yy;zz>";

var r = /<(\w+(;\w+)*)>/g;

var match;
while ((match = r.exec(text)) != null) {
  console.log(match[1].split(";").join(",")); // print() only exists in some JS shells
}
// c,d,e,f
// xx,yy,zz

The pattern used is:

      _2__
     /    \
<(\w+(;\w+)*)>
 \__________/
      1

This matches <word>, <word;another>, <word;another;please>, etc. Group 2 is repeated to capture any number of words, but it can only keep the last capture. The entire list of words is captured by group 1; this string is then split on the semicolon delimiter.
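In newer engines the same idea can probably be written with `String.prototype.matchAll` instead of an exec loop. This is only a sketch, assuming an ES2020-capable runtime; the variable names are illustrative, and the inner group is made non-capturing since its value isn't used.

```javascript
const sample = "a;b;<c;d;e;f>;g;h;i;<no no no>;j;k;<xx;yy;zz>";

// matchAll requires the g flag and yields one match array per <...> list.
const lists = [...sample.matchAll(/<(\w+(?:;\w+)*)>/g)]
  .map((m) => m[1].split(";")); // group 1 holds the whole word list

console.log(lists); // [ [ 'c', 'd', 'e', 'f' ], [ 'xx', 'yy', 'zz' ] ]
```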

Torquay answered 21/8, 2010 at 14:24 Comment(0)
I
7

How's about this? "foo bar baz".match(/(\w+)+/g)

Indoeuropean answered 21/8, 2010 at 14:10 Comment(3)
Your code works, but adding a global flag to my example won't solve the problem: "foo bar baz".match(/^(\s*\w+)+$/g) will return ["foo bar baz"]Romeyn
It will work if you change it to @Jet's regular expression below. "foo bar baz".match(/\w+/g) //=> ["foo", "bar", "baz"]. It ignores the matched string at the front but is still a reasonable alternative.Jamnis
"Getting" all the captured MATCHES isn't the same as all the matching GROUPS. You can have MANY groups INSIDE one match. Consider: "foo bar baz".match(/(\w+)/g) (note inclusion of global flag). Three MATCHES, one group. This is vs. "foo bar baz".match(/(...) (...) (...)/). One match, three GROUPS. The distinction is non-trivial. One can access all three capture groups in a single expression (meaning they can be referenced as single components of the match: $1, $2, $3). The other must be ITERATED to interface with the results: "foo bar baz".match(/(\w+)/g).forEach(m=>console.log(m))Philippopolis
Q
6

Unless you have a more complicated requirement for how you're splitting your strings, you can split them, and then return the initial string with them:

var data = "foo bar baz";
var pieces = data.split(' ');
pieces.unshift(data);
Quincey answered 21/8, 2010 at 14:22 Comment(1)
This ended up being just the piece of advice I needed to wake me up to the fact that, for my current application at least, I didn't need anything more sophisticated than split().Johnsten
O
4

try using 'g':

"foo bar baz".match(/\w+/g)
Outfoot answered 21/8, 2010 at 15:41 Comment(0)
E
0

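You can make the group optional instead of repeating it. So, instead of using + (one or more), try using ? (zero or one; note that after a group this is the optional quantifier, not LAZY matching) and search globally.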

REGEX: (\s*\w+)?

RESULT:

Match 1: foo

Match 2: bar

Match 3: baz
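For reference, here is a sketch of what this pattern returns in JavaScript when run with the g flag. The flag is an assumption on my part; the RESULT listing above reads like regex-tester output. Because the whole pattern is optional, it can also match the empty string at the end of the input.

```javascript
// The optional group matches each word, then an empty string at the end.
const optionalMatches = "foo bar baz".match(/(\s*\w+)?/g);
console.log(optionalMatches); // [ 'foo', ' bar', ' baz', '' ]

// Dropping the ? avoids the trailing empty match.
const plainMatches = "foo bar baz".match(/\s*\w+/g);
console.log(plainMatches); // [ 'foo', ' bar', ' baz' ]
```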

Exsert answered 28/9, 2021 at 10:53 Comment(0)
P
0

I've read through this question and all its answers, and I feel they leave a great deal of ambiguity. So, in the interest of clearing things up:

1. String.prototype.match

I would expect this line of JavaScript:

"foo bar baz".match(/^(\s*\w+)+$/) to return something like: ["foo bar baz", "foo", " bar", " baz"]

In order to get the desired output, you need to explicitly capture all three groups (particularly because you're bracketing the pattern with ^...$, indicating you want the whole string to pass the test in ONE match). String.prototype.match will return that format ([FULL_MATCH, CAP1, CAP2, CAP3, ... CAPn]), its "exec" response (so called because, in this mode, String.prototype.match returns the same result as RegExp.prototype.exec), when an expression matches the target string without the /global flag set. Consider:

// EX.1
"foo bar baz".match(/^(\w+) (\w+) (\w+)$/)

...which yields...

// RES.1
['foo bar baz', 'foo', 'bar', 'baz']

Why? Because you're declaratively capturing 3 distinct groups. Compare this to:

// EX.2 (Note the /global flag and absence of the `^` and `$`)
"foo bar baz".match(/(\w+)/g)

...which yields...

// RES.2
['foo', 'bar', 'baz']

This is because the .match method serves double-duty. In a non-/global match, it will return the match as the first element in the returned array, and each capture group as an additional element in that same array, exactly as RegExp.prototype.exec does. In a /global match, it returns only the full text of every match and drops the capture groups.

The format of the results in Res.2, therefore, is the consequence of the /global flag, which tells the engine, "each time you find a match of this pattern, return it, and resume from the end of that match to see if there are more". Since RegExp.prototype.exec returns only the FIRST occurrence from its starting position, but you've provided a "/go ahead and continue until we run out" flag, you're seeing a collection containing each successive occurrence, without its capture groups.
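To make the two modes concrete, here is a small sketch that reproduces the /global result by hand with an exec loop. It is roughly what match() does for you, not the engine's actual implementation, and the variable names are just for illustration.

```javascript
const str = "foo bar baz";

// Without g: one match, followed by its capture groups.
const single = str.match(/(\w+)/);
console.log(single[0], single[1]); // foo foo

// With g: keep calling exec and collect only the full match each time.
const re = /(\w+)/g;
const collected = [];
let m;
while ((m = re.exec(str)) !== null) {
  collected.push(m[0]); // m[1] (the group) is available here, but match() drops it
}
console.log(collected); // [ 'foo', 'bar', 'baz' ]
```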

2. String.prototype.matchAll

If you DID need the full, "exec match" syntax and wanted the captures for ALL n matches, and you're willing to use the /global flag (and you'll HAVE to for this to work), you need matchAll. It returns a RegExpStringIterator, a specialized sort of collection that REQUIRES the use of the /global. Let's re-run the same query in EX.2 with a matchAll:

// EX.3a
"foo bar baz".matchAll(/(\w+)/g)
// RES.3a
RegExpStringIterator

Because it hands back an Iterator, to actually get our grubby mitts on the data, we'll use the spread syntax (...<ITERATOR>). Since we then need something to spread it INTO, we'll wrap the whole lot in an array literal ([...<ITERATOR>]). We get:

// EX.3b
[..."foo bar baz".matchAll(/(\w+)/g)]
// RES.3b
[
   ['foo', 'foo'], ['bar', 'bar'], ['baz', 'baz']
]

As you can see, 3 matches, each an array of [<MATCH>, <CAPGROUP>].
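If you don't need an array up front, the iterator can also be consumed directly with for...of. A minimal sketch; the `positions` array is only there to show what each match object carries, including the index where it was found.

```javascript
const positions = [];
for (const match of "foo bar baz".matchAll(/(\w+)/g)) {
  const [full, group1] = match; // element 0 is the match, element 1 is the group
  positions.push({ index: match.index, full, group1 });
  console.log(match.index, full, group1);
}
// 0 foo foo
// 4 bar bar
// 8 baz baz
```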

Really, this all boils down to the fact that match is returning, well, matches, NOT captures. If a match happens to BE a capture (or contain multiples) it's helpful enough to break those out for you when you wrap them in parens. Indeed, "foo bar baz".match(/\w+/g) (note presence of /global and absence of capture parens) will still yield ['foo', 'bar', 'baz']. It found 3 matches, you didn't specify you wanted groups, so it exec'd its way into finding them.

ALL of which, I believe, is in large part due to a huge misconception about how RegExp returns results. Namely,

3. A GROUP is not a MATCH

Part of the ambiguity here is the terminology. One can have MULTIPLE capture groups contained within the same match. One cannot have multiple matches in one group. Think of a Venn diagram: three groups contained within a single match.

This may be easier to visualize using the syntax. Say we used the regex:

// EX.4
"foo bar baz".match(/(?<GROUP1>\w+) (?<GROUP2>\w+) (?<GROUP3>\w+)/).groups

The only thing I've changed from Ex.1 is I've assigned a name (the ?<NAME> syntax) to each capture group (the portion of the match defined by the pattern contained within the parentheses - (...)). Because they're named, our response array has a populated .groups property:

// RES.4
[
    "foo bar baz",
    "foo",
    "bar",
    "baz"
],
    groups: {
        "GROUP1": "foo",
        "GROUP2": "bar",
        "GROUP3": "baz"
    }

Because we explicitly named all three capture GROUPS, we can see we have a single MATCH (the array containing the full match and all 3 capture groups' contents), along with our object of named captures.

So, finally, let's try your initial attempt with the extra info:

// EX.5
"foo bar baz".match(/^(?<GROUPn>\s*\w+)+$/)
// RES.5
[
    "foo bar baz",
    " baz"
],
    groups: {
        "GROUPn": " baz"
    }

Wait, what gives?

Because you've only declaratively specified a single capture group (which I've helpfully labelled "GROUPn"), you've provided only one "slot" for the capture to land in.

In short: it's not that your expression isn't capturing all three elements... it's that the "slot" - the variable being used to store that return value as it makes its way to you in the response is being overwritten twice. All of which is to say: One cannot store multiple captures in one capture group (at least in ECMA's RegExp engine).
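If you still want every repetition in one pass, one possible workaround is to walk the string yourself with a sticky (y flag) copy of the repeated piece, so each repetition has to start exactly where the previous one ended. This is a sketch only: `collectRepeats` is a hypothetical helper, and it assumes the piece pattern never needs to match the empty string.

```javascript
// Hypothetical helper: collect each repetition of `piece`, requiring the
// repetitions to be contiguous and to cover the entire input.
function collectRepeats(input, piece) {
  // Rebuild the pattern with the sticky flag (and without g) so exec
  // only matches at lastIndex.
  const sticky = new RegExp(piece.source, piece.flags.replace(/[gy]/g, "") + "y");
  const captures = [];
  let match;
  while ((match = sticky.exec(input)) !== null) {
    if (match[0] === "") break; // guard against zero-length matches
    captures.push(match[0]);
  }
  // The captures are contiguous from index 0, so their total length
  // tells us how much of the input was consumed.
  const consumed = captures.join("").length;
  return consumed === input.length ? captures : null;
}

console.log(collectRepeats("foo bar baz", /\s*\w+/)); // [ 'foo', ' bar', ' baz' ]
console.log(collectRepeats("foo bar !baz", /\s*\w+/)); // null
```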

You can certainly store multiple matches (and if that's all you need, hey, great) but there are times when one cannot iterate the result set before applying it.

Take, as a final example:

// EX.6a
console.log("foo bar baz".replace(/(...) (...) (...)/, "The first word is '$1'.\nThe second word is '$2'.\nThe third word is '$3'."))
// RES.6a
"The first word is 'foo'.
The second word is 'bar'.
The third word is 'baz'."

In this instance, we NEED ALL THREE captures in the SAME match, so we can directly reference each in the same replace operation. If we tried this with your expression:

// EX.6b
console.log("foo bar baz".replace(/^(\s*\w+)+$/, "The first word is '$1'.\nThe second word is '$2'.\nThe third word is '$3'."))

...we end up with the confusingly-inaccurate...

// RES.6b
The first word is ' baz'.
The second word is '$2'.
The third word is '$3'.
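When the number of words isn't known in advance, one way around this is to pass a function to replace() and break the match apart inside it. A sketch under that assumption; the output wording ("Word 1 is ...") is made up for the example and differs from the templates above.

```javascript
// The callback receives the full match; we split it ourselves because
// the repeated group would only hand us the last word.
const report = "foo bar baz".replace(/^(\s*\w+)+$/, (whole) =>
  whole
    .trim()
    .split(/\s+/)
    .map((word, i) => `Word ${i + 1} is '${word}'.`)
    .join("\n")
);

console.log(report);
// Word 1 is 'foo'.
// Word 2 is 'bar'.
// Word 3 is 'baz'.
```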

Hope this helps someone in the future.

Philippopolis answered 24/9, 2023 at 16:34 Comment(0)
