I've read through this question and all its answers, and I feel they leave a great deal of ambiguity. So, in the interest of clearing things up:
1. String.prototype.match
I would expect this line of JavaScript:
"foo bar baz".match(/^(\s*\w+)+$/)
to return something like:
["foo bar baz", "foo", " bar", " baz"]
To get the desired output, you need to explicitly capture all three groups (particularly because you're bracketing the pattern with ^...$, indicating you want the whole string to pass the test in ONE match). String.prototype.match returns that format ([FULL_MATCH, CAP1, CAP2, CAP3, ... CAPn]), its "exec" response (so called because a non-global String.prototype.match returns the same result as RegExp.prototype.exec), when an expression matches the target string without the /g (global) flag set. Consider:
// EX.1
"foo bar baz".match(/^(\w+) (\w+) (\w+)$/)
...which yields...
// RES.1
['foo bar baz', 'foo', 'bar', 'baz']
Why? Because you're declaratively capturing 3 distinct groups.
Compare this to:
// EX.2 (note the /g (global) flag and the absence of `^` and `$`)
"foo bar baz".match(/(\w+)/g)
...which yields...
// RES.2
['foo', 'bar', 'baz']
This is because the .match method serves double duty. In a non-global match (no /g flag), it returns the full match as the first element of the returned array, and each capture group as an additional element in that same array. That is exactly what RegExp.prototype.exec returns, because without /g, match() simply hands back the exec result.
The format of the results in RES.2, therefore, is the consequence of the /g (global) flag, which tells the engine, "each time you find a match of this pattern, return it, and resume from the end of that match to see if there are more". RegExp.prototype.exec returns only the FIRST occurrence, but since you've provided the /g ("go ahead and continue until we run out") flag, you get a collection of every full match instead, with the capture groups left out.
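To see the two return shapes side by side, here is a minimal sketch. The pattern /(\w)(\w+)/ and the variable names are my own, chosen only so the captures differ visibly from the full match. It also shows the manual exec loop you would need if you wanted captures for every match under /g.

```javascript
const text = "foo bar baz";

// Without /g: a single "exec"-style result (full match, then each capture).
const single = text.match(/(\w)(\w+)/);

// With /g: every full match, and no capture groups at all.
const allMatches = text.match(/(\w)(\w+)/g);

// Calling exec repeatedly on a /g regex walks the string one match at a
// time and keeps the captures for each one.
const re = /(\w)(\w+)/g;
const viaExec = [];
let m;
while ((m = re.exec(text)) !== null) {
  viaExec.push([m[0], m[1], m[2]]);
}

console.log(single[0], single[1], single[2]); // foo f oo
console.log(allMatches);                      // [ 'foo', 'bar', 'baz' ]
console.log(viaExec.length);                  // 3
```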
2. String.prototype.matchAll
If you DID need the full "exec match" format and wanted the captures for ALL n matches, and you're willing to use the /g (global) flag (and you'll HAVE to for this to work), you need matchAll. It returns a RegExpStringIterator, a specialized iterator, and it REQUIRES the /g flag: calling it with a non-global regex throws a TypeError. Let's re-run the same query from EX.2 with matchAll:
// EX.3a
"foo bar baz".matchAll(/(\w+)/g)
// RES.3a
RegExpStringIterator
Because it hands back an iterator, to actually get our grubby mitts on the data we'll use the spread syntax (...<ITERATOR>). Since we then need something to spread it INTO, we'll wrap the whole lot in an array literal ([...<ITERATOR>]). We get:
// EX.3b
[..."foo bar baz".matchAll(/(\w+)/g)]
// RES.3b
[
['foo', 'foo'], ['bar', 'bar'], ['baz', 'baz']
]
As you can see, 3 matches, each an array of [<MATCH>, <CAPGROUP>].
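If you would rather not build the whole array up front, the iterator can be consumed directly. The following is a sketch under my own naming (sentence, found, threw are illustrative, and the two-group pattern is mine); it also demonstrates the TypeError you get when the /g flag is missing.

```javascript
const sentence = "foo bar baz";

// Each item yielded by matchAll is a full exec-style match array,
// including the index at which the match started.
const found = [];
for (const m of sentence.matchAll(/(\w)(\w+)/g)) {
  found.push({ match: m[0], first: m[1], rest: m[2], index: m.index });
}

// Without the g flag, matchAll refuses to run at all.
let threw = false;
try {
  sentence.matchAll(/(\w+)/);
} catch (err) {
  threw = err instanceof TypeError;
}

console.log(found.map((f) => f.match)); // [ 'foo', 'bar', 'baz' ]
console.log(found.map((f) => f.index)); // [ 0, 4, 8 ]
console.log(threw);                     // true
```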
Really, this all boils down to the fact that match returns, well, matches, NOT captures. If a single match contains one or more captures, it's helpful enough to break those out for you when you wrap them in parens and leave off /g. Indeed, "foo bar baz".match(/\w+/g) (note the presence of /g and the absence of capture parens) still yields ['foo', 'bar', 'baz']. It found 3 matches; you didn't ask for groups, so it exec'd its way through the string and collected each match.
ALL of which, I believe, is in large part due to a huge misconception about how RegExp returns results. Namely,
3. A GROUP is not a MATCH
Part of the ambiguity here is the terminology. One can have MULTIPLE capture groups contained within the same match. One cannot have multiple matches in one group. Think of a Venn diagram in which every group sits inside a match.
This may be easier to visualize using the syntax. Say we used the regex:
// EX.4
"foo bar baz".match(/(?<GROUP1>\w+) (?<GROUP2>\w+) (?<GROUP3>\w+)/)
The only substantive change from EX.1 is that I've assigned a name (the ?<NAME> syntax) to each capture group (the portion of the match defined by the pattern inside the parentheses, (...)). Because they're named, our response array has an additional property, .groups:
// RES.4
[
"foo bar baz",
"foo",
"bar",
"baz"
],
groups: {
"GROUP1": "foo",
"GROUP2": "bar",
"GROUP3": "baz"
}
Because we explicitly named all three capture GROUPS, we can see we have a single MATCH (the array containing the full match and all 3 capture groups' contents), along with our object of named captures.
So, finally, let's try your initial attempt with the extra info:
// EX.5
"foo bar baz".match(/^(?<GROUPn>\s*\w+)+$/)
// RES.5
[
"foo bar baz",
" baz"
],
groups: {
"GROUPn": " baz"
}
Wait, what gives?
Because you've only declaratively specified a single capture group (which I've helpfully labelled "GROUPn"), you've provided only one "slot" for the capture to land in.
In short: it's not that your expression isn't capturing all three elements. It's that the "slot" (the variable used to store that capture on its way to you in the response) is being overwritten twice, so only the last value survives. All of which is to say: one cannot store multiple captures in one capture group (at least in ECMAScript's RegExp engine).
You can certainly store multiple matches (and if that's all you need, hey, great) but there are times when one cannot iterate the result set before applying it.
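If what you actually want is every repetition of the group, one common workaround is to split the job in two: validate the whole string first, then extract the pieces with a global pattern. The sketch below assumes that is acceptable for your case. captureAllWords is a hypothetical helper of mine, and the validation pattern is my own rewrite of /^(\s*\w+)+$/ that I believe accepts the same strings while avoiding the nested quantifier.

```javascript
// Hypothetical helper: returns every word if the whole string is made of
// whitespace-separated words, otherwise null.
function captureAllWords(input) {
  // Step 1: validate the whole string in ONE anchored match.
  // (Assumed equivalent to /^(\s*\w+)+$/, without nested quantifiers.)
  if (!/^\s*\w+(?:\s+\w+)*$/.test(input)) {
    return null;
  }
  // Step 2: collect each repetition as its own match, keeping capture 1.
  return Array.from(input.matchAll(/\s*(\w+)/g), (m) => m[1]);
}

console.log(captureAllWords("foo bar baz")); // [ 'foo', 'bar', 'baz' ]
console.log(captureAllWords("foo bar!"));    // null
```

The trade-off is two passes over the string instead of one, in exchange for getting every repetition rather than only the last.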
Take, as a final example:
// EX.6a
console.log("foo bar baz".replace(/(...) (...) (...)/, "The first word is '$1'.\nThe second word is '$2'.\nThe third word is '$3'."))
// RES.6a
"The first word is 'foo'.
The second word is 'bar'.
The third word is 'baz'."
In this instance, we NEED ALL THREE captures in the SAME match, so we can directly reference each in the same replace operation. If we tried this with your expression:
// EX.6b
console.log("foo bar baz".replace(/^(\s*\w+)+$/, "The first word is '$1'.\nThe second word is '$2'.\nThe third word is '$3'."))
...we end up with the confusingly-inaccurate...
// RES.6b
The first word is ' baz'.
The second word is '$2'.
The third word is '$3'.
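For completeness, the named groups from section 3 can be referenced in a replacement string too, via $<NAME>. This is only a sketch; the group names first, second and third are mine, not from the question.

```javascript
// Three distinct named groups in ONE match, each referenced by name.
const pattern = /^(?<first>\w+) (?<second>\w+) (?<third>\w+)$/;
const template =
  "The first word is '$<first>'.\n" +
  "The second word is '$<second>'.\n" +
  "The third word is '$<third>'.";

const replaced = "foo bar baz".replace(pattern, template);
console.log(replaced);
```

Each placeholder resolves only because its group exists in the pattern, which is the same reason $2 and $3 were left as literal text in RES.6b.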
Hope this helps someone in the future.