Get index of each capture in a JavaScript regex

Asked 10/4, 2013 at 19:8 Answered 23/3, 2023 at 2:45

I want to match a regex like /(a).(b)(c.)d/ with "aabccde", and get the following information back:

"a" at index = 0
"b" at index = 2
"cc" at index = 3

How can I do this? String.prototype.match returns the list of captures and the index where the complete match starts, but not the index of each capture.

Edit: A test case which wouldn't work with plain indexOf

regex: /(a).(.)/
string: "aaa"
expected result: "a" at 0, "a" at 2

Note: The question is similar to Javascript Regex: How to find index of each subexpression?, but I cannot modify the regex to make every subexpression a capturing group.

Lucrative answered 10/4, 2013 at 19:8 Comment(9)

All of your subexpressions are already within capturing groups. – Smiley 10/4, 2013 at 19:11

@Asad, where? 2 letters are not within capturing groups. – Lucrative 10/4, 2013 at 19:12

If you use global matching, you can get repetitive cases of the captured groups. In that case you need to use the callback function, like shown in the link you have in your question. – Hemelytron 10/4, 2013 at 19:13

@canon please check my edit for a simple test case which won't work with that. – Lucrative 10/4, 2013 at 19:16

There doesn't seem to be any function that returns this information. However, I rarely see any usage for getting the index of the match, maybe except for the case where you want to write a regex tester. – Powers 10/4, 2013 at 20:3

@nhahtdh, for now, I want to wrap captures within html tags (specific to the matched string), like with string "1 + 2" and regex "(\d+)\s*(\+)\s*(\d+)", wrap numbers in "<number></number>", and plus with "<plus></plus>". Is there a better way, without modifying the regex? – Lucrative 15/4, 2013 at 14:7

@user1527166: If you are doing replacement, I think it is possible. You probably should ask a new question for that, though. – Powers 15/4, 2013 at 22:6

@Artur The old answer here showing how to use MultiRegExp2 looks to do what you want (including nested capture groups), did you try it / are you having problems with it? I'm not entirely understanding the purpose of the bounty – Zamir 24/8, 2019 at 22:11

Well, if you looked in the source you could notice that MultiRegExp2 is parsing the regex string. I have really big doubts about reliability of such approach. – Geyserite 26/8, 2019 at 8:18

There is currently a proposal (stage 4) to implement this in native Javascript:

RegExp Match Indices for ECMAScript

ECMAScript RegExp Match Indices provide additional information about the start and end indices of captured substrings relative to the start of the input string.

...We propose the adoption of an additional indices property on the array result (the substrings array) of RegExp.prototype.exec(). This property would itself be an indices array containing a pair of start and end indices for each captured substring. Any unmatched capture groups would be undefined, similar to their corresponding element in the substrings array. In addition, the indices array would itself have a groups property containing the start and end indices for each named capture group.

Here's an example of how things would work. The following snippets run without errors in, at least, Chrome:

const re1 = /a+(?<Z>z)?/d;

// indices are relative to start of the input string:
const s1 = "xaaaz";
const m1 = re1.exec(s1);
console.log(m1.indices[0][0]); // 1
console.log(m1.indices[0][1]); // 5
console.log(s1.slice(...m1.indices[0])); // "aaaz"

console.log(m1.indices[1][0]); // 4
console.log(m1.indices[1][1]); // 5
console.log(s1.slice(...m1.indices[1])); // "z"

console.log(m1.indices.groups["Z"][0]); // 4
console.log(m1.indices.groups["Z"][1]); // 5
console.log(s1.slice(...m1.indices.groups["Z"])); // "z"

// capture groups that are not matched return `undefined`:
const m2 = re1.exec("xaaay");
console.log(m2.indices[1]); // undefined
console.log(m2.indices.groups.Z); // undefined

So, for the code in the question, we could do:

const re = /(a).(b)(c.)d/d;
const str = 'aabccde';
const result = re.exec(str);
// indices[0], like result[0], describes the indices of the full match
const matchStart = result.indices[0][0];
result.forEach((matchedStr, i) => {
  const [startIndex, endIndex] = result.indices[i];
  console.log(`${matchedStr} from index ${startIndex} to ${endIndex} in the original string`);
  console.log(`From index ${startIndex - matchStart} to ${endIndex - matchStart} relative to the match start\n-----`);
});

Output:

aabccd from index 0 to 6 in the original string
From index 0 to 6 relative to the match start
-----
a from index 0 to 1 in the original string
From index 0 to 1 relative to the match start
-----
b from index 2 to 3 in the original string
From index 2 to 3 relative to the match start
-----
cc from index 3 to 5 in the original string
From index 3 to 5 relative to the match start
-----

Keep in mind that the indices array contains the indices of the matched groups relative to the start of the string, not relative to the start of the match.
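
To make that point concrete, here is a small sketch using an input where the match does not start at index 0. The string "xxaabccde" and the variable names are only illustrative assumptions, not part of the question; subtracting the match's own index converts the absolute positions to match-relative ones.

```javascript
// Illustrative input (an assumption): the match starts at index 2, not 0.
const offsetRe = /(a).(b)(c.)d/d;
const offsetStr = "xxaabccde";
const offsetMatch = offsetRe.exec(offsetStr);

// indices are absolute positions in offsetStr.
console.log(offsetMatch.indices); // [[2, 8], [2, 3], [4, 5], [5, 7]]

// Subtract the start of the whole match to get match-relative positions.
// Every group participates in this match, so no entry is undefined here.
const relative = offsetMatch.indices.map(
  ([start, end]) => [start - offsetMatch.index, end - offsetMatch.index]
);
console.log(relative); // [[0, 6], [0, 1], [2, 3], [3, 5]]
```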

A polyfill is available here.

Zamir answered 24/8, 2019 at 22:42 Comment(1)

See my answer below for an example of how to use the new functionality. – Euridice 23/3, 2023 at 2:47

I wrote MultiRegExp for this a while ago. As long as you don't have nested capture groups, it should do the trick. It works by inserting capture groups between those in your RegExp and using all the intermediate groups to calculate the requested group positions.

var exp = new MultiRegExp(/(a).(b)(c.)d/);
exp.exec("aabccde");

should return

{0: {index:0, text:'a'}, 1: {index:2, text:'b'}, 2: {index:3, text:'cc'}}
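
The idea of filling the gaps with extra capture groups can be sketched by hand. This is not MultiRegExp's actual code: the rewritten pattern, the `wanted` list and the `captureIndices` helper are hypothetical names made up for illustration, and the sketch assumes every group participates in the match and that there are no nested or repeated groups.

```javascript
// Hand-rewritten version of /(a).(b)(c.)d/ in which every piece of the
// pattern sits in its own capture group, so the groups tile the whole match.
const filled = /(a)(.)(b)(c.)(d)/;
// Group numbers in `filled` that correspond to the original captures.
const wanted = [1, 3, 4];

// Hypothetical helper: walk the groups left to right, summing their lengths.
function captureIndices(re, wantedGroups, str) {
  const m = re.exec(str);
  if (m === null) return null;
  const out = [];
  let offset = m.index; // start of the whole match
  for (let g = 1; g < m.length; g++) {
    if (wantedGroups.includes(g)) out.push({ index: offset, text: m[g] });
    offset += m[g].length; // assumes group g took part in the match
  }
  return out;
}

const manual = captureIndices(filled, wanted, "aabccde");
console.log(manual);
```

For the question's input this gives "a" at 0, "b" at 2 and "cc" at 3, the same positions as the result shown above.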

Live Version

Elvinaelvira answered 4/11, 2014 at 22:49 Comment(2)

Your object looks good! Though the live version gave an error when I tried a regex of (ba)+.(a*) with text babaaaaa. – Poem 24/8, 2015 at 15:41

Nice catch! This is the intended behavior, but I need to update the error message. We need to have capture groups covering the whole output, so repetitions on capture groups (which only return one of the matches) are not allowed. A quick fix is to add a sub group and change the regexp to /((?:ba)+).(a*)/. I have updated the readme on my git repo to describe this behavior. – Elvinaelvira 31/8, 2015 at 21:47

I created a little regexp Parser which is also able to parse nested groups like a charm. It's small but huge. No really. Like Donalds hands. I would be really happy if someone could test it, so it will be battle tested. It can be found at: https://github.com/valorize/MultiRegExp2

Usage:

let regex = /a(?: )bc(def(ghi)xyz)/g;
let regex2 = new MultiRegExp2(regex);

let matches = regex2.execForAllGroups('ababa bcdefghixyzXXXX');

Will output:
[ { match: 'defghixyz', start: 8, end: 17 },
  { match: 'ghi', start: 11, end: 14 } ]

Alvarez answered 11/2, 2017 at 14:19 Comment(0)

As of 2023, you can do this with match() and the d flag mentioned here. So to solve the original example you would just add a d to the end of the regular expression:

let re = /(a).(b)(c.)d/d
let str = "aabccde"
let match = str.match(re)
console.log(match.indices) // [[0, 6], [0, 1], [2, 3], [3, 5]]

re = /(a).(.)/d
str = "aaa"
match = str.match(re)
console.log(match.indices) // [[0, 3], [0, 1], [2, 3]]

Fiddle here

Note that the first array is the start and end of the entire match. The subgroups come after that.

I would name the groups and then access their indices by name under the groups attribute (match.indices.groups).
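
A minimal sketch of the named-group variant, assuming the group names `first`, `second` and `third` (they are made up for this example and do not appear in the question):

```javascript
// Same pattern as the question, with illustrative names on each group.
const namedRe = /(?<first>a).(?<second>b)(?<third>c.)d/d;
const namedMatch = "aabccde".match(namedRe);

// indices.groups maps each group name to its [start, end] pair.
for (const [name, [start, end]] of Object.entries(namedMatch.indices.groups)) {
  console.log(`${name}: "${namedMatch.groups[name]}" from ${start} to ${end}`);
}
// first: "a" from 0 to 1
// second: "b" from 2 to 3
// third: "cc" from 3 to 5
```

Accessing indices by name keeps the calling code stable if groups are later added to or removed from the pattern.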

Euridice answered 23/3, 2023 at 2:45 Comment(0)

So, you have a text and a regular expression:

var txt = "aabccde";
var re = /(a).(b)(c.)d/;

The first step is to get the list of all substrings that match the regular expression:

var subs = re.exec(txt);

Then, you can do a simple search on the text for each substring. You will have to keep in a variable the position of the last substring. I've named this variable cursor.

var cursor = subs.index;
for (var i = 1; i < subs.length; i++) {
    var sub = subs[i];
    // Search from the cursor so that earlier occurrences are skipped
    var index = txt.indexOf(sub, cursor);
    cursor = index + sub.length;

    console.log(sub + ' at index ' + index);
}

EDIT: Thanks to @nhahtdh, I've improved the mechanism and made a complete function:

String.prototype.matchIndex = function(re){
    var res  = [];
    var subs = this.match(re);

    for (var cursor = subs.index, l = subs.length, i = 1; i < l; i++){
        var index = cursor;

        if (i+1 !== l && subs[i] !== subs[i+1]) {
            var nextIndex = this.indexOf(subs[i+1], cursor);
            while (true) {
                var currentIndex = this.indexOf(subs[i], index);
                if (currentIndex !== -1 && currentIndex <= nextIndex)
                    index = currentIndex + 1;
                else
                    break;
            }
            index--;
        } else {
            index = this.indexOf(subs[i], cursor);
        }
        cursor = index + subs[i].length;

        res.push([subs[i], index]);
    }
    return res;
}


console.log("aabccde".matchIndex(/(a).(b)(c.)d/));
// [ [ 'a', 1 ], [ 'b', 2 ], [ 'cc', 3 ] ]

console.log("aaa".matchIndex(/(a).(.)/));
// [ [ 'a', 0 ], [ 'a', 1 ] ] <-- problem here

console.log("bababaaaaa".matchIndex(/(ba)+.(a*)/));
// [ [ 'ba', 4 ], [ 'aaa', 6 ] ]

Epa answered 14/4, 2013 at 23:39 Comment(8)

This is definitely not the solution for general case. e.g. text = "babaaaaa" and re = /(ba)+.(a*)/ – Powers 15/4, 2013 at 0:4

With your example I get, ba at index 0 aaa at index 3. What is the expected result? – Epa 15/4, 2013 at 0:12

ba should be at index 2, and aaa should be at index 5. baba will be matched by (ba)+, but since the captured part is repeated, only the last instance is captured, and therefore index 2 (it doesn't really matter in this case, but it matters when input is "bbbaba" and regex is /(b+a)+/). aaa is at index 5, because babaa is matched by (ba)+. and the rest aaa are matched by (a*). – Powers 15/4, 2013 at 0:43

re = /((ba))+.(a*)/ it works when the regex capture ba twice. – Epa 15/4, 2013 at 0:57

The point is not modifying the regex to make your solution look good. The point is to make your solution consistent with what the engine actually does inside. It is pretty clear that your solution may or may not work depending on the regex, so one more example doesn't amount to anything. – Powers 15/4, 2013 at 0:58

I've edited my answer. The if statement is for the case you've noticed. – Epa 15/4, 2013 at 1:57

It is still wrong. aaa should be at index 7 (for last test case). (I doubt there is a simple general solution without analyzing the regex). – Powers 15/4, 2013 at 2:2

General suggestion: do not pollute system prototypes with your own code... – Valois 18/11, 2019 at 1:0

Updated Answer: 2022

See String.prototype.matchAll

The matchAll() method matches the string against a regular expression and returns an iterator of matching results.

Each match is an array, with the matched text as the first item, and then one item for each parenthetical capture group. It also includes the extra properties index and input.

let regexp = /t(e)(st(\d?))/g;
let str = 'test1test2';

for (let match of str.matchAll(regexp)) {
  console.log(match)
}

// => ['test1', 'e', 'st1', '1', index: 0, input: 'test1test2', groups: undefined]
// => ['test2', 'e', 'st2', '2', index: 5, input: 'test1test2', groups: undefined]
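
On its own, matchAll() only reports the index of each whole match. As a sketch, it can be combined with the d flag described in the other answers to get the position of every capture in every match; the variable names below are illustrative assumptions.

```javascript
// Same pattern as above, with the d flag added alongside g.
const allRe = /t(e)(st(\d?))/dg;
const allStr = "test1test2";

const found = [];
for (const m of allStr.matchAll(allRe)) {
  // m.indices[0] is the whole match; entries 1..n are the capture groups.
  found.push(m.indices.slice(1));
}
console.log(found);
// [ [[1, 2], [2, 5], [4, 5]], [[6, 7], [7, 10], [9, 10]] ]
```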

Frunze answered 7/3, 2022 at 3:31 Comment(1)

I fail to see how this answers the original question. – Euridice 23/3, 2023 at 1:37

-1

Based on the ECMAScript regular expression syntax, I've written a parser and an extension of the RegExp class that solves this problem (an exec method that reports an index for every group) as well as other limitations of the JavaScript RegExp implementation, for example group-based search and replace. You can test and download the implementation here (it is also available as an NPM module).

The implementation works as follows (small example):

//Retrieve content and position of: opening-, closing tags and body content for: non-nested html-tags.
var pattern = '(<([^ >]+)[^>]*>)([^<]*)(<\\/\\2>)';
var str = '<html><code class="html plain">first</code><div class="content">second</div></html>';
var regex = new Regex(pattern, 'g');
var result = regex.exec(str);

console.log(5 === result.length);
console.log('<code class="html plain">first</code>'=== result[0]);
console.log('<code class="html plain">'=== result[1]);
console.log('first'=== result[3]);
console.log('</code>'=== result[4]);
console.log(5=== result.index.length);
console.log(6=== result.index[0]);
console.log(6=== result.index[1]);
console.log(31=== result.index[3]);
console.log(36=== result.index[4]);

I also tried the implementation from @velop, but it seems buggy. For example, it does not handle backreferences correctly, e.g. "/a(?: )bc(def(\1ghi)xyz)/g": when parentheses are added in front, the backreference \1 needs to be incremented accordingly (which is not the case in his implementation).

Amblyopia answered 20/4, 2017 at 6:49 Comment(0)

-4

I'm not sure exactly what your requirements are for your search, but here's how you could get the desired output in your first example using RegExp.prototype.exec() and a while loop.

JavaScript

var myRe = /^a|b|c./g;
var str = "aabccde";
var myArray;
while ((myArray = myRe.exec(str)) !== null)
{
  var msg = '"' + myArray[0] + '" ';
  msg += "at index = " + (myRe.lastIndex - myArray[0].length);
  console.log(msg);
}

Output

"a" at index = 0
"b" at index = 2
"cc" at index = 3

Using the lastIndex property, you can subtract the length of the currently matched string to obtain the starting index.
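
As a side note, the result of exec() also carries an index property holding the same start position, so the subtraction is not strictly needed. A small sketch comparing the two values (variable names are illustrative):

```javascript
const loopRe = /^a|b|c./g;
const loopStr = "aabccde";

const starts = [];
let hit;
while ((hit = loopRe.exec(loopStr)) !== null) {
  // hit.index is the start of this match; it equals lastIndex minus the match length.
  starts.push([hit[0], hit.index, loopRe.lastIndex - hit[0].length]);
}
console.log(starts); // [["a", 0, 0], ["b", 2, 2], ["cc", 3, 3]]
```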

Forequarter answered 15/4, 2013 at 3:10 Comment(3)

This is a totally wrong approach. Take the input "baaccde" for example. It does not match OP's original regex, but your regex will match it. – Powers 15/4, 2013 at 3:27

To be honest, the example is completely contrived. All it basically asks for is, given the string "aabccde", what are the indices of the first "a", "b" and "cc"? This answer is merely to show a way to get the indices of the matches. You could easily check to make sure that the string matches before getting the indices, but I'll try to improve my answer. – Forequarter 15/4, 2013 at 3:41

Take a look at OP's second test case. – Powers 15/4, 2013 at 4:23
