Java replaceAll with backreferences [duplicate]

Asked 24/1, 2013 at 23:4 Answered 24/1, 2013 at 23:10

Possible Duplicate:
String.replaceAll() anomaly with greedy quantifiers in regex

I was writing code that uses Matcher#replaceAll and found the following result highly confusing:

Pattern.compile("(.*)").matcher("sample").replaceAll("$1abc");

Now, I would expect the output to be sampleabc, but Java gives me sampleabcabc.

Does anybody have any ideas why?

Now, sure, when I anchor the pattern (^(.*)$) the issue goes away. Still, I don't know why the hell replaceAll would do a double replacement like that.

And to add insult to injury, the following code:

Pattern.compile("(.*)").matcher("sample").replaceFirst("$1abc")

works as expected, returning just sampleabc.

Insulin answered 24/1, 2013 at 23:4 Comment(1)

@Pshemo: you are right. I'm sorry I failed to find this prior submission. – Insulin 24/1, 2013 at 23:49

It looks like it's matching the empty string at the end of the input, for some reason. (I can see why it would match; I'm intrigued that it matches once and only once.)

If you change replaceAll("$1abc") to replaceAll("'$1'abc") the result is 'sample'abc''abc.

Note that if you change (.*) to (.+) then it works correctly, because it has to match at least one character.
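A minimal sketch of that quoting trick, for anyone who wants to run it. The class and method names (QuotedReplacementDemo, quoteMatches) are made up for illustration; only Pattern and Matcher.replaceAll come from the JDK.

```java
import java.util.regex.Pattern;

public class QuotedReplacementDemo {
    // Hypothetical helper: wraps group 1 in single quotes so that
    // an empty match shows up as '' in the result.
    static String quoteMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("'$1'abc");
    }

    public static void main(String[] args) {
        // (.*) matches "sample" and then the empty string at the end.
        System.out.println(quoteMatches("(.*)", "sample")); // 'sample'abc''abc
        // (.+) needs at least one character, so there is no second match.
        System.out.println(quoteMatches("(.+)", "sample")); // 'sample'abc
    }
}
```

The pair of empty quotes in the first result is the second, empty match.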

The diagnosis is confirmed by this code:

Matcher matcher = Pattern.compile("(.*)").matcher("sample");

while (matcher.find()) {
    System.out.printf("%d to %d\r\n", 
                      matcher.start(), 
                      matcher.end());
}

... which outputs:

0 to 6
6 to 6

Unhorse answered 24/1, 2013 at 23:9 Comment(3)

+1 first to answer and to give solution... – Nightlong 24/1, 2013 at 23:15

I could not find where it is documented that you can refer to a capturing group result with $1, $2, etc. in the "replace with" string. The Pattern javadoc says that you refer to them with '\1', '\2', etc. – Itemize 2/11, 2015 at 15:47

@George: That's because you're looking in Pattern javadoc rather than in Matcher.replaceAll which states: "Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string." See the docs for appendReplacement for the "as described above" part. – Unhorse 2/11, 2015 at 16:41

There are two things going on here that explain why this happens:

(.*) will successfully match empty strings.
After a match succeeds, the next match is attempted starting at the index where the previous match ended, even if that index is the end of the input.

So, after the entire string "sample" is matched, another match is attempted just after the e. Even though there are no characters left, the match succeeds and a second replacement occurs.

Additional replacements do not occur because the regex engine always moves forward. The position just after the last character is a valid starting index, so the empty string matches there once. After that empty match the engine has to advance past it, and there are no more valid starting positions left to attempt a match from.
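The explanation above can be reproduced by hand. The Matcher.replaceAll javadoc describes the method in terms of find(), appendReplacement and appendTail, so a rough equivalent looks like the sketch below. The class and method names (ManualReplaceAllDemo, manualReplaceAll, matchSpans) are hypothetical, and this is an approximation of the behavior, not the JDK source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ManualReplaceAllDemo {
    // Hypothetical helper: roughly what replaceAll does, spelled out as a loop.
    static String manualReplaceAll(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Each successful find() triggers one replacement.
            m.appendReplacement(sb, replacement);
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    // Hypothetical helper: records every match as "start-end".
    static List<String> matchSpans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> spans = new ArrayList<>();
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        System.out.println(matchSpans("(.*)", "sample"));                // [0-6, 6-6]
        System.out.println(manualReplaceAll("(.*)", "sample", "$1abc")); // sampleabcabc
    }
}
```

The loop runs twice: once for "sample" (0 to 6) and once for the empty string at index 6, which is where the second abc comes from.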

As an alternative to adding a beginning-of-string anchor (^) to your regex, you can modify your regex so it matches one or more characters by changing (.*) to (.+).
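For comparison, here is a small sketch that runs the original pattern and both fixes side by side. PatternAlternativesDemo and appendAbc are illustrative names, not anything from the question.

```java
import java.util.regex.Pattern;

public class PatternAlternativesDemo {
    // Hypothetical helper: applies the question's replacement with a given regex.
    static String appendAbc(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("$1abc");
    }

    public static void main(String[] args) {
        System.out.println(appendAbc("(.*)", "sample"));   // sampleabcabc
        System.out.println(appendAbc("^(.*)$", "sample")); // sampleabc
        System.out.println(appendAbc("(.+)", "sample"));   // sampleabc
    }
}
```

The two fixes are not interchangeable on empty input: the anchored pattern still matches the empty string and returns abc, while (.+) finds nothing to replace and returns the empty string unchanged.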

Publias answered 24/1, 2013 at 23:10 Comment(0)
