String.replaceAll(regex) makes the same replacement twice

Can anyone tell me why

System.out.println("test".replaceAll(".*", "a"));

Results in

aa

Note that the following has the same result:

System.out.println("test".replaceAll(".*$", "a"));

I have tested this on Java 6 and 7, and both behave the same way. Am I missing something, or is this a bug in the Java regex engine?

Tammara answered 22/12, 2011 at 13:7 Comment(0)

This is not an anomaly: .* can match anything.

You ask to replace all occurrences:

  • the first occurrence matches the whole string, so the regex engine starts looking for the next match at the end of the input;
  • but .* also matches an empty string! It therefore matches an empty string at the end of the input and replaces it with a.
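The two matches described above can be made visible by walking the matcher by hand. This is a minimal sketch; the class name MatchSpans and the helper spans are made up for illustration, and only the standard java.util.regex API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpans {
    // Hypothetical helper: returns every match that find() reports,
    // formatted as "[start,end) 'matched text'".
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add("[" + m.start() + "," + m.end() + ") '" + m.group() + "'");
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints: [[0,4) 'test', [4,4) '']
        System.out.println(spans(".*", "test"));
    }
}
```

The second entry is the empty match at the end of the input; replaceAll replaces it as well, which is where the second a comes from.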

Using .+ instead will not exhibit this problem since this regex cannot match an empty string (it requires at least one character to match).

Or, use .replaceFirst() to only replace the first occurrence:

"test".replaceFirst(".*", "a")
       ^^^^^^^^^^^^
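The two fixes above can be compared side by side with the original call. This is only a sketch; the class and method names are invented for the example. It also shows one difference worth knowing: on an empty input, .+ finds nothing to replace, while .* still matches once.

```java
class EmptyMatchFixes {
    // Original call from the question: replaces the full match and the trailing empty match.
    static String replaceAllStar(String s) {
        return s.replaceAll(".*", "a");
    }

    // Fix 1: .+ needs at least one character, so the trailing empty match is impossible.
    static String replaceAllPlus(String s) {
        return s.replaceAll(".+", "a");
    }

    // Fix 2: replaceFirst stops after the first match.
    static String replaceFirstStar(String s) {
        return s.replaceFirst(".*", "a");
    }

    public static void main(String[] args) {
        System.out.println(replaceAllStar("test"));   // aa
        System.out.println(replaceAllPlus("test"));   // a
        System.out.println(replaceFirstStar("test")); // a
        // Empty input: .+ cannot match, so the string stays empty.
        System.out.println("[" + replaceAllPlus("") + "]");   // []
        System.out.println("[" + replaceFirstStar("") + "]"); // [a]
    }
}
```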

Now, why .* behaves like it does and does not match more than twice (it theoretically could) is an interesting thing to consider. See below:

# Before first run
regex: |.*
input: |whatever
# After first run
regex: .*|
input: whatever|
# Before second run
regex: |.*
input: whatever|
# After second run: since .* can match an empty string, it is satisfied...
regex: .*|
input: whatever|
# However, this means the regex engine matched an empty input.
# Most regex engines, in this situation, will shift
# one character further in the input.
# So, before third run, the situation is:
regex: |.*
input: whatever<|ExhaustionOfInput>
# Nothing can ever match here: out
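The runs traced above can be reproduced with a hand-written replace loop. This is a sketch of roughly what replaceAll does, written with the public Matcher methods find, appendReplacement and appendTail; it is not the JDK's actual source, and the names ReplaceAllTrace and replaceAllVerbose are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceAllTrace {
    // Hypothetical helper: behaves like input.replaceAll(regex, replacement)
    // but logs every match the engine finds along the way.
    static String replaceAllVerbose(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer(); // StringBuffer works on every Java version
        int run = 0;
        while (m.find()) {
            run++;
            System.out.println("run " + run + ": matched '" + m.group()
                    + "' at [" + m.start() + "," + m.end() + ")");
            m.appendReplacement(sb, replacement);
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        // run 1: matched 'whatever' at [0,8)
        // run 2: matched '' at [8,8)
        // aa
        System.out.println(replaceAllVerbose(".*", "whatever", "a"));
    }
}
```

There is no third run: after the empty match at the end of the input, the next search would have to start past the end, so find() returns false.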

Note that, as @A.H. notes in the comments, not all regex engines behave this way. GNU sed for instance will consider that it has exhausted the input after the first match.

Omora answered 22/12, 2011 at 13:10 Comment(6)
Agreed. This is true for Perl too. perl -le '$x = "test"; $x =~ s/.*/a/g; print $x' yields "aa". – Such
@ChrisDolan: sed yields only a, but I doubt it's a bug. :-) – Taster
@Taster yes indeed... I need to read "Mastering Regular Expressions" again. – Omora
Thanks for the feedback. I've been using regexes for a long time but never ran into this one. Learn something new every day... – Tammara
Another way to solve this: use ^.* instead. It will only match once, since ^ can only match at the start of the input. – Peyote
Is this [buggy] behavior still the same after this fix? bugs.openjdk.org/browse/JDK-7189363 – Dudgeon

The accepted answer hasn't shown this yet, so here is an alternative way to fix your regex:

System.out.println("test".replaceAll("^.*$", "a"));

Note that I'm using both anchors: ^ and $. The $ isn't strictly necessary for this particular case, but I find that adding both is the least cryptic.
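A quick check of the anchored variants, as a sketch with made-up class and method names. It also includes the .*$ pattern from the question, to show that $ alone does not help: the empty match at the end of the input still satisfies it.

```java
class AnchoredReplace {
    // Hypothetical helper: applies replaceAll with the given regex and the replacement "a".
    static String replace(String regex, String input) {
        return input.replaceAll(regex, "a");
    }

    public static void main(String[] args) {
        System.out.println(replace("^.*$", "test")); // a
        System.out.println(replace("^.*", "test"));  // a  (^ cannot match again at the end)
        System.out.println(replace(".*$", "test"));  // aa ($ also matches at the empty tail)
    }
}
```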

Ambiguous answered 30/3, 2022 at 9:0 Comment(0)
