Why does Java regex engine throw StringIndexOutOfBoundsException on a + repetition?
Asked Answered
M

1

15

I've written a regex pattern to find Fibonacci numbers (it doesn't matter why, I just did). It works wonderfully as expected (see on ideone.com):

    String FIBONACCI = 
        "(?x) .{0,2} | (?: (?=(\\2?)) (?=(\\2\\3|^.)) (?=(\\1)) \\2)++ . ";

    for (int n = 0; n < 1000; n++) {
        String s = new String(new char[n]);
        if (s.matches(FIBONACCI)) {
            System.out.print(n + " ");
        }
    } // 0 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 

A possessive repetition (i.e. ++ on the main "loop") is crucial, because you don't want backtracking with this matching algorithm. However, making the repetition backtrackable (i.e. just + on the main "loop") results not in mismatches, but rather a runtime exception!!! (as seen on ideone.com):

Exception in thread "main" java.lang.StringIndexOutOfBoundsException:
    String index out of range: -1

    at java.lang.String.charAt(String.java:686)
    at java.lang.Character.codePointAt(Character.java:2335)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3344)
    at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:3994)
    at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:3966)
    at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3916)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Matcher.match(Matcher.java:1127)
    at java.util.regex.Matcher.matches(Matcher.java:502)
    at java.util.regex.Pattern.matches(Pattern.java:930)
    at java.lang.String.matches(String.java:2090)

Can someone explain what happened here? Is this a bug in the Java regex engine?

Micro answered 13/9, 2010 at 7:10 Comment(8)
I bet it is because the regex engine is just plain tired of being asked to do silly things :-).Filiate
Finding fibonacci numbers with regex? poly, you're crazy!Borg
@Stephen: there's a quote like "if you don't know at least N ways to break something, you haven't really understood it yet", so that's all I'm trying to do is learn. @Andreas: Hey, the ++ version works! Now I just need to understand why the + raises an exception, and how to port this algorithm to .NET which doesn't have possessive repetition.Micro
now that's a messed up regular exception... :-) still trying to get my head around your regexp...Lollapalooza
@aioobe: once I fully understood this issue, I plan to fully explain how the algorithm works as an example of nested reference and possessive quantifier usage.Micro
Looks like a bug. If it helps, it seems to pass 4 and fail on 5 (which should match) - ideone.com/EXStT . I was also bored lately and did something related with a balancing group, but that isn't quite as difficult :)Bookplate
For what it's worth, even a backtrackable repetition in .NET works (ideone.com/yB16g). I thought for sure backtracking will cause mismatches. I guess even I don't fully understand my own algorithm =).Micro
@poly - there's another saying: "Too much knowledge is a dangerous thing".Filiate
M
11

Bug ID 6984178

There are many bugs related to the engine throwing StringIndexOutOfBoundsException (see: search results. This one in particular has been reported and internally accepted as Bug ID 6984178 (it may take a while for this to show up in the external database).

Here's a much simpler pattern that reproduces the bug (see also on ideone.com):

System.out.println(
   "abaab".matches("(?x) (?: (?=(a+)) \\1 b )* x")
); // StringIndexOutOfBounds: -1

Note that using *? or *+ simply returns false as expected.

It looks like the problem is triggered by the attempt to backtrack a greedy repetition when there's a reference to a capturing group inside a lookahead: the out of bounds index is the difference in length between the first and the second a+ (e.g. "aabaaaaab" gets -3).

One would likely have to debug the java.util.regex.Pattern source code to pinpoint the exact nature of the bug.


Exploring the Fibonacci pattern

On the Java engine, with greedy backtracking +

Here's a more verbose snippet to show just how bonkers the engine gets on this pattern:

String FIBONACCI = 
    "(?x) .{0,2} | (?: (?=(\\2|^)) (?=(\\2\\3|^.)) (?=(\\1)) \\2)+ . ";

for (int n = 0; n < 1000; n++) {
    String s = new String(new char[n]);
    try {
        if (s.matches(FIBONACCI)) {
            System.out.printf("%n%s", n);
        }
    } catch (StringIndexOutOfBoundsException e) {
        String index = e.getMessage().replace("String index out of range: ", "");
        System.out.printf(" <%s:%s>", n, index);
    }
}

The (slightly edited) output is (as seen on ideone.com):

0 1 2 3 <5:-1>
6 <7:-1> ... <12:-1> <13:-3>
14 <15:-3> ... <33:-3> <34:-8>
35 <36:-8> ... <88:-8> <89:-21>
90 <91:-21> ... <232:-21> <233:-55>
234 <235:-55> ... <609:-55> <610:-144>
611 <612:-144> ...

So somehow the engine tries to access string indices at -1, -3, -8, -21, -55, -144, etc. Note that these are every other Fibonacci number, but negative. Note also that beyond the first few numbers, the rest of the matches (6, 14, 35, ...) are NOT Fibonacci numbers.


On the .NET engine, with greedy backtracking +

While the pattern was originally written with the necessity for possessive quantifier in mind, in fact a backtracking repetition will still yield the correct answer (assuming the engine isn't buggy like Java's). Here's a C# implementation on the .NET engine (see also on ideone.com):

Regex r = new Regex(
  @"(?x) ^.{0,1}$ | ^(?: (?=(\2?)) (?=(\2\3|^.)) (?=(\1)) \2)+ . $ "
);

for (int n = 0; n < 1000; n++) {
  if (r.IsMatch("".PadLeft(n))) {
    Console.Write("{0} ", n);
  }
}
// 0 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 

As you can see, the output is correct, even with a backtracking + "loop". In fact, precisely because it's a backtracking loop, the special case can be limited to just .{0,1} instead of .{0,2}.


On the Java engine, with reluctant backtracking +?

This works in Java as expected. Also, because it's reluctant, we can also limit the special case to just .{0,1} (see also on ideone.com):

String FIBONACCI = 
        "(?x) .{0,1} | (?: (?=(\\2|^)) (?=(\\2\\3|^.)) (?=(\\1)) \\2)+? . ";

On the algorithm

The formula

The pattern exploits the Second Identity of Fibonacci Numbers:

alt text

This can be proven by induction.

The pattern

Let's use this version of the pattern (which works in Java, and when anchored, also works in C#):

                                                     summation 
free-space!                                            "loop"
    ↓                                                     ↓
   (?x) .{0,1} | (?: (?=(\2|^)) (?=(\2\3|^.)) (?=(\1)) \2)+? .
        \____/   \___________________________________/  ↑    ↑  
      base case      inductive case                    +Fi  +1
     for n = 0,1
                     (assertions don't "count" toward sum)!
                        $1 := $2    (or initialized with 0)
                        $2 := $2+$3 (or initialized with 1)
                        $3 := $1    (a temporary variable for the "swap")

The Fibonacci sequence generation is straightforward, based on the [$1, $2] := [$2, $2+$1] transition. Since the assertions are performed sequentially, a temporary variable is introduced (just like the single-assignment "pseudocode" version).

Micro answered 13/9, 2010 at 9:56 Comment(1)
FYI, most of those bug reports concern malformed regexes or replacement strings and how they're handled. The only bug is that it should throw a PatternSyntaxException (regex) or an IllegalArgumentException (replacement), not a SIOOBE. I've tried to get them to change that, to no avail.Propene

© 2022 - 2024 — McMap. All rights reserved.