Backreferences in lookbehind

Can you use backreferences in a lookbehind?

Let's say I want to split wherever behind me a character is repeated twice.

    String REGEX1 = "(?<=(.)\\1)"; // DOESN'T WORK!
    String REGEX2 = "(?<=(?=(.)\\1)..)"; // WORKS!

    System.out.println(java.util.Arrays.toString(
        "Bazooka killed the poor aardvark (yummy!)"
        .split(REGEX2)
    )); // prints "[Bazoo, ka kill, ed the poo, r aa, rdvark (yumm, y!)]"

Using REGEX2 (where the backreference is in a lookahead nested inside a lookbehind) works, but REGEX1 throws a PatternSyntaxException at run time:

Look-behind group does not have an obvious maximum length near index 8
(?<=(.)\1)
        ^

This sort of makes sense, I suppose, because in general a backreference can capture a string of any length. (A slightly smarter regex compiler could determine that \1 refers to (.) in this case, and therefore has a finite length.)
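A small sketch for checking how a given JDK treats each pattern without crashing the program. The class and method names (LookbehindCheck, compileError) are made up for this illustration; only Pattern.compile and PatternSyntaxException come from java.util.regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class LookbehindCheck {
    /** Returns null if the regex compiles, otherwise the compiler's description of the problem. */
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            // getDescription() is the message without the index and pattern echo
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // Backreference directly inside the lookbehind (REGEX1 in the question)
        System.out.println(compileError("(?<=(.)\\1)"));
        // Backreference hidden in a nested lookahead (REGEX2 in the question)
        System.out.println(compileError("(?<=(?=(.)\\1)..)"));
    }
}
```

Because the error is raised by Pattern.compile, it shows up as soon as the pattern is built, before any input is matched.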

So is there a way to use a backreference in a lookbehind?

And if there isn't, can you always work around it using this nested lookahead? Are there other commonly-used techniques?
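One other technique, sketched here as an illustration rather than something proposed in the thread: drop the lookbehind entirely, scan with the plain lookahead (?=(.)\1) using Matcher.find(), and cut two characters past the start of each match. The names SplitAfterDoubled and splitAfterDoubled are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitAfterDoubled {
    // Zero-width match at every position where a character is followed by itself
    private static final Pattern DOUBLED = Pattern.compile("(?=(.)\\1)");

    static List<String> splitAfterDoubled(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = DOUBLED.matcher(input);
        int last = 0;
        while (m.find()) {
            // The doubled pair occupies [start, start + 2), so cut right after it
            int cut = m.start() + 2;
            parts.add(input.substring(last, cut));
            last = cut;
        }
        // Keep the remainder, but drop a trailing empty piece as String.split does
        if (last < input.length()) {
            parts.add(input.substring(last));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitAfterDoubled("Bazooka killed the poor aardvark (yummy!)"));
        // prints "[Bazoo, ka kill, ed the poo, r aa, rdvark (yumm, y!)]"
    }
}
```

Since the lookahead consumes nothing, find() advances one position at a time, so overlapping pairs such as "aaa" are cut after both the second and the third character, matching what the lookbehind split would do.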

Dapple answered 29/4, 2010 at 5:34 Comment(3)
Interesting, and +1 for your ingenious workaround. I don't use Java, so I can't try it myself - what happens if the backreferenced group is outside the lookaround, like (?<=\\1)(.)? - Brahms
@Tim: it results in essentially the same PatternSyntaxException. By the way, if anybody wants to play around with a variant of this problem, I just authored one on CodingBat: codingbat.com/prob/p266235 - Dapple
@Dapple I wish I could upvote this regex, (?<=(?=(.)\\1)..), at least 10 times. Very elegant! - Beal

Looks like your suspicion is correct that backreferences generally can't be used in Java lookbehinds. The workaround you proposed makes the finite length of the lookbehind explicit and looks very clever to me.
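To illustrate what making the length explicit amounts to, here is a hedged sketch that generalizes the nested-lookahead trick to a repeated block of k characters. It assumes the captured group has a known fixed width; the helper name splitAfterRepeat and the parameter k are invented for this example and are not from the question.

```java
import java.util.Arrays;

class FixedWidthRepeat {
    /**
     * Splits after every position preceded by a k-character block that occurs
     * twice in a row. The lookbehind body is ".{2k}", whose length the compiler
     * can see; the backreference lives in the nested lookahead.
     */
    static String[] splitAfterRepeat(String input, int k) {
        String regex = "(?<=(?=(.{" + k + "})\\1).{" + (2 * k) + "})";
        return input.split(regex);
    }

    public static void main(String[] args) {
        // k = 1 is equivalent to REGEX2 from the question
        System.out.println(Arrays.toString(
            splitAfterRepeat("Bazooka killed the poor aardvark (yummy!)", 1)));
        // prints "[Bazoo, ka kill, ed the poo, r aa, rdvark (yumm, y!)]"

        // k = 2 splits after doubled two-character blocks such as "abab"
        System.out.println(Arrays.toString(splitAfterRepeat("abab-cdcd-x", 2)));
    }
}
```

This only covers groups whose width is known when the pattern is written; a group of variable width would need one alternative per possible length, or a different approach altogether.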

I was intrigued to find out what Python does with this regex. Python supports only fixed-length lookbehind, not bounded-length lookbehind as Java does, but this regex is fixed-length. I couldn't use re.split() directly because re.split() did not split on an empty match at the time (support for that arrived in Python 3.7), but I think I found a bug in re.sub():

>>> import re
>>> r=re.compile("(?<=(.)\\1)")
>>> a=re.sub(r,"|", "Bazooka killed the poor aardvark (yummy!)")
>>> a
'Bazo|oka kil|led the po|or a|ardvark (yum|my!)'

The lookbehind matches between the two duplicate characters!

Brahms answered 29/4, 2010 at 7:59 Comment(4)
Check out #2629034 for more regex fun. - Dapple
It's stupid that re.split() doesn't split on an empty match, though. Why the heck would they do it like that? I'd think there are plenty of times you want to split based purely on assertions instead of an actual non-empty delimiter. - Dapple
I have asked the same thing on the Python bug tracker. It was probably unintended, but it is being left alone to avoid compatibility problems; there is a major regex engine overhaul underway, but it might be a while until the new regex module is merged into the standard library. - Brahms
Java's regex package originally had the same bug, where lookbehinds were not anchored at the current match position, but it was fixed in JDK 1.6. - Fanchette
