Why doesn't finite repetition in lookbehind work in some flavors?
Asked Answered
E

4

5

I want to parse the 2 digits in the middle from a date in dd/mm/yy format but also allowing single digits for day and month.

This is what I came up with:

(?<=^[\d]{1,2}\/)[\d]{1,2}

I want a 1 or 2 digit number [\d]{1,2} with a 1 or 2 digit number and slash ^[\d]{1,2}\/ before it.

This doesn't work on many combinations, I have tested 10/10/10, 11/12/13, etc...

But to my surprise (?<=^\d\d\/)[\d]{1,2} worked.

But the [\d]{1,2} should also match if \d\d did, or am I wrong?

Equation answered 1/7, 2010 at 15:58 Comment(1)
I use C# but tested the above regex on a web regex tester. I didn't know there are such significant differences in the different languages.Equation
R
15

On lookbehind support

Major regex flavors have varying supports for lookbehind differently; some imposes certain restrictions, and some doesn't even support it at all.

  • Javascript: not supported
  • Python: fixed length only
  • Java: finite length only
  • .NET: no restriction

References


Python

In Python, where only fixed length lookbehind is supported, your original pattern raises an error because \d{1,2} obviously does not have a fixed length. You can "fix" this by alternating on two different fixed-length lookbehinds, e.g. something like this:

(?<=^\d\/)\d{1,2}|(?<=^\d\d\/)\d{1,2}

Or perhaps you can put both lookbehinds as alternates of a non-capturing group:

(?:(?<=^\d\/)|(?<=^\d\d\/))\d{1,2}

(note that you can just use \d without the brackets).

That said, it's probably much simpler to use a capturing group instead:

^\d{1,2}\/(\d{1,2})

Note that findall returns what group 1 captures if you only have one group. Capturing group is more widely supported than lookbehind, and often leads to a more readable pattern (such as in this case).

This snippet illustrates all of the above points:

p = re.compile(r'(?:(?<=^\d\/)|(?<=^\d\d\/))\d{1,2}')

print(p.findall("12/34/56"))   # "[34]"
print(p.findall("1/23/45"))    # "[23]"

p = re.compile(r'^\d{1,2}\/(\d{1,2})')

print(p.findall("12/34/56"))   # "[34]"
print(p.findall("1/23/45"))    # "[23]"

p = re.compile(r'(?<=^\d{1,2}\/)\d{1,2}')
# raise error("look-behind requires fixed-width pattern")

References


Java

Java supports only finite-length lookbehind, so you can use \d{1,2} like in the original pattern. This is demonstrated by the following snippet:

    String text =
        "12/34/56 date\n" +
        "1/23/45 another date\n";

    Pattern p = Pattern.compile("(?m)(?<=^\\d{1,2}/)\\d{1,2}");
    Matcher m = p.matcher(text);
    while (m.find()) {
        System.out.println(m.group());
    } // "34", "23"

Note that (?m) is the embedded Pattern.MULTILINE so that ^ matches the start of every line. Note also that since \ is an escape character for string literals, you must write "\\" to get one backslash in Java.


C-Sharp

C# supports full regex on lookbehind. The following snippet shows how you can use + repetition on a lookbehind:

var text = @"
1/23/45
12/34/56
123/45/67
1234/56/78
";

Regex r = new Regex(@"(?m)(?<=^\d+/)\d{1,2}");
foreach (Match m in r.Matches(text)) {
  Console.WriteLine(m);
} // "23", "34", "45", "56"

Note that unlike Java, in C# you can use @-quoted string so that you don't have to escape \.

For completeness, here's how you'd use the capturing group option in C#:

Regex r = new Regex(@"(?m)^\d+/(\d{1,2})");
foreach (Match m in r.Matches(text)) {
  Console.WriteLine("Matched [" + m + "]; month = " + m.Groups[1]);
}

Given the previous text, this prints:

Matched [1/23]; month = 23
Matched [12/34]; month = 34
Matched [123/45]; month = 45
Matched [1234/56]; month = 56

Related questions

Residential answered 1/7, 2010 at 16:5 Comment(1)
I got a positive look-behind with a + repetition to work on Java 11. "a 111x b111x c".replaceFirst("(?<=b1+)x", "z") gave a 111x b111z c. I don’t know if it’s an improvement in some Java version.Redress
B
4

Unless there's a specific reason for using the lookbehind which isn't noted in the question, how about simply matching the whole thing and only capturing the bit you're interested in instead?

JavaScript example:

>>> /^\d{1,2}\/(\d{1,2})\/\d{1,2}$/.exec("12/12/12")[1]
"12"
Babylon answered 1/7, 2010 at 16:5 Comment(0)
S
3

To quote regular-expressions.info:

The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

Therefore, many regex flavors, including those used by Perl and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.

In other words your regex does not work because you're using a variable-width expression inside a lookbehind and your regex engine does not support that.

Superfine answered 1/7, 2010 at 16:2 Comment(0)
I
2

In addition to those listed by @polygenelubricants, there are two more exceptions to the "fixed length only" rule. In PCRE (the regex engine for PHP, Apache, et al) and Oniguruma (Ruby 1.9, Textmate), a lookbehind may consist of an alternation in which each alternative may match a different number of characters, as long as the length of each alternative is fixed. For example:

(?<=\b\d\d/|\b\d/)\d{1,2}(?=/\d{2}\b)

Note that the alternation has to be at the top level of the lookbehind subexpression. You might, like me, be tempted to factor out the common elements, like this:

(?<=\b(?:\d\d/|\d)/)\d{1,2}(?=/\d{2}\b)

...but it wouldn't work; at the top level, the subexpression now consists of a single alternative with a non-fixed length.

The second exception is much more useful: \K, supported by Perl and PCRE. It effectively means "pretend the match really started here." Whatever appears before it in the regex is treated as a positive lookbehind. As with .NET lookbehinds, there are no restrictions; whatever can appear in a normal regex can be used before the \K.

\b\d{1,2}/\K\d{1,2}(?=/\d{2}\b)

But most of the time, when someone has a problem with lookbehinds, it turns out they shouldn't even be using them. As @insin pointed out, this problem can be solved much more easily by using a capturing group.

EDIT: Almost forgot JGSoft, the regex flavor used by EditPad Pro and PowerGrep; like .NET, it has completely unrestricted lookbehinds, positive and negative.

Inoperable answered 5/7, 2010 at 1:19 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.