General approach for (equivalent of) "backreferences within character class"?

In Perl regexes, expressions like \1, \2, etc. are usually interpreted as "backreferences" to previously captured groups, but not so when the \1, \2, etc. appear within a character class. In the latter case, the \ introduces an ordinary escape sequence (so \1 is read as an octal escape, i.e. the character with code point 1, and never as the contents of the first capture group).

Therefore, if (for example) one wanted to match a string (of length greater than 1) whose first character matches its last character, but does not appear anywhere else in the string, the following regex will not do:

/\A       # match beginning of string;
 (.)      # match and capture first character (referred to subsequently by \1);
 [^\1]*   # (WRONG) match zero or more characters different from character in \1;
 \1       # match \1;
 \z       # match the end of the string;
/sx       # s: let . match newline; x: ignore whitespace, allow comments

It does not work, since it matches (for example) the string 'a1a2a':

  DB<1> ( 'a1a2a' =~ /\A(.)[^\1]*\1\z/ and print "fail!" ) or print "success!"
fail!
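
A short script makes the failure mode visible. It is only a sketch with illustrative variable names: inside a bracketed class Perl reads \1 as an octal escape, so [^\1] excludes nothing but the control character chr(1), and the inner 'a' is consumed without complaint.

```perl
use strict;
use warnings;

# Inside [...], \1 is an octal escape (the character with code point 1),
# not a reference to capture group 1.
my $class = qr/\A[\1]\z/;
my $matches_ctrl  = (chr(1) =~ $class) ? 1 : 0;   # control character U+0001
my $matches_digit = ('1'    =~ $class) ? 1 : 0;   # the digit "1"

# So [^\1]* consumes the inner 'a', and the broken pattern matches.
my $broken_matches = ('a1a2a' =~ /\A(.)[^\1]*\1\z/s) ? 1 : 0;

print "class matches chr(1): $matches_ctrl\n";
print "class matches '1':    $matches_digit\n";
print "broken pattern matches 'a1a2a': $broken_matches\n";
```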

I can usually manage to find some workaround [1], but it's always rather problem-specific, and usually far more complicated-looking than what I would do if I could use backreferences within a character class.

Is there a general (and hopefully straightforward) workaround?


[1] For example, for the problem in the example above, I'd use something like

/\A
 (.)              # match and capture first character (referred to subsequently
                  # by \1);
 (?!.*\1.+\z)     # a negative lookahead assertion for "a suffix containing \1";
 .*               # substring not containing \1 (as guaranteed by the preceding
                  # negative lookahead assertion);
 \1\z             # match last character only if it is equal to the first one
/sx

...where I've replaced the reasonably straightforward (though, alas, incorrect) subexpression [^\1]* in the earlier regex with the somewhat more forbidding negative lookahead assertion (?!.*\1.+\z). This assertion basically says "give up if \1 appears anywhere beyond this point (other than at the last position)." Incidentally, I give this solution just to illustrate the sort of workarounds I referred to in the question. I don't claim that it is a particularly good one.
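
To check the footnote's workaround, here is a minimal sketch that runs it against a few strings. The sample inputs and variable names are made up for illustration and are not part of the original question.

```perl
use strict;
use warnings;

# Footnote workaround: reject any occurrence of \1 that is followed by at
# least one more character, i.e. any occurrence before the final position.
my $re = qr/\A (.) (?!.*\1.+\z) .* \1 \z/sx;

# Hypothetical sample inputs, chosen only for illustration.
my @samples = ('a1a2a', 'a12a', 'aa', 'a', 'abca', 'abab');

my %result;
$result{$_} = ($_ =~ $re ? 1 : 0) for @samples;

printf "%-6s %s\n", $_, ($result{$_} ? 'match' : 'no match') for @samples;
```

A single character such as 'a' is rejected because the pattern needs the first character to occur a second time at the end.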

Leucocytosis answered 14/8, 2013 at 20:54

Comment (Peskoff): The accepted solution is perfect for negation, but won't cover some other uses of character classes, such as ranges. Suppose you wanted to match all sequences of 3 digits in non-decreasing order (so "111", "123", "368", "449", but not "987" or "322"). Using backrefs in character classes, the pseudo-regex would be /^([0-9])([\1-9])([\2-9])$/, but you can't accomplish the same as simply with a negative lookahead.
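
For the range case, one possible approach (not proposed anywhere in this thread) is Perl's postponed-subexpression construct (??{ ... }), which builds a sub-pattern at match time from the captures made so far. The sketch below assumes the captures are single digits, so interpolating them into a class is safe; arbitrary characters would need escaping first.

```perl
use strict;
use warnings;

# Each (??{ ... }) block runs during matching and returns a string that is
# compiled as a sub-pattern, so the class can depend on an earlier capture.
my $non_decreasing = qr/\A ([0-9]) ((??{ "[$1-9]" })) ((??{ "[$2-9]" })) \z/x;

my @samples = qw(111 123 368 449 987 322);

my %ok;
$ok{$_} = ($_ =~ $non_decreasing ? 1 : 0) for @samples;

print "$_: ", ($ok{$_} ? 'match' : 'no match'), "\n" for @samples;
```

This trades readability and some speed for generality, since the sub-pattern is built while the match is running.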

This can be accomplished with a negative lookahead within a repeated group:

/\A         # match beginning of string;
 (.)        # match and capture first character (referred to subsequently by \1);
 ((?!\1).)* # match zero or more characters different from character in \1;
 \1         # match \1;
 \z         # match the end of the string;
/sx

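This pattern can be used even if the group contains more than one character: because (?!\1) is checked before every character consumed, a multi-character \1 cannot begin anywhere inside the repeated part.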
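
As a rough illustration of the multi-character case, the sketch below uses a two-character group. The sample strings are hypothetical, and (?: ... ) replaces ( ... ) only to avoid creating an extra capture group.

```perl
use strict;
use warnings;

# The string must start and end with the same two-character pair, and that
# pair must not begin anywhere in between.
my $re = qr/\A (..) (?:(?!\1).)* \1 \z/sx;

# Hypothetical sample inputs, chosen only for illustration.
my @samples = ('ab--ab', 'ab-ab-ab', 'abab', 'abba');

my %result;
$result{$_} = ($_ =~ $re ? 1 : 0) for @samples;

printf "%-9s %s\n", $_, ($result{$_} ? 'match' : 'no match') for @samples;
```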

Orpington answered 14/8, 2013 at 20:57
