How does this Java regex detect palindromes?
Asked Answered
S

1

23

This is the third part in a series of educational regex articles. It follows How does this regex find triangular numbers? (where nested references is first introduced) and How can we match a^n b^n with Java regex? (where the lookahead "counting" mechanism is further elaborated upon). This part introduces a specific form of nested assertion, which when combined with nested references allows Java regex to match what most people believe is "impossible": palindromes!!

The language of palindromes is non-regular; it's actually context-free (for a given alphabet). That said, modern regex implementation recognizes more than just regular languages, and Perl/PCRE's recursive patterns and .NET's balancing groups can readily recognize palindromes (see: Related Questions).

However, Java's regex engine supports neither of these "advanced" features. And yet "someone" (*wink*) managed to write the following regex which seems to do the job just fine (see also on ideone.com):

public class Palindrome {
    // asserts that the entirety of the string matches the given pattern
    static String assertEntirety(String pattern) {
        return "(?<=(?=^pattern$).*)".replace("pattern", pattern);
    }

    public static void main(String[] args) {
        final String PALINDROME =
            "(?x) | (?:(.) add)+ chk"
                .replace("add", assertEntirety(".*? (\\1 \\2?)"))
                .replace("chk", assertEntirety("\\2"));

        System.out.println(PALINDROME);
        // (?x) | (?:(.) (?<=(?=^.*? (\1 \2?)$).*))+ (?<=(?=^\2$).*)
        
        String[] tests = {
            "",     // true
            "x",    // true
            "xx",   // true
            "xy",   // false
            "xyx",  // true
            "xxx",  // true
            "xxyx", // false
            "racecar",                // true
            "step on no pets",        // true
            "aManaPlanaCanalPanaMa",  // true
            "this is impossible",     // FALSE!!!
        };
        for (String test : tests) {
            System.out.printf("[%s] %s%n", test, test.matches(PALINDROME));
        }
    }
}

So this seems to work, but how?

References


COMMON SENSE ALERT!!!

This is not the best way to detect palindromes; it's O(N^3)at best. Performing this detection in a more general purpose programming language is both more efficient and more straightforward.

You wouldn't want to use regex to detect palindromes for the same reasons you wouldn't want to use regex to find prime numbers. That said, you would study how a non-recursive non-balancing group regex can detect palindromes for the same reasons you would study how a regex can be used for primality testing: it's fun, it's challenging, it's educational.

Related questions

Servility answered 8/9, 2010 at 5:34 Comment(7)
Discussion of the series: meta.stackexchange.com/questions/62695/… ; Also, no doubt there are typos/mistakes in this long article. Please leave comments and feedbacks as I do plan to continually revise.Servility
I would just like to mention that the use of regular expressions to detect palindromes is a particularly silly idea. There are much better ways to do it. By all means use this to educate yourself on regular expressions but part of that education is knowing when not to use them. Not trying to rain on your parade, @poly, I'm sure this is a good article :-)Margerymarget
@paxdiablo: This article isn't really about palindromes, for almost the same reason that fables aren't really about talking lions and/or singing donkeys. There are morals and lessons to be had here, and looking at this as just a palindrome detection regex merely scratches the surface.Servility
@poly - I agree with @Margerymarget and would add that doing complicated things in general with regexes is a silly / bad idea. IMO, this question / answer should be prefixed with the warning DO NOT DO THIS in flashing red 26 point capital letters.Pinhead
@pax, @Stephen: I think there's a BIG difference between asking "How do I parse this XML fragment with regex?" and "How can we detect palindromes with (Java) regex?". With the former, people don't know any better, but with the latter, people know it's not the best solution, but they ask because they want to learn. This is the same driving motivation that intrigues people to understand e.g. primality testing with regex. It's OBVIOUSLY not the "proper" solution, but it's intrinsically educational. That said, I will add a "COMMON SENSE ALERT" warning on next revision sometime later today.Servility
I'm happy with this question, given that it does follow the rules as set down in the FAQ. I'd also hate to see it go just because so much damn work went into it and the answer. I was just pointing out that palindrome detection with a regex wasn't the best use of it. It's likely I wouldn't have been so critical if I'd had the time to read the whole thing instead of just the title and first para :-)Margerymarget
Part 4: #3694198Servility
S
19

The Big Picture

We will first look at this regex from the overall big picture algorithm, and then take a closer look at the specific implementation details later. The regex is an almost direct translation of the following Java code:

static boolean isPalindrome(String s) {
   if (s.isEmpty()) {
      return true;
   }
   String g2 = null;
   for (char ch : s.toCharArray()) {
      String g1 = String.valueOf(ch);
      // "add"
      if (g2 != null && s.endsWith(g1 + g2)) {
         g2 = g1 + g2;
      } else if (s.endsWith(g1)) {
         g2 = g1;
      } else {
         break;
      }
   }
   return s.equals(g2); // "chk"
}

This is obviously not the most straightforward/efficient Java code to check for palindromes, but it works, and most fascinatingly, it's almost directly translatable to regex with a near one-to-one mapping. Here's the regex again, reproduced here for convenience, annotated to highlight the striking resemblance:

//  isEmpty  _for-loop_
//       ↓  /          \
    "(?x) | (?:(.) add)+ chk"
//             \_/  ↑
//             g1   loop body                   ___g2___
//                                             /        \
           .replace("add", assertEntirety(".*? (\\1 \\2?)"))
           .replace("chk", assertEntirety("\\2"));
                           // s.equals(g2)

Attachment: annotated and expanded version of the source code on ideone.com

(Feel free to ignore the details of assertEntirety for now: just think of it as a black box regex mechanism that can make an assertion on the entire string regardless of where we currently are.)

So the basic algorithm is that we try to build a suffix, subject to a palindromic constraint, as we scan the string left to right. We then check if we're able to build the complete string in this manner. If we can, then the string is a palindrome. Also, as a special case, the empty string is trivially a palindrome.

Once the big picture algorithm is understood, we can examine how the regex pattern implements it.


What's with all the String.replace?

Regex patterns in Java are ultimately nothing but strings, meaning they can be derived through string manipulations the way any string can. Yes, we can even use regex to generate a regex pattern -- a sort of meta-regexing approach, if you will.

Consider this example of initializing an int constant (which ultimately contains nothing but a number):

final int X = 604800;
final int Y = 60 * 60 * 24 * 7;
// now X == Y

The number assigned to X is a literal integer: we can clearly see what the number is. This is not the case with Y which uses an expression instead, and yet this formula seems to convey an idea of what this number represents. Even without proper naming of these constants, we nonetheless get the idea that Y probably represents the number of seconds in a week, even if we may not immediately know what the numeric value is. On the other hand, with X we know precisely that number, but we get less of an idea of what it represents.

The use of string replacements in the snippet is an analogous situation but for strings regex patterns. Instead of explicitly writing the pattern as one literal string, sometimes systematic and logical derivation (the "formula") of that value from simpler parts can be much more meaningful. This is especially true with regex, where it often matters more that we understand what the pattern does than being able to see what it looks like as a string literal (which isn't much of a looker anyway, what with all the extra backslashes).

A portion of the snippet is reproduced here again for convenience:

// the "formula"
     final String PALINDROME =
        "(?x) | (?:(.) add)+ chk"
           .replace("add", assertEntirety(".*? (\\1 \\2?)"))
           .replace("chk", assertEntirety("\\2"));

// the "value"
     System.out.println(PALINDROME);
     //                       ____add_____             chk_
     //               _______/            \____   _______/ \_____
     // (?x) | (?:(.) (?<=(?=^.*? (\1 \2?)$).*))+ (?<=(?=^\2$).*)
     //        |  \_/             \______/     |
     //        |   1                 2         |
     //        |_______________________________|

Without a doubt the "formula" is a lot more readable than the eventual string "value" in this case.

There are certainly much more sophisticated ways to programmatically generate a regex pattern, and it's certainly possible to write in such a way that obfuscates instead of accentuates its meaning, but mindful usage of even simple string replacements can still do wonder (as hopefully shown in this example).

Lesson: Consider programmatic generation of regex patterns.


How does add work?

The (?:(.) add)+ construct, where add is an assertion that does some sort of "counting", has already been thoroughly discussed in the previous two parts. Two features are worth noting, though:

  • The (.) captures into group 1, allowing backreference later on
  • The assertion is assertEntirety instead of just looking ahead from our current position
    • We'll discuss this in more detail later; just think of it as a way to assert on the entire string

The pattern applied to assertEntirety in add is the following:

# prefix   _suffix_
#    ↓    /        \
    .*?   ( \1 \2? )
#         \________/   i.e. a reluctant "whatever" prefix (as short as possible)
#          group 2          followed by a suffix captured into group 2

Note that group 2 is self-referencing with an optional specifier, a technique already discussed in part 2 of the series. Needless to say group 2 is our "counter" in this pattern: it's a suffix that we will try to grow leftward on every iteration of the "loop". As we iterate on each (.) left to right, we try to prepend that same character (using backreference to \1) to our suffix.

Recall again the Java code translation of the above pattern, reproduced here for convenience:

if (g2 != null && s.endsWith(g1 + g2)) {   // \2? is greedy, we try this first
   g2 = g1 + g2;
} else if (s.endsWith(g1)) {    // since \2? is optional, we may also try this
   g2 = g1;
} else {        // if there's no matching suffix, we "break" out of the "loop"
   break;
}

The fact that \2? is optional means a few things:

  • It provides a "base case" for the self-reference (the main reason we do this!)
  • Since \2? is part of the suffix pattern (and thus appears later in the overall pattern), the prefix part must be reluctant, hence .*? instead of .*. This allows \2? to exercise its greediness.
  • The "counter" may also "reset" and give the "wrong" result
    • In part 2 we showed how backtracking ? may result in the same kind of problematic resetting
      • We solved the problem by using possessive quantifier ?+, but this is not applicable here

The third point is elaborated further in the next section.

Lesson: Carefully analyze the interactions between greedy/reluctant repetitions in parts of a pattern.

Related questions


Why do we need a chk phase?

As alluded in the previous section, the optional and backtrackable \2? means that our suffix can shrink under some circumstances. We will examine such a scenario step by step for this input:

 x y x y z y x
↑
# Initial state, \2 is "uninitialized"
             _
(x)y x y z y x
  ↑
  # \1 captured x, \2 couldn't match \1\2 (since \2 is "uninitialized")
  #                but it could match \1 so it captured x
           ___
 x(y)x y z y x
    ↑
    # \1 captured y, \2 matched \1\2 and grew to capture yx
             _
 x y(x)y z y x
      ↑
      # \1 captured x, \2 couldn't match \1\2,
      #                but it could match \1 so it shrunk to capture x (!!!)
           ___
 x y x(y)z y x
        ↑
        # \1 captured y, \2 matched \1\2 and grew to capture yx
         _____
 x y x y(z)y x
          ↑
          # \1 captured z, \2 matched \1\2 and grew to capture zyx
       _______
 x y x y z(y)x
            ↑
            # \1 captured y, \2 matched \1\2 and grew to capture yzyx
     _________
 x y x y z y(x)
              ↑
              # \1 captured x, \2 matched \1\2 and grew to capture xyzyx

We can modify our pattern (and the corresponding Java code) to omit the chk phase, and see that indeed this is what happens:

    // modified pattern without a chk phase; yields false positives!
    final String PALINDROME_BROKEN =
        "(?x) | (?:(.) add)+"
            .replace("add", assertEntirety(".*? (\\1 \\2?)"));

    String s = "xyxyzyx"; // NOT a palindrome!!!
    
    Matcher m = Pattern.compile(PALINDROME_BROKEN).matcher(s);
    if (m.matches()) {
        System.out.println(m.group(2)); // prints "xyzyx"
    }

As explained, "xyxyzyx", which is NOT a palindrome, is falsely reported as one, because we didn't check if the growing suffix eventually became the complete string (which it clearly did not in this case). The chk phase (which is an assertEntirety of the pattern \2) is therefore an absolute necessity in our setup. We need to confirm that indeed we managed to grow our suffix all the way. If this is the case, then we have ourselves a palindrome.

Lesson: Carefully analyze any possibly unintended side-effects of optional self-reference matching.


The Main Course: assertEntirety

While it's neat that we can write a Java regex pattern to detect palindromes, everything here except assertEntirety has already been covered in the previous parts of the series. The only new thing here is this mysterious black box, this powerful mechanism that magically allowed us to do what is otherwise "impossible".

The assertEntirety construct is based on the following meta-pattern of nested lookarounds:

(?<=(?=^pattern$).*)

" I can see a place somewhere behind me where I can look ahead and see ^pattern$ "

The name "lookaround" implies the relativity to our current position: we're looking around us, perhaps in front of or behind, from where we're standing. By nesting a lookahead in a lookbehind in this manner, we're able to metaphorically "fly into the skies" and look at the entire picture.

Abstracting this meta-pattern into assertEntirety is a bit like writing preprocessing substitution macros. Having nested lookarounds everywhere probably hurts readability and maintainability, so we encapsulate it into assertEntirety, which not only hides the complexity of its inner workings, but also further emphasizes its semantics by giving it an appropriate name.

Lesson: Consider abstracting meta-patterns to hide complexity and convey semantics.


Appendix: On infinite-length lookbehind in Java

Observant readers would notice that assertEntirety contains a .* in a lookbehind, which makes its theoretical maximum length infinite. No, Java does not officially support infinite-length lookbehind. Yes, as it has been adequatedly demonstrated here, it works anyway. Officially it's categorized as a "bug"; but "someone"(*wink*) could also consider it a "hidden feature".

It's certainly possible that this "bug" will be "fixed" in the future. Removal of this hidden feature will break this particular solution to the Java regex palindrome problem.

Related questions

Servility answered 8/9, 2010 at 5:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.