How does this PCRE pattern detect palindromes?
Asked Answered
L

2

20

This question is an educational demonstration of the usage of lookahead, nested reference, and conditionals in a PCRE pattern to match ALL palindromes, including the ones that can't be matched by the recursive pattern given in the PCRE man page.

Examine this PCRE pattern in PHP snippet:

$palindrome = '/(?x)
^
  (?:
      (.) (?=
              .*
              (
                \1
                (?(2) \2 | )
              )
              $
          )
  )*
  .?
  \2?
$


/';

This pattern seems to detect palindromes, as seen in this test cases (see also on ideone.com):

$tests = array(
  # palindromes
  '',
  'a',
  'aa',
  'aaa',
  'aba',
  'aaaa',
  'abba',
  'aaaaa',
  'abcba',
  'ababa',

  # non-palindromes
  'aab',
  'abab',
  'xyz',
);

foreach ($tests as $test) {
  echo sprintf("%s '%s'\n", preg_match($palindrome, $test), $test);  
}

So how does this pattern work?


Notes

This pattern uses a nested reference, which is a similar technique used in How does this Java regex detect palindromes?, but unlike that Java pattern, there's no lookbehind (but it does use a conditional).

Also, note that the PCRE man page presents a recursive pattern to match some palindromes:

# the recursive pattern to detect some palindromes from PCRE man page
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$

The man page warns that this recursive pattern can NOT detect all palindromes (see: Why will this recursive regex only match when a character repeats 2n - 1 times? and also on ideone.com), but the nested reference/positive lookahead pattern presented in this question can.

Limpid answered 19/9, 2010 at 16:27 Comment(7)
I wrote the pattern so I obviously know how it works but I'm going to let someone else answer it.Limpid
Is this a contest? If so maybe a bounty? =)Volcanism
I wondered where you found all those tricks with regexes, didn't though to look on pcre.org :POpposable
@Colin: The nested references stuff all came out of personal research. I only very recently found the PCRE man page, and I haven't really read through it yet.Limpid
@Limpid Can you give some of the references/resources you used during your research ? (Even if the PCRE.txt is quite exhaustive)Opposable
@Colin: java.util.regex.Pattern, MSDN/.NET Framework 4/Regular Expression Language Elements, regular-expressions.info. No doubt there are better resources out there, but that's pretty much all I used.Limpid
This question has been added to the Stack Overflow Regular Expression FAQ, under "Advanced Regex-Fu".Fixture
E
28

Let's try to understand the regex by constructing it. Firstly, a palindrome must start and end with the same sequence of character in the opposite direction:

^(.)(.)(.) ... \3\2\1$

we want to rewrite this such that the ... is only followed by a finite length of patterns, so that it could be possible for us to convert it into a *. This is possible with a lookahead:

^(.)(?=.*\1$)
 (.)(?=.*\2\1$)
 (.)(?=.*\3\2\1$) ...

but there are still uncommon parts. What if we can "record" the previously captured groups? If it is possible we could rewrite it as:

^(.)(?=.*(?<record>\1\k<record>)$)   # \1     = \1 + (empty)
 (.)(?=.*(?<record>\2\k<record>)$)   # \2\1   = \2 + \1
 (.)(?=.*(?<record>\3\k<record>)$)   # \3\2\1 = \3 + \2\1
 ...

which could be converted into

^(?: 
    (.)(?=.*(\1\2)$)
 )*

Almost good, except that \2 (the recorded capture) is not empty initially. It will just fail to match anything. We need it to match empty if the recorded capture doesn't exist. This is how the conditional expression creeps in.

(?(2)\2|)   # matches \2 if it exist, empty otherwise.

so our expression becomes

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*

Now it matches the first half of the palindrome. How about the 2nd half? Well, after the 1st half is matched, the recorded capture \2 will contain the 2nd half. So let's just put it in the end.

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*\2$

We want to take care of odd-length palindrome as well. There would be a free character between the 1st and 2nd half.

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*.?\2$

This works good except in one case — when there is only 1 character. This is again due to \2 matches nothing. So

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*.?\2?$
#      ^ since \2 must be at the end in the look-ahead anyway.
Esquibel answered 19/9, 2010 at 20:56 Comment(0)
T
1

I want to bring my very own solution to the table. This is a regex that I've written a while ago to solve matching palindromes using PCRE/PCRE2

^((\w)(((\w)(?5)\5?)*|(?1)|\w?)\2)$

Example: https://regex101.com/r/xvZ1H0/1

Thermolysis answered 14/10, 2021 at 19:6 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.