Alphabetic order regex using backreferences

Asked 30/6, 2017 at 14:33 Answered 20/7, 2017 at 4:17

Solved regex backreference character-class

I recently came across a puzzle to find a regular expression that matches:

5-character-long strings comprised of lowercase English letters in ascending ASCII order

Valid examples include:

aaaaa
abcde
xxyyz
ghost
chips
demos

Invalid examples include:

abCde
xxyyzz
hgost
chps

My current solution is kludgy. I use the regex:

(?=^[a-z]{5}$)^(a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*)$

which uses a non-consuming capture group to assert a string length of 5, and then verifies that the string comprises of lowercase English letters in order (see Rubular).

Instead, I'd like to use back references inside character classes. Something like:

^([a-z])([\1-z])([\2-z])([\3-z])([\4-z])$

The logic for the solution (see Rubular) in my head is to capture the first character [a-z], use it as a backrefence in the second character class and so on. However, \1, \2 ... within character classes seem to refer to ASCII values of 1, 2... effectively matching any four- or five-character string.

I have 2 questions:

Can I use back references in my character classes to check for ascending order strings?
Is there any less-hacky solution to this puzzle?

Halloo answered 30/6, 2017 at 14:33 Comment(9)

As far as I can tell, your kludgy solution is as sweet as it gets because the character classes don't play nice with your backreferences (but I do like your logic, seems like a nice feature to have). Do you have a specific environment to run this regex in (Ruby only or agnostic)? I am sure the regex wizards of SO will be along shortly to add their expertise. – Camporee 30/6, 2017 at 22:35

Uh, no backreferences inside character classes. The reason is that character classes are composed at compile time. The determining factor is that a backreference is dynamic, and what disqualifies it inside a class is the range operator. So, they say.. No dynamic range elst the engine crashes in a C++ exception. – Elderly 13/7, 2017 at 18:55

came across a puzzle to find a regular expression - Be sure to cite that link so we may laugh at the puzzler.. – Elderly 13/7, 2017 at 18:58

Actually, it's quite easy in a Perl regex. You can do all sorts of stuff, like counting, sequences, bools, subtraction, etc... (?{..}). If you think you'd use Perl, then it's .. doable. Btw, your first regex is just fine. – Elderly 13/7, 2017 at 18:59

One other thing to note is that in character classes the range operator pertains to single characters (min,max) not a set of characters, like \pL-z throws a construction error. In that vein, a reference can contain multiple characters. – Elderly 13/7, 2017 at 19:39

Why is demos included in invalid list? – Sabir 13/7, 2017 at 22:5

ok, your regex looks pretty nice to me. You can slightly improve it by using: ^(?=[a-z]{5}$)a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$ – Sabir 13/7, 2017 at 22:9

duplicate of https://mcmap.net/q/686550/-regex-5-digits-in-increasing-order – Cliffcliffes 14/7, 2017 at 0:45

Question cannot be closed due to bounty, but that is a dead-on dupe link. – Camporee 14/7, 2017 at 9:54

I'm posting this answer more as a comment than an answer since it has better formatting than comments.

Related to your questions:

Can I use back references in my character classes to check for ascending order strings?

No, you can't. If you take a look a backref regular-expressions section, you will find below documentation:

Parentheses and Backreferences Cannot Be Used Inside Character Classes

Parentheses cannot be used inside character classes, at least not as metacharacters. When you put a parenthesis in a character class, it is treated as a literal character. So the regex [(a)b] matches a, b, (, and ).

Backreferences, too, cannot be used inside a character class. The \1 in a regex like (a)[\1b] is either an error or a needlessly escaped literal 1. In JavaScript it's an octal escape.

Regarding your 2nd question:

Is there any less-hacky solution to this puzzle?

Imho, your regex is perfectly well, you could shorten it very little at the beginning like this:

(?=^.{5}$)^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$
    ^--- Here

Regex demo

Assiduous answered 14/7, 2017 at 20:5 Comment(2)

I think it's worth saying why backreferences in character classes are not possible. A backreference can point to a matched substring, which could be any length, even empty. Thus, code like [\1-z] could yield [-z], [f-z], [foo-z] etc. That doesn't make much sense and the pattern would have to be recompiled on the fly, preventing optimizations. – Odlo 15/7, 2017 at 12:26

It's disturbing to see doc's quoted when they don't provide reasons for if something can't be used in such a fundamental way. References inside character classes is such an example. But, you could always use your own reasoning for such a case, and it's fairly easy to know why in this case. – Elderly 15/7, 2017 at 21:10

If you are willing to use Perl (!), this will work:

/^([a-z])((??{"[$1-z]"}))((??{"[$2-z]"}))((??{"[$3-z]"}))(??{"[$4-z]"})$/

Motmot answered 14/7, 2017 at 23:33 Comment(2)

But... that's cheating! :) If you use Perl then I guess you could just code the assertion directly: /^([a-z]{5})$(?(?{$1 eq (join "", sort split qr##, $1)})|(?!))/ – Odlo 15/7, 2017 at 12:21

At some point I think it is questionable as to whether you are doing a Regex solution. I think mine is closer on the continuum :) – Motmot 17/7, 2017 at 19:3

Since someone has broken the ice by using Perl, this is a
Perl solution I guess ..

Note that this is a basic non-regex solution that just happens to be
stuffed into code constructs inside a Perl regex.
The interesting thing is that if a day comes when you need the synergy
of regex/code this is a good choice.
It is possible then that instead of a simple [a-z] character, you may
use a very complex pattern in it's place and using a check vs. last.
That is power !!

The regex ^(?:([a-z])(?(?{ $last gt $1 })(?!)|(?{ $last = $1 }))){5}$

Perl code

use strict;
use warnings;


$/ = "";

my @DAry = split /\s+/, <DATA>;

my $last;

for (@DAry)
{
    $last = '';
    if ( 
      /
         ^                             # BOS
         (?:                           # Cluster begin
              ( [a-z] )                     # (1), Single a-z letter
                                            # Code conditional
              (?(?{
                   $last gt $1                  # last > current ?
                })
                   (?!)                          # Fail
                |                              # else,
                   (?{ $last = $1 })             # Assign last = current
              )
         ){5}                          # Cluster end, do 5 times
         $                             # EOS
      /x )
    {
        print "good   $_\n";
    }
    else {
        print "bad    $_\n";
    }
}

__DATA__

aaaaa
abcde
xxyyz
ghost
chips
demos
abCde
xxyyzz
hgost
chps

Output

good   aaaaa
good   abcde
good   xxyyz
good   ghost
good   chips
good   demos
bad    abCde
bad    xxyyzz
bad    hgost
bad    chps

Elderly answered 15/7, 2017 at 21:3 Comment(0)

Ah, well, it's a finite set, so you can always enumerate it with alternation! This emits a "brute force" kind of regex in a little perl REPL:

#include <stdio.h>

int main(void) {
  printf("while (<>) { if (/^(?:");
  for (int a = 'a'; a <= 'z'; ++a)
    for (int b = a; b <= 'z'; ++b)
      for (int c = b; c <= 'z'; ++c) {
        for (int d = c; d <= 'y'; ++d)
          printf("%c%c%c%c[%c-z]|", a, b, c, d, d);
        printf("%c%c%czz", a, b, c);
        if (a != 'z' || b != 'z' || c != 'z') printf("|\n");
      }
  printf(")$/x) { print \"Match!\\n\" } else { print \"No match.\\n\" }}\n");
  return 0;
}

And now:

$ gcc r.c
$ ./a.out > foo.pl
$ cat > data.txt
aaaaa
abcde
xxyyz
ghost
chips
demos
abCde
xxyyzz
hgost
chps
^D
$ perl foo.pl < data.txt
Match!
Match!
Match!
Match!
Match!
Match!
No match.
No match.
No match.
No match.

The regex is only 220Kb or so ;-)

Birddog answered 20/7, 2017 at 4:17 Comment(3)

Aargh. Your regex broke Rubular (can't make a permalink). :-) Yep, it's 219 KB raw and works! – Halloo 20/7, 2017 at 4:55

Thanks for this. Using the raw regex text (from @Halloo paste bin) I put it through this ternary tool and made a full blown regex trie. That reduced the size quite a bit (160k compressed, 840k formatted). Here is a Python test. – Elderly 22/7, 2017 at 22:45

Also, I generated the full 142,506 possible strings, then used the full trie regex I made to test the performance:

Regex1: <omitted>     Options:  < none >     Completed iterations:   1  /  1     ( x 1 )     Matches found per iteration:   142506     Elapsed Time:    0.18 s,   178.87 ms,   178871 µs

– Elderly 22/7, 2017 at 22:57

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags