Light regexp optimization

I have a regular expression that was the output of a computer program. It has things like

(((2)|(9)))*

which a human would undoubtedly write as

[29]*

So I'd like a program that can make simple transformations that make the regular expression more readable. So far I've been using a quick script

$r =~ s/\(([0-9])\)/$1/g;
$r =~ s/\(([0-9])\|([0-9])\)/[$1$2]/g;
$r =~ s/\(([0-9]|\[[0-9]+\])\)\*/$1*/g;
$r =~ s/\((\[[0-9]+\]\*)\)/$1/g;
$r =~ s/\|\(([^()]+)\)\|/|$1|/g;

that bring down the length, but the result still contains pieces like

(ba*b)|ba*c|ca*b|ca*c

that should be simplified to

[bc]a*[bc]

I searched CPAN and found Regexp::List, Regexp::Assemble, and Regexp::Optimizer. The first two don't apply and the third has issues. First, it won't pass its tests so I can't use it unless I force install Regexp::Optimizer in cpan. Second, even once I do that it chokes on the expression.

Note: I tagged this [regular-language] in addition to [regex] because the regexp uses only concatenation, alternation, and the Kleene star, so it is in fact regular.

I feel like there might be a way to do this by converting the regular expression into a grammar, putting the grammar into Chomsky Normal Form, merging common nonterminals, and looking for patterns using some comparison heurstic. You might even get more concise answers if you don't put it in "real" CNF... I'd leave the lambdas/epsilons inside.

  ba*b|ba*c|ca*b|ca*c

  S -> bAb | bAc | cAb | cAc
  A -> aA | lambda

  S -> BAB | BAC | CAB | CAC
  A -> AA | a | lambda
  B -> b
  C -> c

  S -> DB | DC | EB | EC
  A -> AA | a | lambda
  B -> b
  C -> c
  D -> BA
  E -> CA

At this point, maybe you find a heuristic that recognizes

  S -> (D+E)(B+C)

Backsubstituting,

  S -> (BA|CA)(b|c) -> (ba*|ca*)(b|c)

Repeat this on subexpressions, e.g.,

  S' -> bA' | cA'
  A' -> aA' | lambda

  S' -> B'A' | C'A'
  A' -> A'A' | a | lambda
  B' -> b
  C' -> c

Now recognizing that S -> (B|C)(A), we can get

 S' -> (B'|C')(A') -> (b|c)(a*)

For a final solution of

 S -> ((b|c)a*)(b|c)

Then, you can just look for excess parentheses to remove (noting that concatenation is associative, and this will essentially optimize things into concatenative normal form, just drop all parentheses that don't enclose only a |-delimited list of options... so the above becomes

  (b|c)a*(b|c)

The trick is coming up with the heuristic, and this may not do all the optimizations which are possible. I don't know how it will perform. Still, it might be something to consider.

Recommended topics

Hot tags