I have a regular expression that was the output of a computer program. It has things like
(((2)|(9)))*
which a human would undoubtedly write as
[29]*
I'd like a program that applies simple transformations to make the regular expression more readable. So far I've been using a quick script
$r =~ s/\(([0-9])\)/$1/g;
$r =~ s/\(([0-9])\|([0-9])\)/[$1$2]/g;
$r =~ s/\(([0-9]|\[[0-9]+\])\)\*/$1*/g;
$r =~ s/\((\[[0-9]+\]\*)\)/$1/g;
$r =~ s/\|\(([^()]+)\)\|/|$1|/g;
that brings down the length, but the result still contains pieces like
(ba*b)|ba*c|ca*b|ca*c
that should be simplified to
[bc]a*[bc]
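One way to get from the first form to the second is to split each alternative into atoms and merge two alternatives whenever they differ in exactly one unstarred atom, using the identity xpy|xqy = x[pq]y. The following is only a rough sketch of that idea, not taken from any existing module: the names `atoms`, `merge_pair`, and `simplify_alternation` are made up for illustration, and it assumes flat alternatives built from alphanumeric literals, simple character classes, and `*`. Any input outside that subset is returned unchanged.

```perl
use strict;
use warnings;

# Split a flat alternative such as "ba*b" into atoms: a literal or a simple
# character class, each optionally followed by "*".
sub atoms {
    my ($alt) = @_;
    return [ $alt =~ /\G(\[[A-Za-z0-9]+\]\*?|[A-Za-z0-9]\*?)/g ];
}

# Merge two atom lists that differ in exactly one unstarred atom.
# Returns a new atom list, or nothing when the pair cannot be merged.
sub merge_pair {
    my ($x, $y) = @_;
    return unless @$x == @$y;
    my @diff = grep { $x->[$_] ne $y->[$_] } 0 .. $#$x;
    return unless @diff == 1;
    my ($p, $q) = ($x->[ $diff[0] ], $y->[ $diff[0] ]);
    return if $p =~ /\*$/ || $q =~ /\*$/;    # a* | b* is not [ab]*
    my %seen;
    my @chars = sort grep { !$seen{$_}++ } map { $_ =~ /[A-Za-z0-9]/g } ($p, $q);
    my @merged = @$x;
    $merged[ $diff[0] ] = '[' . join('', @chars) . ']';
    return \@merged;
}

sub simplify_alternation {
    my ($re) = @_;
    my @alts;
    for my $alt (split /\|/, $re, -1) {
        $alt =~ s/^\(([^()|]*)\)$/$1/;    # drop parens around a whole flat alternative
        my $atoms = atoms($alt);
        return $re unless join('', @$atoms) eq $alt;    # unsupported syntax: give up
        push @alts, $atoms;
    }
    my $changed = 1;
    while ($changed) {                    # repeat until no pair can be merged
        $changed = 0;
        PAIR: for my $i (0 .. $#alts) {
            for my $j ($i + 1 .. $#alts) {
                my $m = merge_pair($alts[$i], $alts[$j]) or next;
                splice @alts, $j, 1;
                $alts[$i] = $m;
                $changed = 1;
                last PAIR;
            }
        }
    }
    return join '|', map { join '', @$_ } @alts;
}

print simplify_alternation('(ba*b)|ba*c|ca*b|ca*c'), "\n";    # prints [bc]a*[bc]
```

The merge is greedy and only looks at pairs, so it can miss simplifications that need a different grouping order, and it never factors out common prefixes or suffixes of different lengths.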
I searched CPAN and found Regexp::List, Regexp::Assemble, and Regexp::Optimizer. The first two don't apply and the third has issues. First, it won't pass its tests so I can't use it unless I force install Regexp::Optimizer in cpan. Second, even once I do that it chokes on the expression.
Note: I tagged this [regular-language] in addition to [regex] because the regexp uses only concatenation, alternation, and the Kleene star, so it is in fact regular.
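Since the expressions are small and use a tiny alphabet, any rewrite can at least be sanity-checked by brute force: enumerate every string up to some length and confirm that the original and the simplified pattern accept exactly the same ones. This is a sketch of such a check, not a proof of equivalence; `strings_up_to` and `same_language` are hypothetical helper names, and the alphabet and length bound are arbitrary choices.

```perl
use strict;
use warnings;

# Return every string over @$alphabet with length 0 .. $max_len.
sub strings_up_to {
    my ($alphabet, $max_len) = @_;
    my @all   = ('');
    my @layer = ('');
    for (1 .. $max_len) {
        # Extend every string of the previous length by one more symbol.
        @layer = map { my $prefix = $_; map { $prefix . $_ } @$alphabet } @layer;
        push @all, @layer;
    }
    return @all;
}

# True when both patterns accept exactly the same strings up to $max_len.
sub same_language {
    my ($re_a, $re_b, $alphabet, $max_len) = @_;
    for my $s (strings_up_to($alphabet, $max_len)) {
        my $a_ok = $s =~ /\A(?:$re_a)\z/ ? 1 : 0;
        my $b_ok = $s =~ /\A(?:$re_b)\z/ ? 1 : 0;
        return 0 if $a_ok != $b_ok;    # found a string only one pattern accepts
    }
    return 1;
}

print same_language('(ba*b)|ba*c|ca*b|ca*c', '[bc]a*[bc]', [qw(a b c)], 6)
    ? "agree up to length 6\n"
    : "differ\n";
```

A passing check only says the two patterns agree on the strings tried; a failing one is a definite counterexample to the rewrite.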
(2) when you can most likely use (?:2). They both make a grouping, but the latter doesn't capture, which saves the regular expression from storing back-references; an optimization which might save you quite a bit if you really nest those parens a lot. – Carrolcarroll

(((2)|(9)))* is not functionally equivalent to [29]*, although (?:(?:(?:2)|(?:9)))* is. [oops, already stated] – Phenomenon
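The distinction raised in these comments can be made concrete by counting capture groups. In the sketch below all three patterns accept the same input, but only the fully parenthesized one sets up captures. `group_count` is a hypothetical helper written for this illustration; it relies on Perl's `$#+`, which gives the number of groups in the last successful match.

```perl
use strict;
use warnings;

# Number of capture groups in $pattern, measured after a successful
# whole-string match against $input; returns nothing if there is no match.
sub group_count {
    my ($pattern, $input) = @_;
    $input =~ /\A(?:$pattern)\z/ or return;
    return $#+;    # index of the last group in the pattern that just matched
}

for my $pattern ('(((2)|(9)))*', '[29]*', '(?:(?:(?:2)|(?:9)))*') {
    printf "%-24s %d capture group(s)\n", $pattern, group_count($pattern, '2992');
}
# The first pattern reports 4 groups, the other two report 0.
```

So a simplifier that rewrites (((2)|(9)))* to [29]* preserves the set of matched strings but not the captures, which matters only if something downstream reads $1 and friends.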