Light regexp optimization
Asked Answered
A

1

6

I have a regular expression that was the output of a computer program. It has things like

(((2)|(9)))*

which a human would undoubtedly write as

[29]*

So I'd like a program that can make simple transformations that make the regular expression more readable. So far I've been using a quick script

$r =~ s/\(([0-9])\)/$1/g;
$r =~ s/\(([0-9])\|([0-9])\)/[$1$2]/g;
$r =~ s/\(([0-9]|\[[0-9]+\])\)\*/$1*/g;
$r =~ s/\((\[[0-9]+\]\*)\)/$1/g;
$r =~ s/\|\(([^()]+)\)\|/|$1|/g;

that bring down the length, but the result still contains pieces like

(ba*b)|ba*c|ca*b|ca*c

that should be simplified to

[bc]a*[bc]

I searched CPAN and found Regexp::List, Regexp::Assemble, and Regexp::Optimizer. The first two don't apply and the third has issues. First, it won't pass its tests so I can't use it unless I force install Regexp::Optimizer in cpan. Second, even once I do that it chokes on the expression.


Note: I tagged this [regular-language] in addition to [regex] because the regexp uses only concatenation, alternation, and the Kleene star, so it is in fact regular.

Arrowworm answered 19/8, 2011 at 17:51 Comment(8)
In the theory of regular languages, there are algorithms to take a regular expression, produce an NFA-lambda, generate an equivalent DFA, minimize the DFA, and produce a regular expression again. This could be one approach, although efficiency will not be good. If you want me to go over this in more detail anyway, let me know.Niela
The theoretical approach is definitely the way to go over a fixed set of rules. I'm not quite sure I buy this as an "optimization" though - assuming whatever uses the regular expression is half-decent it'll be doing all you can do and more. Most likely the DIY approach will introduce bugs and not produce much/any speed improvements. The only reason I'd go down that road would be if I was expecting humans to work with the output of your program directly.Ventris
en.wikipedia.org/wiki/DFA_minimizationVentris
@Patrick87: Unfortunately true optimization is well out of reach here: the regexp is too long. What I need is a heuristic method that's able to do some reasonable improvements.Arrowworm
@awoodland: I'm looking more for size and comprehension improvements rather than speed increases (which would be nice).Arrowworm
This is a bit tangential, but you are using (2) when you can most likely use (?:2). They both make a grouping, but the latter doesn't capture, which saves the regular expression from storing back-references; an optimization which might save you quite a bit if you really nest those parens a lot.Carrolcarroll
@Conspicuous Compiler: Right, I'm not using any of the parentheses as capturing groups. I might preprocess the regex to change the parens but for now I'm more concerned with making it human-readable.Arrowworm
(((2)|(9)))* is not functionally equivalent to [29]*, although (?:(?:(?:2)|(?:9)))* is. [oops, already stated]Phenomenon
N
3

I feel like there might be a way to do this by converting the regular expression into a grammar, putting the grammar into Chomsky Normal Form, merging common nonterminals, and looking for patterns using some comparison heurstic. You might even get more concise answers if you don't put it in "real" CNF... I'd leave the lambdas/epsilons inside.

  ba*b|ba*c|ca*b|ca*c

  S -> bAb | bAc | cAb | cAc
  A -> aA | lambda

  S -> BAB | BAC | CAB | CAC
  A -> AA | a | lambda
  B -> b
  C -> c

  S -> DB | DC | EB | EC
  A -> AA | a | lambda
  B -> b
  C -> c
  D -> BA
  E -> CA

At this point, maybe you find a heuristic that recognizes

  S -> (D+E)(B+C)

Backsubstituting,

  S -> (BA|CA)(b|c) -> (ba*|ca*)(b|c)

Repeat this on subexpressions, e.g.,

  S' -> bA' | cA'
  A' -> aA' | lambda

  S' -> B'A' | C'A'
  A' -> A'A' | a | lambda
  B' -> b
  C' -> c

Now recognizing that S -> (B|C)(A), we can get

 S' -> (B'|C')(A') -> (b|c)(a*)

For a final solution of

 S -> ((b|c)a*)(b|c)

Then, you can just look for excess parentheses to remove (noting that concatenation is associative, and this will essentially optimize things into concatenative normal form, just drop all parentheses that don't enclose only a |-delimited list of options... so the above becomes

  (b|c)a*(b|c)

The trick is coming up with the heuristic, and this may not do all the optimizations which are possible. I don't know how it will perform. Still, it might be something to consider.

Niela answered 19/8, 2011 at 19:37 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.