How to find a "minimal spanning set" for a collection of regular expressions?
Asked Answered
L

1

6

CONTEXT:

I have a smallish (currently less than 100) but growing collection of Regular Expressions, and I want to optimize the process of determining for a given text string which of the REs in my collection match the text string.

Some of the REs have an ordering relationship - for example if I know that the string $t matches /windows/i then I also know that $t matches /windows.*2000/i. So when testing $t against the REs in my collection I can skip testing /windows/i if I've already tested $t against /windows.*2000/i and found a match (although if /windows.*2000/i does not match then of course I cannot skip the test against /windows/i).

Note that none of the REs in my collection are entirely equivalent (for any pair of REs there is at least one text string which matches one and does not match the other).

STRATEGY:

I want to build a directed graph G with a node for each RE in my collection and a directed edge for each pair of REs with an ordering relationship (A -> B means "match against A implies match against B"), and find a "minimal spanning set" of nodes for the graph (minimal set of nodes S such that every node in G lies on a directed path which originates in S).

THE EASY PART:

There are lots of freely available algorithms for working with Directed Acyclic Graphs. So once the graph G is built for my collection of REs (which being distinct should guarantee that G is acyclic) I don't expect to have much difficulty finding an appropriate algorithm for finding a minimal spanning set for G.

WHERE I NEED HELP:

I'd like to find an efficient way to find all the ordering relationships between the REs in my collection - and perhaps also to ensure that no two REs in the collection are equivalent (I will need a way to automatically verify this as new REs are added).

My (essentially random) web searches have thus turned up at least one plausible claim that a reasonable way to compute what (if any) ordering relationship exists between two REs does indeed exist, but have not yet turned up any descriptions of a complete algorithm.

Does anyone know of an existing implementation (for comparing REs) which is reasonably efficient, freely available, and (ideally) implemented either in one of the popular scripting languages or C/C++?

Lockage answered 2/5, 2011 at 18:25 Comment(4)
Here's someone who thinks he can do it in Perl: perlmonks.org/bare/?node_id=675704Lockage
And here's someone who sounds like he would like to make RE equivalence checking into a CS theory exam question: perlmonks.org/?node_id=423543 (this dampens my optimism somewhat - it may be harder to find a nice ready-made solution that I had hoped/expected).Lockage
In python it appears to be possible to determine whether two REs are equivalent by capturing & comparing the output of the RE compiler (but this approach won't give any information about "ordering" REs which are not equivalent): #21398751Lockage
The accepted answer to this question appears to suggest an approach to computing an ordering on REs in general: #6363897Lockage
K
2

I am not sure if you have flexibility in terms of the regular expression library that you need to use, but you could look at RE2 whose Set interface can match multiple regexes simultaneously. Note that RE2 uses primarily a DFA approach, and does not support all of the regex features that other, mostly backtracking, implementations do.

Kandi answered 3/5, 2011 at 11:50 Comment(2)
My REs have no backreferences, so the RE2 library looks perfect for my application, Thanks! Do you happen know if someone has already built a PHP extension for it (as appears to have been done ages ago for Python), or will I have to roll my own?Lockage
@peter - I think you will have to roll your own (unless one exists that I don't know of), but the API is so simple that it should be pretty straightforward. Remember also that you should sort the regexes before feeding them to the Set. Remember also that this is pretty cutting edge - the non-set RE2 implementation is in use at google AFAIK, but I don't think the Set is. You should join the re2-dev googlegroup and feel free to ask Russ Cox (the developer) and other users about these kinds of things.Kandi

© 2022 - 2024 — McMap. All rights reserved.