How to find a "minimal spanning set" for a collection of regular expressions?

CONTEXT:

I have a smallish (currently less than 100) but growing collection of Regular Expressions, and I want to optimize the process of determining for a given text string which of the REs in my collection match the text string.

Some of the REs have an ordering relationship - for example if I know that the string $t matches /windows/i then I also know that $t matches /windows.*2000/i. So when testing $t against the REs in my collection I can skip testing /windows/i if I've already tested $t against /windows.*2000/i and found a match (although if /windows.*2000/i does not match then of course I cannot skip the test against /windows/i).

Note that none of the REs in my collection are entirely equivalent (for any pair of REs there is at least one text string which matches one and does not match the other).

STRATEGY:

I want to build a directed graph G with a node for each RE in my collection and a directed edge for each pair of REs with an ordering relationship (A -> B means "match against A implies match against B"), and find a "minimal spanning set" of nodes for the graph (minimal set of nodes S such that every node in G lies on a directed path which originates in S).

THE EASY PART:

There are lots of freely available algorithms for working with Directed Acyclic Graphs. So once the graph G is built for my collection of REs (which being distinct should guarantee that G is acyclic) I don't expect to have much difficulty finding an appropriate algorithm for finding a minimal spanning set for G.

WHERE I NEED HELP:

I'd like to find an efficient way to find all the ordering relationships between the REs in my collection - and perhaps also to ensure that no two REs in the collection are equivalent (I will need a way to automatically verify this as new REs are added).

My (essentially random) web searches have thus turned up at least one plausible claim that a reasonable way to compute what (if any) ordering relationship exists between two REs does indeed exist, but have not yet turned up any descriptions of a complete algorithm.

Does anyone know of an existing implementation (for comparing REs) which is reasonably efficient, freely available, and (ideally) implemented either in one of the popular scripting languages or C/C++?

Recommended topics

Hot tags