I'd do this by constructing the minimum Deterministic Finite Automaton for the language. If you are starting with a regex, this can be done automatically by Thompson's Construction followed by the Subset Construction and minimization. See this description for example.
With a DFA in hand, you can use something like this algorithm:
Let P = { <START, [""]> } be a set of pairs <State, list of strings>
for n = 0, 1, ... Max
    Let P' = {} be a new set
    while P is not empty
        Remove the pair <s, L> from P
        For each transition s -- c --> t in alpha order of c
            if t is an accepting state,
                output l + c for each string l in L
            put <t, L + c> in P' (** i.e. append c to each string in L)
    end
    Set P = P'
end
Note that the step marked ** needs to be true set insertion, as duplicates can easily crop up.
This is a core algorithm. P can grow exponentially with output length, but this is just the price of tracking all possibilities for a future output string. The order/size/space constraints you mentioned can be ensured by maintaining sorted order in the lists L and by cutting off the search when resource limits are reached.
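As a rough illustration of the pseudocode, here is a minimal Java sketch. Everything in it is an assumption made for the example: the class name PairSetSketch, the method enumerate, the encoding of the DFA as one TreeMap per state (so transitions come out in alphabetical order), and the maxOutputs cutoff standing in for "resource limits". The set insertion marked ** is modelled by merging strings into the one sorted set kept for each target state. Accepted strings of each length are collected in a sorted set before being output, so the result comes out shortest first and alphabetical within each length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class PairSetSketch {
    // delta.get(s) maps an input character to the next state (hypothetical encoding).
    static List<String> enumerate(List<TreeMap<Character, Integer>> delta,
                                  boolean[] accepting, int start,
                                  int maxLength, int maxOutputs) {
        List<String> out = new ArrayList<>();
        if (accepting[start] && maxOutputs > 0) out.add("");

        // P: state -> sorted set of strings that reach that state.
        Map<Integer, TreeSet<String>> p = new TreeMap<>();
        TreeSet<String> initial = new TreeSet<>();
        initial.add("");
        p.put(start, initial);

        for (int n = 1; n <= maxLength && !p.isEmpty(); n++) {
            Map<Integer, TreeSet<String>> next = new TreeMap<>();
            TreeSet<String> accepted = new TreeSet<>();
            for (Map.Entry<Integer, TreeSet<String>> pair : p.entrySet()) {
                for (Map.Entry<Character, Integer> tr : delta.get(pair.getKey()).entrySet()) {
                    int t = tr.getValue();
                    // Step **: merge into the set already stored for state t.
                    TreeSet<String> target = next.computeIfAbsent(t, k -> new TreeSet<>());
                    for (String l : pair.getValue()) {
                        String s = l + tr.getKey();
                        target.add(s);
                        if (accepting[t]) accepted.add(s);
                    }
                }
            }
            // Cut the search off once the output budget is used up.
            for (String s : accepted) {
                if (out.size() >= maxOutputs) return out;
                out.add(s);
            }
            p = next;
        }
        return out;
    }

    // DFA for a*b: 0 --a--> 0, 0 --b--> 1, with state 1 accepting.
    static List<String> demo() {
        TreeMap<Character, Integer> from0 = new TreeMap<>();
        from0.put('a', 0);
        from0.put('b', 1);
        List<TreeMap<Character, Integer>> delta =
            Arrays.asList(from0, new TreeMap<Character, Integer>());
        return enumerate(delta, new boolean[] { false, true }, 0, 3, 10);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [b, ab, aab]
    }
}
```

Keying P by state keeps at most one entry per DFA state, so the number of pairs stays bounded even though the string sets themselves can still grow exponentially.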
Edit
Here is a toy Java example where I've hard-coded the DFA for simple binary floating point literals with optional minus sign. This uses a slightly different scheme than the pseudocode above to get strict sorted order of output and to accommodate character ranges.
import java.util.Comparator;
import java.util.TreeSet;

public class Test {

    public static class DFA {

        public static class Transition {
            final int to;
            final char lo, hi; // Character range.

            public Transition(int to, char lo, char hi) {
                this.to = to;
                this.lo = lo;
                this.hi = hi;
            }

            public Transition(int to, char ch) {
                this(to, ch, ch);
            }
        }

        // transitions[i] is a vector of transitions from state i.
        final Transition [] [] transitions;

        // accepting[i] is true iff state i is accepting
        final boolean [] accepting;

        // Make a fresh immutable DFA.
        public DFA(Transition [] [] transitions, boolean [] accepting) {
            this.transitions = transitions;
            this.accepting = accepting;
        }

        // A pair is a DFA state number and the input string read to get there.
        private static class Pair {
            final int at;
            final String s;

            Pair(int at, String s) {
                this.at = at;
                this.s = s;
            }
        }

        // Compare pairs ignoring `at` states, since
        // they are equal iff the strings are equal.
        private Comparator<Pair> emitOrder = new Comparator<Pair>() {
            @Override
            public int compare(Pair a, Pair b) {
                return a.s.compareTo(b.s);
            }
        };

        // Emit all strings of length 1..maxLength accepted by the DFA.
        // Output is sorted within each length (shortest strings first),
        // provided each state's transitions are listed in increasing
        // character order. The empty string is never emitted.
        void emit(int maxLength) {
            TreeSet<Pair> pairs = new TreeSet<Pair>(emitOrder);
            pairs.add(new Pair(0, ""));
            // Pass number `len` produces the strings of length `len`.
            for (int len = 1; len <= maxLength; ++len) {
                TreeSet<Pair> newPairs = new TreeSet<Pair>(emitOrder);
                while (!pairs.isEmpty()) {
                    Pair pair = pairs.pollFirst();
                    for (Transition x : transitions[pair.at]) {
                        // Use an int counter so a range ending at '\uffff' cannot wrap around.
                        for (int ch = x.lo; ch <= x.hi; ch++) {
                            String s = pair.s + (char) ch;
                            if (newPairs.add(new Pair(x.to, s)) && accepting[x.to]) {
                                System.out.println(s);
                            }
                        }
                    }
                }
                pairs = newPairs;
            }
        }
    }

    // Emit with a little DFA for floating point numbers.
    public void run() {
        DFA.Transition [] [] transitions = {
            { // From 0
                new DFA.Transition(1, '-'),
                new DFA.Transition(2, '.'),
                new DFA.Transition(3, '0', '1'),
            },
            { // From 1
                new DFA.Transition(2, '.'),
                new DFA.Transition(3, '0', '1'),
            },
            { // From 2
                new DFA.Transition(4, '0', '1'),
            },
            { // From 3 ('.' sorts before '0', so it is listed first)
                new DFA.Transition(4, '.'),
                new DFA.Transition(3, '0', '1'),
            },
            { // From 4
                new DFA.Transition(4, '0', '1'),
            }
        };
        boolean [] accepting = { false, false, false, true, true };
        new DFA(transitions, accepting).emit(4);
    }

    public static void main (String [] args) {
        new Test().run();
    }
}
[...] [A-Z]. and [A-Z]* (to within a fixed limit) alone, would be sufficient. – Wedgwood
[...] a*b? Consider that ab comes before b, and aab before ab, etc. Inductively, we can reason that the first string is an infinite repetition of a followed by a single b. Clearly we can't represent this, and so we have to set a fixed (i.e. arbitrary) limit on the expansion of wildcards. – Actuary
[...] b*a|c consists of the following elements in increasing order: a < ba < bba < bbba < ... < c, where the last element c cannot be indexed. Instead, I suggest using "shortlex order": First order by length (shortest first), then lexicographically (with strings of the same length). – Opsonin
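The shortlex order suggested in the last comment is easy to express as a Java comparator. The following is only a sketch; the class name Shortlex, the constant SHORTLEX and the helper sorted are made up for this example. It sorts a finite sample of the language b*a|c, and c now lands at a finite position instead of after infinitely many strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class Shortlex {
    // Shorter strings first; strings of equal length in ordinary String order.
    static final Comparator<String> SHORTLEX =
        Comparator.comparingInt(String::length)
                  .thenComparing(Comparator.<String>naturalOrder());

    // Return a shortlex-sorted copy, leaving the input list untouched.
    static List<String> sorted(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(SHORTLEX);
        return copy;
    }

    public static void main(String[] args) {
        // A finite sample of the language b*a|c.
        System.out.println(sorted(Arrays.asList("bba", "c", "ba", "a", "bbba")));
        // prints [a, c, ba, bba, bbba]
    }
}
```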