How to enumerate the strings of a context-free grammar?

Here's an obvious but somewhat inefficient algorithm:

Construct R, the Earley parser for the grammar.

For each string S in A* (A is the alphabet for the grammar):
  If R recognizes S:
    Output S

Here I skip the details of constructing R -- see, for example, Earley's thesis or, more concisely, the Wikipedia article on the Earley algorithm. I also skip the problem of enumerating all strings in A*, which is a simple base |A| counter.

Clearly, this algorithm can be made more efficient by using the Earley parser itself to avoid (some) dead-ends. Instead of enumerating all strings in A*, we start with a queue of <string, state-set> tuples, initialized to the tuple consisting of the empty string and the empty state set. We then (infinitely) remove one tuple from the head of the queue and add to the end of the queue all possible tuples which can be constructed by feeding one symbol from A into the Earley parser (typically, the parser will not be able to handle every symbol; in fact, it may not be able to handle any). If the parser recognizes the string in the tuple, we output it.

In both cases, if we know that the given grammar belongs to some more easily parseable subset of CFGs, we could substitute the Earley parser for a more efficient parser for the grammar.

Both of the above algorithms have the advantage of generating the strings of the language in a simple predictable order: first by length, and within a given length, lexicographically, which guarantees that each string will be generated exactly once even if the grammar is ambiguous.

Another solution is to construct the strings in order by the number of reductions required; in effect, this generates all (leftmost) reductions. Here we start a queue with the start symbol, and then repeatedly:

Remove the first sentence in the queue.
If it contains only terminals, output it.
Otherwise, for each production for the first non-terminal in the sentence,
  append to the queue the result of expanding that production.

The above algorithm will work fine for unambiguous grammars but given an ambiguous grammar, it will generate sentences multiple times (once per leftmost derivation). To fix that problem, we could first convert the grammar to Chomsky Normal Form (see link for algorithm). We then create a total ordering for terminals and non-terminals in which the non-terminals all precede the terminals, and a corresponding order for sentences in which a shorter sentence precedes a longer sentence, and equal-length sentences are ordered lexicographically. Then we use the above algorithm, but using a priority queue instead of a FIFO queue, and eliminate duplicate entries before processing them.

In CNF, with the length-then-lexicographic ordering, all productions are increasing, since they either replace a non-terminal with a terminal, or they make the sentence one symbol longer. (The rest of the proof of correctness is by induction.) Consequently, the fully-derived sentences will be enumerated in length-then-lexicographic order, just like the naive algorithm which started this answer.

Recommended topics

Hot tags