How does the CYK algorithm work?

The CYK algorithm takes as input a CFG that's in Chomsky normal form. That means that every production either has the form

S → a, for some terminal a, or
S → AB, for some nonterminals A and B.

Now, imagine you have a string w and you want to see whether you can derive it from a grammar whose start symbol is S. There are two options:

If w is a single character long, then the only way to parse it would be to use a production of the form S → a for some character a. So see whether any of the single-character productions would match a.
If w is more than one character long, then the only way to parse it is to use a production of the form S → AB for some nonterminals A and B. That means that we need to divide the string w into two pieces x and y where A derives x and B derives y. One way to do that is to try all possible ways of splitting w into two pieces and to see if any of them work.

Notice that option (2) here ends up being a recursive parsing problem: to see whether you can derive w from S, see whether you can derive x from A and y from B.

With that insight, here's pseudocode for recursive function you can use to see whether a nonterminal S derives a string w:

bool canDerive(nonterminal S, string w) {
    return canDeriveRec(S, w, 0, w.size());
}

/* Can you derive the substring [start, end) of w from S? */
bool canDeriveRec(nonterminal S, string w, int start, int end) {
    /* Base case: Single characters */
    if (end - start == 1) {
        return whether there is a production S -> a, where a = w[start];
    }

    /* Recursive case: Try all possible splits */
    for (each production S -> AB) {
        for (int mid = start + 1; mid < end; mid++) {
            if (canDeriveRec(A, w, start, mid) &&
                canDeriveRec(B, w, mid, end)) {
                return true;
            }
        }
    }
    return false;
}

This algorithm works correctly, but if you map out the shape of the recursion you'll find that

it makes a ton of redundant recursive calls, but
there aren't that many different possible recursive calls.

In fact, the number of distinct possible calls is O(n² N), where n is the length of the input string (for each possible combination of a start and end index) and N is the number of nonterminals in the grammar. These observations suggest that this algorithm would benefit either from memoization or dynamic programming, depending on which approach you happen to think is nicer.

The CYK algorithm is what you get when you take the above recursive algorithm and memoize the result, or equivalently when you convert the above recursive algorithm into a dynamic programming problem.

There are O(n² N) possible recursive calls. For each production tried, it does O(n) work. If there are P productions, on average, for a nonterminal, this means the overall runtime is O(n³ NP), which is O(n³) for a fixed grammar.

Recommended topics

Hot tags