How does the CYK algorithm work?
Asked Answered
A

1

5

I have to check if a string can be derived from a given context free that is in Chomsky normal form. I'm using C++.

There is very nice pseudocode on the Wikipedia article covering the CYK algorithm, but I can't understand it very well.

Would someone be so kind to help me out by giving me another pseudocode for CYK algorithm, or maybe explain the one in the wiki article?

Acquiescence answered 5/12, 2012 at 17:7 Comment(7)
As much as I like Wikipedia, it isn't always the most readable source. For technical information for the uninitiated, it's usually best to seek alternate sources. Have you googled other locations for CYK?Charin
i've done a Google search but I either turn up actual code done by someone on a level that I can't understand, or I find the algorithm for doing it by hand which I have had a had time even beginning to trandlate to code.Acquiescence
Yeah, a lot of the links are not very readable. There's a demo if you want to get familiar with it at: diotavelli.net/people/void/demos/cky.html. Additionally, here's a series of slides that seems more readable: cs.ucdavis.edu/~rogaway/classes/120/winter12/CYK.pdf. Finally, here's a C++ implementation: nitishkr.wordpress.com/2011/03/29/cyk-algorithm-implementationCharin
I would like to add that this is all stuff I have found earlier but I gave it a reread since you found it too. I'm not much better off though :/Acquiescence
Sorry to hear that... I think I understand the CYK algorithm, the problem is that it's not necessarily easy to explain as it may need grounding in some other terminology that isn't necessarily CS. If you want to take this to chat, maybe I can help, but it's not anything suitable for an answer as it would involve a back & forthCharin
Actually I figured it out! After watching that demo over and over again I came up with the for loops that mimic that pattern of checking and it all came together beautifully :)Acquiescence
Wonderful! I'm glad to hear it.Charin
H
7

The CYK algorithm takes as input a CFG that's in Chomsky normal form. That means that every production either has the form

  • S → a, for some terminal a, or
  • S → AB, for some nonterminals A and B.

Now, imagine you have a string w and you want to see whether you can derive it from a grammar whose start symbol is S. There are two options:

  1. If w is a single character long, then the only way to parse it would be to use a production of the form S → a for some character a. So see whether any of the single-character productions would match a.
  2. If w is more than one character long, then the only way to parse it is to use a production of the form S → AB for some nonterminals A and B. That means that we need to divide the string w into two pieces x and y where A derives x and B derives y. One way to do that is to try all possible ways of splitting w into two pieces and to see if any of them work.

Notice that option (2) here ends up being a recursive parsing problem: to see whether you can derive w from S, see whether you can derive x from A and y from B.

With that insight, here's pseudocode for recursive function you can use to see whether a nonterminal S derives a string w:

bool canDerive(nonterminal S, string w) {
    return canDeriveRec(S, w, 0, w.size());
}

/* Can you derive the substring [start, end) of w from S? */
bool canDeriveRec(nonterminal S, string w, int start, int end) {
    /* Base case: Single characters */
    if (end - start == 1) {
        return whether there is a production S -> a, where a = w[start];
    }

    /* Recursive case: Try all possible splits */
    for (each production S -> AB) {
        for (int mid = start + 1; mid < end; mid++) {
            if (canDeriveRec(A, w, start, mid) &&
                canDeriveRec(B, w, mid, end)) {
                return true;
            }
        }
    }
    return false;
}

This algorithm works correctly, but if you map out the shape of the recursion you'll find that

  1. it makes a ton of redundant recursive calls, but
  2. there aren't that many different possible recursive calls.

In fact, the number of distinct possible calls is O(n2 N), where n is the length of the input string (for each possible combination of a start and end index) and N is the number of nonterminals in the grammar. These observations suggest that this algorithm would benefit either from memoization or dynamic programming, depending on which approach you happen to think is nicer.

The CYK algorithm is what you get when you take the above recursive algorithm and memoize the result, or equivalently when you convert the above recursive algorithm into a dynamic programming problem.

There are O(n2 N) possible recursive calls. For each production tried, it does O(n) work. If there are P productions, on average, for a nonterminal, this means the overall runtime is O(n3 NP), which is O(n3) for a fixed grammar.

Heuristic answered 19/7, 2017 at 21:19 Comment(7)
Thanks for your great explanation, just wondering if thats a bottom up or top down design as I am bit confused?Fleeman
It's a little bit of both. If you think of it as a dynamic programming algorithm, then it's a bottom-up approach that determines all possible productions that could yield each of the different substrings in increasing order of length. If you think of it as a memoization approach, then it's top-down. Many parsing algorithms have that feel of mixing and matching.Heuristic
Thanks for your reply, the pseudocode here en.wikipedia.org/wiki/CYK_algorithm#As_pseudocode is bottom up approach and I am wondering what shall I change to make it top down design?Fleeman
The pseudocode I've given in my answer is a good template for how you'd implement things top-down. Just add memoization and you're done! (Also, while CYK is commonly taught as a good general parsing algorithm, Earley's algorithm is way more flexible - it doesn't need things to be in CNF - and in practice can be a lot faster. One of my TAs implemented it as part of a CFG autograding tool and we've had lots of successes with it.)Heuristic
Thank you, I will add memorization to your answer, and I will add a new question with my new code so it will be great if you could please check it, is that alright?Fleeman
I'd prefer if you left this answer unmodified or let me take a stab at it, but feel free to post your own question!Heuristic
It will be great if you could please update it by adding memorization, as the new question will be almost the sameFleeman

© 2022 - 2024 — McMap. All rights reserved.