How does this code for obtaining LCP from a Suffix Array work?

First, it's important to realize that the algorithm processes the suffixes in the original order, i.e. the order in which they appear in the input string. Not in lexicographical order.

So if the input string is abxabc, it first considers abxabc, then bxabc, then xabc and so forth.

For each suffix it considers in this order, it determines the position of the suffix that is its lexicographical predecessor^(*) (so here, and only here, it uses the concept of lexicographical order). For the first suffix abxabc, the lexicographical predecessor, i.e. the suffix that appears directly before it in the lexicographical ordering of suffixes, is abc. It determines this by an O(1) lookup in the array C, which has been prepared specifically for this purpose.

The inner loop compares the characters of abxabc and abc one by one and determines that these two suffixes have the first 2 characters in common. This is the variable l in your code, and it means that the entry in LCP for the suffix abxabc must be 2, so we set LCPadj[i] = l. Note that i here refers to the position of the suffix in the input string, not to its position in the suffix array. So LCPadj is not the LCP array (yet). It's an auxiliary data structure.

Then it proceeds to the next string, which is bxabc. Again it uses C to find that bc is the lexicographical predecessor of that and then determines how many prefix characters the two share. And here comes the trick: It can be sure that this must be at least as many as in the previous step (i.e. 2), minus 1. Why? Because the string we currently consider, bxabc, is of course a suffix of the string previously considered (abxabc), therefore the lexicographical predecessor of that string (abc) must also have a suffix that is 1 character shorter (bc), and that suffix must also be somewhere in the suffix array, and it must share its prefix with the currently considered string, minus the first character. Moreover, there can't be any other suffix that is both shorter and lexicographically closer to the string currently considered. The latter is quite logical if you think about how lexicographical sorting works, but there are also formal proofs of this (e.g. Lemma 5.10 in Kärkkäinen's lecture here)

So that describes the main principle at work here. There are a few things to note about your code to fully understand the role of each variable:

As explained, C is an auxiliary array (n integers in length) that stores, for each suffix in the input string, the position of that other suffix that is its immediate lexicographical predecessor. This array is constructed not from left to right, but (wisely) by going through the suffix array from left to right, because that makes it easy to determine the immediate lexicogaphical predecessor of any string: The immediate lexicographical predecessor of the suffix starting at position suffixArr[i] must of course be located at position suffixArr[i-1]. Confirm in your code that this is how C is defined.
As mentioned above, LCPadj stores LCP values for the suffix in the order in which they appear in the input string, not the order in which they appear in the suffix array. That is why at output time, LCPadj is not printed from left to right, but by going through the suffix array from left to right, and printing LCPadj[i] in that order. Confirm that this is the case.

I hope this helps. Let me know if not.

^(*)By lexicographical predecessor I mean the immediate predecessor of the suffix in the lexicographically ordered list of suffixes, i.e. the suffix to its immediate left in the suffix array.

Recommended topics

Hot tags