Suffix array nlogn creation
Asked Answered
A

3

11

I have been learning suffix arrays creation, & i understand that We first sort all suffixes according to first character, then according to first 2 characters, then first 4 characters and so on while the number of characters to be considered is smaller than 2n.

But my doubt is why don't we choose the first 3 characters, then 9... and so on. Why only 2 characters are taken into account since the strings are a part of same strings and not different random strings?

Acknowledgment answered 17/7, 2016 at 7:48 Comment(1)
It is the same as why binary search divide by half instead of dividing into 3 parts. 2 is chosen because it is simpler. In terms of complexity, both are the same.Blower
P
6

I haven't analyzed the suffix array construction algorithm thoroughly, but still would like to share my thoughts.

In my humble opinion, your question is similar to the following ones:

  • Why do computers use binary encoding of information instead of ternary?

  • Why does binary search bisect the range instead of trisecting it?

  • Why are there two sexes rather than three?

The reason is that the number 2 is special - it is the smallest plural number. The difference between 1 and 2 is qualitative, whereas the difference between 2 and 3 (as well as any other positive integer) is quantitative and therefore not as drastic.

As a result, binary formulation of many algorithms and data structures turns out to be the simplest one, though some of them may be generalized, with various degrees of added complexity, for an arbitrary base.

Pincer answered 19/7, 2016 at 15:41 Comment(0)
I
2

Answer is given from the post you linked. And as @Leon answered, the algorithm work because it use a dichotomous approach to solve the sorting problem. if you correctly read the answer, the main purpose is to divide word be small 2 character fragments. So that 4 characters can be easily sort base on the arrangement of the 2 pair of characters, 6 characters with 4-2 or 2-4 or 2-2-2 and so one. Thus have a word of 3 letters in the table is non-sense since word of 3 characters may be seen has 2 characters + the position in the alphabet of the last character.

Itu answered 21/7, 2016 at 8:43 Comment(0)
B
2

I think you are considering only the speed of 2^x versus 3^x where you obviously would prefer the latter. But you have to consider the effort you need for each step. Since 3^x needs about 1.58 less steps than 2^x you would need to be able to compute a single step for the 3^x growth in less than 1.58 times what you need for a single step in the 2^x growth to perform better. Generally the problems will get much more complex when you have to handle three elements in each step instead of two. Also if you could expand it to 3^x you could also do it for a bigger n^x and then with big n your algorithm is suddenly not exponential but effectively linear.

Buckley answered 21/7, 2016 at 9:18 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.