I'm reading the block sort algorithm from the Burrows and Wheeler paper. This a step of the algorithm:
Suppose S= abracadabra
Initialize an array W of N words W[0, ... , N - 1], such that W[i] contains the characters S'[i, ... , i + k - 1] arranged so that integer comparisons on the words agree with lexicographic comparisons on the k-character strings. Packing characters into words has two benefits: It allows two prefixes to be compared k bytes at a time using aligned memory accesses, and it allows many slow cases to be eliminated
(Note: S'
is the original S
with k EOF
characters appended to it, k being the number of characters that fit in a machine word (I'm in a 32 bits machine, so k=4
)
EOF = '$'
Correct me if I'm wrong:
S'= abracadabra$$$$
W= abra brac raca acad cada adab dabr abra bra$ ra$$ a$$$
Then, the algorithm says you have to sort the suffix array of S
(named V), by indexing into
the array W
.
I don't fully understand how can you sort suffixes by indexing into W
.
For example: at some point of the sorting, suppose you get two suffixes, i
and j
, and you have to compare them. Since you are indexing into W
, you are checking 4 characters at the time.
Suppose they have both the same first 4 characters. Then, you would have to check, for each suffix their next 4 characters, and you do it by accessing from the 4th position of each suffix in W
.
Is this right? Does this "packing characters into words" really speed things up?
abra brac raca...
) is the type of packing the authors intended? Could it beabra cada...
? Can you provide a little more context for the quote? – Vikiviking