How to Modify a Suffix Array to search multiple strings?

Asked 17/7, 2015 at 15:2 Answered 2/2, 2021 at 8:14

I've recently been updating my knowledge of algorithms and have been reading up on suffix arrays. Every text I've read has defined them as an array of suffixes over a single search string, but some articles have mentioned its 'trivial' to generalize to an entire list of search strings, but I can't see how.

Assume I'm trying to implement a simple substring search over a word list and wish to return a list of words matching a given substring. The naive approach would appear to be to insert the lexicographic end character '$' between words in my list, concatenate them all together, and produce a suffix tree from the result. But this would seem to generate large numbers of irrelevant entries. If I create a source string of 'banana$muffin' then I'll end up generating suffixes for 'ana$muffin' which I'll never use.

I'd appreciate any hints as to how to do this right, or better yet, a pointer to some algorithm texts that handle this case.

Jemmie answered 17/7, 2015 at 15:2 Comment(4)

Take into account that suffix arrays and suffix trees require Theta(n) time for construction and Theta(n) storage, so complexity-wise, there's no time or space wasted. – Decile 17/7, 2015 at 15:40

From another point of view, any information about a non-useful suffix ana$muffin you store is in fact related to the useful substring ana$, the tail is irrelevant. – Decile 17/7, 2015 at 15:42

The basic idea in building a generalised suffix tree or array is to insert distinct "end characters" $, #, @ etc. between each pair of strings. No character in your input string will ever match any of these "characters", so there is no chance that a substring match can "spill over" the boundary between two strings. – Buttery 17/7, 2015 at 17:42

Yes, the trivial generalization everyone speaks about is to concatenate all words with different maximal sentinel characters in between. Note that suffix 'ana$muffin' helps you to find words like 'a', 'an', 'ana' in the 'banana' word from your list, so it is not irrelevant. – Chestonchest 18/7, 2015 at 10:30

After having now read through most of the book Algorithms on Strings, Trees and Sequences by Dan Gusfield, the answer seems clear.

If you start with a multi-string suffix tree, one of the standard conversion algorithms will still work. However, instead of having getting an array of integers, you end up with an array of lists. Each lists contains one or more pairs of a string identifier and a starting offset in that string.

The resulting structure is still useful, but not as efficient as a normal suffix array.

Jemmie answered 2/2, 2016 at 19:38 Comment(0)

In Suffix-Arrays you usually don't use strings, just one string. That will be the concatenated version of several strings with some endtoken (a different one for every string). For the Suffix Arrays, you use pointers (or the array index) to reference the suffix (only the position for the first token/character is needed). So the space required is the array + for each suffix the pointer. (that is just a pretty simple implementation, you should do more, to get more performance).

In that case you could optimise the sorting algorithm for the suffixes, since you only need to sort those suffixes the pointers are referencing to, till the endtokens. Everything behind the endtoken does not need to be used in the sorting algorithm.

Colby answered 30/7, 2015 at 9:15 Comment(0)

After having now read through most of the book Algorithms on Strings, Trees and Sequences by Dan Gusfield, the answer seems clear.

The resulting structure is still useful, but not as efficient as a normal suffix array.

Jemmie answered 2/2, 2016 at 19:38 Comment(0)

From Iowa State University, taken from Prefix.pdf:

Suffix trees and suffix arrays can be generalized to multiple strings. The generalized suffix tree of a set of strings S = {s1, s2, . . . , sk}, denoted GST(S) or simply GST, is a compacted trie of all suffixes of each string in S. We assume that the unique termination character $ is appended to the end of each string. A leaf label now consists of a pair of integers (i, j), where i denotes the suffix is from string si and j denotes the starting position of the suffix in si . Similarly, an edge label in a GST is a substring of one of the strings. An edge label is represented by a triplet of integers (i, j, l), where i denotes the string number, and j and l denote the starting and ending positions of the substring in si . For convenience of understanding, we will continue to show the actual edge labels. Note that two strings may have identical suffixes. This is compensated by allowing leaves in the tree to have multiple labels. If a leaf is multiply labelled, each suffix should come from a different string. If N is the total number of characters (including the $ in each string) of all strings in S, the GST has at most N leaf nodes and takes up O(N) space. The generalized suffix array of S, denoted GSA(S) or simply GSA, is a lexicographically sorted array of all suffixes of each string in S. Each suffix is represented by an integer pair (i, j) denoting suffix starting from position j in si . If suffixes from different strings are identical, they occupy consecutive positions in the GSA. For convenience, we make an exception for the suffix $ by listing it only once, though it occurs in each string. The GST and GSA of strings apple and maple are shown in Figure 1.2.

Here you have an article about an algorithm to construct a GSA:

Generalized enhanced suffix array construction in external memory

Linguini answered 2/2, 2021 at 8:14 Comment(0)

Recommended topics

Hot tags