Trie vs. suffix tree vs. suffix array
Asked Answered
F

7

46

Which structure provides the best performance results; trie (prefix tree), suffix tree or suffix array? Are there other similar structures? What are good Java implementations of these structures?

Edit: in this case I want to make string matching between a large dictionary of names and a large set of natural language texts, in order to identify the names of the dictionary on texts.

Folklore answered 21/3, 2010 at 15:26 Comment(1)
Best performance for what operations?Brakpan
I
68

The trie was the first data structure of this kind discovered.

The suffix tree is an improvement over the trie (it has suffix links which allow linear error search, the suffix tree trims unnecessary branches of the trie therefore it does not require as much space).

The suffix array is a stripped down data structure based on the suffix tree (no suffix links (slow error matches), yet pattern matching is very fast).

The trie is not for real world use because it consumes too much space.

The suffix tree is lighter and faster than the trie and is used to index DNA or optimize some large web search engines.

The suffix array is slower in some pattern searches than the suffix tree but uses less space, and is more widely used than the Suffix tree.

In the same family of data structures:

There are other implementations, the CST is an implementation of the suffix tree using a suffix array and some additional data structures to get some of the suffix tree search capabilities.

The FCST takes it further, it implements a sampled suffix tree with a suffix array.

The DFCST is a dynamic version of the FCST.

Expanding:

The two important factors are space use and operation execution time. You might think that with modern day machines this is not relevant but to index the DNA of a single human being would require 40 gigabytes of memory (using an uncompressed and unoptimized suffix tree). And to build one of this indexes over this much data can take days. Imagine Google, it has lots of searchable data, they need a large index over all web data and they do not change it every time someone builds a web page. They have some form of caching for that. However the main index is probably static. And every couple of weeks or so they gather all new web sites and data and build a new index, replacing the old when the new is finished. I do not know which algorithm they use to index, but it is probably a suffix array with suffix tree properties over a partitioned database.

The CST uses 8 gigabytes, however the suffix tree operations speed are heavily reduced.

The suffix array can do the same in some 700 megas to 2 Gigas. However you will not find genetic errors in the DNA with a suffix array (meaning: searching for a pattern with a wildcard is much much slower).

The FCST (fully compressed suffix tree) can create a suffix tree in 800 to 1.5 gigas. With a rather small speed deterioration towards the CST.

The DFCST uses 20% more space than the FCST, and loses speed to the static implementation of the FCST (however a dynamic index is very important).

There are not many viable (space wise) implementations of the suffix tree because it is very hard to make the operations speed boost compensate the data structures RAM space cost.

This said, the suffix tree has very interesting search results for pattern matching with errors. The aho corasick is not as fast (though nearly as fast for some operations, not error matching) and the boyer moore is left in the dust.

Inscription answered 17/7, 2011 at 10:43 Comment(3)
Linear error search is a error search which returns all possible error matches in linear time. For example, a text has the words "House", "Housa", "Hotse" somewhere. A constant error match would be returning all errors in one operation. Linear error match returns all errors (matches) in COUNT(matches). In this case 2. Some might interpret this as a linear search on the size of the text (text scan for the error), and therefore the cost would be equal to the size of the text. Which is the case of nearly all error search algorithms, however its not the case for the suffix tree.Inscription
The advantage of suffix array is uses less space than suffix tree. But how can we know that ? Is there any mathematical proof or we base on the practical experiments ?Metalware
It is based in mathematical proof and practical experiments. The compressed suffix tree proposed by Sadakane in 2007 uses nHk + 6n + o(n) of space, where nHk is the suffix array. So you get a perspective of relative space use, where SA is actually a part of the implementation of the ST. There are several implementations that prove real space use as well. You can see his proposal or mine. And Sadakane here: Kunihiko Sadakane.Theor. Comp. Sys., 41(4):589–607, 2007. Mine : An implementation of dynamic fully compressed suffix trees. By Miguel Figueiredo hdl.handle.net/10362/4328Inscription
G
4

What operations do you plan on doing? libdivsufsort was at one time the best suffix array implementation in C.

Gahnite answered 31/3, 2011 at 1:1 Comment(2)
What is the technique of achieving that efficiency for the suffix array construction?Angelia
Timely question. AmpLab just released a parallel version with in depth explaination, amplab.cs.berkeley.edu/publication/…Gahnite
H
2

Using Suffix Trees you can write something that will match your dictionary to your text in O(n+m+k) time where n is letters in your dictionary, m is letters in your text, and k is the number of matches. Tries are much slower for this. I'm not sure what a Suffix Array is, so I can't comment on that.

That said, it's non-trivial to code and I don't happen to know of any Java libraries that provide the necessary functions.

Hubie answered 21/3, 2010 at 16:12 Comment(1)
Yeah, I've been reading up on Suffix Arrays. Turn out they have the same asymptotic speed as Suffix trees but are more space efficient. They're definitely an option.Hubie
S
1

EDIT: In this case I want to make string matching between a large dictionary of names and a large set of natural language texts, in order to identify the names of the dictionary on texts.

This sounds like an application for the Aho-Corasick algorithm: construct an automaton from the dictionary (in linear time), which can then be used to find all the occurrences of any of the dictionary words in multiple texts (also in linear time).

(The description in these lecture notes, linked from the "External links" section of the Wikipedia page, is a lot easier to read than the description on the page itself.)

Stuartstub answered 21/3, 2010 at 17:51 Comment(0)
C
1

I prefer Suffix Auto Machine. You can find more details through my website: http://www.fogsail.net/2019/03/06/20190306/

enter image description here

first, If you used normal construction, it will takes O(n^2) to travel all the suffix

We use radix-sort to sort the suffix Array by first character.

But, if we sort the first character, we can use the information.

Details are showed by the images (neglect Chinese)

We sort array by the first-key, the result is presented by the red rectangle

𝑎𝑎𝑏𝑎𝑎𝑎𝑎𝑏,𝑎𝑏𝑎𝑎𝑎𝑎𝑏,....−−>𝑎𝑎𝑏𝑎𝑎𝑎𝑎𝑏,𝑎𝑎𝑎𝑎𝑎𝑎𝑏,....

enter image description here

#include <iostream>
#include <cstdio>
#include <vector>
#include <queue>
#include <cstring>
#include <algorithm>

using namespace std;
const int maxn = 1001 * 100 + 10;

struct suffixArray {
    int str[maxn], sa[maxn], rank[maxn], lcp[maxn];
    int c[maxn], t1[maxn], t2[maxn];
    int n;

    void init() { n = 0; memset(sa, 0, sizeof(sa)); }

    void buildSa(int Rdx) {
        int i, *x = t1, *y = t2;
        for(i = 0; i < Rdx; i++) c[i] = 0;
        for(i = 0; i < n; i++) c[x[i] = str[i]]++;
        for(i = 1; i < Rdx; i++) c[i] += c[i-1];
        for(i = n-1; i >= 0; i--) sa[--c[x[i]]] = i;

        for(int offset = 1; offset <= n; offset <<= 1) {
            int p = 0;
            for(i = n-offset; i < n; i++) y[p++] = i;
            for(i = 0; i < n; i++) if(sa[i] >= offset) y[p++] = sa[i] - offset;

            // radix sort
            for(i = 0; i < Rdx; i++) c[i] = 0;
            for(i = 0; i < n; i++) c[x[y[i]]]++;
            for(i = 1; i < Rdx; i++) c[i] += c[i-1];
            for(i = n-1; i >= 0; i--) { sa[--c[x[y[i]]]] = y[i]; y[i] = 0; }

            // rebuild x and y
            swap(x, y); x[sa[0]] = 0; p = 1;
            for(i = 1; i < n; i++)
                x[sa[i]] = y[sa[i-1]] == y[sa[i]] && y[sa[i-1]+offset] == y[sa[i]+offset] ? p-1 : p++;
            if(p >= n) break;
            Rdx = p;
        }
    }
Conferral answered 19/5, 2019 at 9:20 Comment(0)
D
0

This implementation of the induced sorting algorithm (called sais) has a Java version for constructing suffix arrays.

Dhu answered 15/7, 2011 at 20:56 Comment(0)
E
0

Trie vs Suffix tree

both data structure ensure a very fast look up, the time of search is proportional to the lenght of the query word, complexity time O(m) where m is the lenght of the query word.

it's mean if we have query word that have 10 chars, so we need at most 10 steps to find it.

Trie : A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.

suffix tree: A compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with their parents.

def are from : Dictionary of Algorithms and Data Structures

generally Trie used to index dictionary words (lexicon) or any sets of strings example D={abcd, abcdd, bxcdf,.....,zzzz }

a suffix tree used to index text by using the same data structure "Trie" on all suffixes of our text T=abcdabcg all suffixes of T = {abcdabcg , abcdabc , abcdab, abcda, abcd, abc , ab, a}

now it look like a groups of strings. we build a Trie over over this groups of strings (all suffixes of T).

the construction of both data structure is in linear, it take O(n) in time and space.

in case of dicionary (a set of strings): n = the sum of the characters of all the words. in text : n = length of text.

suffix array : is a technic to represent a suffix tree in compressed sapce, it's an array of all starting positions of suffixes of a string.

it's slower than suffix tree in search time.

for more information go to wikipedia , there is a good article talking on this topic.

Engrossing answered 17/3, 2014 at 13:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.