Scalability of aho corasick

L

4

13

I want to search a text document for occurrences of keyphrases from a database of keyphrases (extracted from wikipedia article titles). (ie. given a document i want to find whether any of the phrases have a corresponding wikipedia article) I found out about the Aho-Corasick algorithm. I want to know if building an Aho-Corasick automaton for dictionary of millions of entries is efficient and scalable.

Leia answered 27/2, 2011 at 15:4 Comment(0)

L

8

In theory, it should maintain linear speed subject only to the effects of the memory hierarchy - it will slow down as it gets too big to fit in cache, and when it gets really big, you'll have problems if it starts getting paged out.

OTOH the big win with Aho-Corasick is when searching for decent sized substrings that may occur at any possible location within the string being fed in. If your text document is already cut up into words, and your search phrases are no more than e.g. 6 words long, then you could build a hash table of K-word phrases, and then look up every K-word contiguous section of words from the input text in it, for K = 1..6.

(Answer to comment)

Aho-Corasick needs to live in memory, because you will be following pointers all over the place. If you have to work outside memory, it's probably easiest to go back to old-fashioned sort/merge. Create a file of K-word records from the input data, where K is the maximum number of words in any phrase you are interested in. Sort it, and then merge it against a file of sorted Wikipedia phrases. You can probably do this almost by hand on Unix/Linux, using utilities such as sort and join, and a bit of shell/awk/perl/whatever. See also http://en.wikipedia.org/wiki/Key_Word_in_Context (I'm old enough to have actually used one of these indexes, provided as bound pages of computer printout).

Lymanlymann answered 27/2, 2011 at 17:27 Comment(3)

so the tree/hash would have to be completely in memory? i have around 8 million phrases in the dictionary, so a completely in memory Data Structure is difficult i guess.. – Leia 27/2, 2011 at 19:21

in relation to K-Word hash set solution.. if i use a bloom filter of the 8million entry dictionary, can it stay in memory and be fast and efficient? a small false positive rate is acceptable because in later stages of my application i'll be looking up details of the matches, so i can eliminate them.. – Leia 28/2, 2011 at 7:28

That sounds plausible - I thought you might get away with Aho-Corasick on a big enough machine, but I have no idea how big a machine you have and not much feel for the constants involved. The Wikipedia entry en.wikipedia.org/wiki/Bloom_filter gives you a formula at the bottom for the required number of Bloom filter bits to support any given number of entries and false positive rate - put in your size and required false positive rate and see if you can afford the result. – Lymanlymann 5/3, 2011 at 12:20

C

13

Let's just make a simple calculations :

Assume that you have 1 million patterns (strings, phrases) with average length 10 characters and a value (label, token, pointer etc) of 1 word (4 bytes) length , assigned to each pattern

Then you will need an array of 10+4=14 million bytes (14Mb) just to hold the list of patterns.

From 1 million patterns 10 bytes (letters, chars) each you could build an AC trie with no more than 10 million nodes. How big is this trie in practice depends on the size of each node. It should at least keep 1 byte for a label (letter) and word (4bytes) for a pointer to a next node in trie (or a pattern for a terminal node) plus 1 bit (boolean) to mark terminal node, Total about 5 bytes

So, the minimum size of a trie for 1 million patterns 10 chars you will need min 50 million bytes or about 50 Mb of memory.

In practice it might be 3-10 times more , but yet is very-very manageable, as even 500Mb memory is very moderate today. (Compare it with Windows applications like Word or Outlook)

Given that in terms of speed Aho-Corasick (AC) algorithm is almost unbeatable, it still remains the best algorithm for multiple pattern match ever. That's my strong personal educated opinion apart from academic garbage .

All reports of "new" latest and greatest algorithms that might outperform AC are highly exaggerated (except maybe for some special cases with short patterns like DNA)

The only improvement of AC could in practice go along the line of more and faster hardware (multiple cores, faster CPUs, clusters etc)

Don't take my word for it, test it for yourself. But remember that real speed of AC strongly depends on implementation (language and quality of coding)

Cay answered 4/12, 2012 at 16:44 Comment(1)

As it happens, University of Queensland Oncology is using an A-C implementation to search DNA (for cancer indicators) --- DNA search has a tiny alphabet but l-o-n-g patterns. – Crean 25/8, 2018 at 21:13

L

8

In theory, it should maintain linear speed subject only to the effects of the memory hierarchy - it will slow down as it gets too big to fit in cache, and when it gets really big, you'll have problems if it starts getting paged out.

OTOH the big win with Aho-Corasick is when searching for decent sized substrings that may occur at any possible location within the string being fed in. If your text document is already cut up into words, and your search phrases are no more than e.g. 6 words long, then you could build a hash table of K-word phrases, and then look up every K-word contiguous section of words from the input text in it, for K = 1..6.

(Answer to comment)

Aho-Corasick needs to live in memory, because you will be following pointers all over the place. If you have to work outside memory, it's probably easiest to go back to old-fashioned sort/merge. Create a file of K-word records from the input data, where K is the maximum number of words in any phrase you are interested in. Sort it, and then merge it against a file of sorted Wikipedia phrases. You can probably do this almost by hand on Unix/Linux, using utilities such as sort and join, and a bit of shell/awk/perl/whatever. See also http://en.wikipedia.org/wiki/Key_Word_in_Context (I'm old enough to have actually used one of these indexes, provided as bound pages of computer printout).

Lymanlymann answered 27/2, 2011 at 17:27 Comment(3)

so the tree/hash would have to be completely in memory? i have around 8 million phrases in the dictionary, so a completely in memory Data Structure is difficult i guess.. – Leia 27/2, 2011 at 19:21

in relation to K-Word hash set solution.. if i use a bloom filter of the 8million entry dictionary, can it stay in memory and be fast and efficient? a small false positive rate is acceptable because in later stages of my application i'll be looking up details of the matches, so i can eliminate them.. – Leia 28/2, 2011 at 7:28

That sounds plausible - I thought you might get away with Aho-Corasick on a big enough machine, but I have no idea how big a machine you have and not much feel for the constants involved. The Wikipedia entry en.wikipedia.org/wiki/Bloom_filter gives you a formula at the bottom for the required number of Bloom filter bits to support any given number of entries and false positive rate - put in your size and required false positive rate and see if you can afford the result. – Lymanlymann 5/3, 2011 at 12:20

H

1

Well there is a workaround. By writing the built AC trie of dictionary into a text file in a xml-like format, making an index file for the first 6 levels of that trie, etc... In my tests I search for all partial matches of a sentence in the dictionary (500'000 entries), and I get ~150ms for ~100 results for a sentence of 150-200 symbols.

For more details, check out this paper : http://212.34.233.26/aram/IJITA17v2A.Avetisyan.doc

Horowitz answered 7/6, 2011 at 14:42 Comment(0)

C

0

There are other ways to get performance: - condense state transitions: you can get them down to 32 bits. - ditch the pointers; write the state transitions to a flat vector. - pack nodes near the tree root together: they will be in cache. The implementation takes about 3 bytes per char of the original pattern set, and for 32-bit nodes, can take a pattern space of about 10M chars. For 64-bit nodes, have yet to hit (or figure) the limit.

Doc: https://docs.google.com/document/d/1e9Qbn22__togYgQ7PNyCz3YzIIVPKvrf8PCrFa74IFM/view Src: https://github.com/mischasan/aho-corasick

Crean answered 25/8, 2018 at 21:28 Comment(0)

Recommended topics

Hot tags