Aho Corasick algorithm is used to solve set matching problem. It means we have a set of strings S, and here comes a long string L to check whether L contains any one in the previous set S.
An basic solution is using a trie tree, i.e. a prefix tree, please see Wikipedia. There are typically two steps to deal with the problem.
- Precompute the trie tree based on the string set S;
- Scan the string L to check.
The trie tree is easy to understand. It store prefix substrings with nodes starts from the root.
The Aho-Corasick algorithm is an extension of a trie tree, not far from the basic idea. Aho-Corasick algorithm adds a failed pointer to each node on a trie tree.
When fails, a trie tree will restart from the root (add the start index on L by 1), but Aho-Corasick algorithm will restart from the node D pointed by the failed pointer (add the start index on L by the depth of the node D).
Below is a C++ implementation of the Aho-Corasick algorithm. It contains some bugs.
http://www.komodia.com/aho-corasick
I fixed the bugs I found. And you can access my version here:
https://github.com/elfinhe/KomodiaAhoCorasick
f
,g
, ando
? This looks like a generic implementation of a state machine that updates the current state based on each new character of the input, then determines if the string ends in an accept state. – Alkyd