Aho Corasick algorithm
Asked Answered
B

2

6

I am not able to understand the below algorithm which is used for string pattern matching using Aho-Corasick alg.

Procedure AC(y,n,q0)
INPUT: y<-array of m bytes representing the text input
(SQL Query Statement)
n<-integer representing the text length
(SQL Query Length)
q0<-initial state (first character in pattern)
2: State <-q0
3: For i = 1 to n do
4: While g ( State, y[i] = = fail) do
5: State ← f (State)
6: End While
7: State ← g(State,.y[i])
8: If o(State) 􀂏 then
9: Output i
10: Else
11: Output
12: End If
13: End for
14: End Procedure
Breland answered 2/12, 2013 at 13:0 Comment(2)
What are f, g, and o? This looks like a generic implementation of a state machine that updates the current state based on each new character of the input, then determines if the string ends in an accept state.Alkyd
I've made hand-outs of my lecture on Aho-Corasick available on Slideshare, for anyone who might find them useful: slideshare.net/PekkaKilpelinen2/aho-corasicklectureFructificative
C
26

You probably won't gain a good understanding of the Aho-Corasick algorithm from reading a little bit of pseudocode. Unless you understand the state transition table, the algorithm will make no sense at all.

There's a decent explanation along with an animation at Aho-Corasick implementation and animation.

The original paper, Efficient String Matching: An Aid to Bibliographic Search(PDF), is well written and understandable, and the pseudocode examples are pretty easy to convert to working code. It'll take a little study, but you should have a good understanding after you read the paper, think about it a bit, and then read it again.

Creek answered 2/12, 2013 at 14:42 Comment(2)
what are aho-corasick algorithm advantages?Breland
The advantages are that it can match multiple strings in a single pass over a large body of text, whereas searching for the strings individually using something like the Boyer-Moore search would require multiple passes over the text. Read the Wikipedia article and the referenced links.Creek
E
7

Aho Corasick algorithm is used to solve set matching problem. It means we have a set of strings S, and here comes a long string L to check whether L contains any one in the previous set S.

An basic solution is using a trie tree, i.e. a prefix tree, please see Wikipedia. There are typically two steps to deal with the problem.

  1. Precompute the trie tree based on the string set S;
  2. Scan the string L to check.

The trie tree is easy to understand. It store prefix substrings with nodes starts from the root.

The Aho-Corasick algorithm is an extension of a trie tree, not far from the basic idea. Aho-Corasick algorithm adds a failed pointer to each node on a trie tree.

When fails, a trie tree will restart from the root (add the start index on L by 1), but Aho-Corasick algorithm will restart from the node D pointed by the failed pointer (add the start index on L by the depth of the node D).

Below is a C++ implementation of the Aho-Corasick algorithm. It contains some bugs. http://www.komodia.com/aho-corasick

I fixed the bugs I found. And you can access my version here: https://github.com/elfinhe/KomodiaAhoCorasick

Embowed answered 2/12, 2013 at 15:51 Comment(1)
if these algorithm was not there then what we had to do?Breland

© 2022 - 2024 — McMap. All rights reserved.