Aho-Corasick and Proper Substrings
Asked Answered
T

3

7

I'm trying to understand the aho-corasick string match algorithm. Suppose that our patterns are abcd and bc. We end up a tree like this

       []
       /\
    [a]..[b]
    /  :  |
   [b].: [c]
    |     :
   [c].....
    |
   [d]

The dotted line show the failure function.

Now supposed that we feed in the string abcd. This will follow the tree and detect the match "abcd," however, as far as I can tell the match bc will not be reported. Am I misunderstanding the algorithm?

Thain answered 22/3, 2011 at 16:4 Comment(0)
A
3

Artem's answer is correct, but perhaps not very clear. Basically what you need to do is: every time you arrive at a new node, examine the whole path starting from this node and consisting of failure links and find matches on this path. (This doesn't change your current position). In your example you would examine the paths b->b (no matches found) and c->c (the match bc is found).

Note that for efficiency reasons you need to cache the value of 'next match' at each node. That is, if you examine the path starting at node u and find a matching node v after several steps, remember the value next_match[u] = v so that next time when you have to examine this path, you could make a single jump straight to v.

Aculeus answered 22/3, 2011 at 19:36 Comment(12)
Unless I'm misundertanding, this is not the Aho-Corasick algorithm. What you're describing is essentially a NFA.Modular
@spinning_plate How is it NFA if there is only one current node at each step?Aculeus
"Every time you arrive at a new node, examine the whole path starting from this node and consisting of failure links and find matches on this path" - You're traversing up the failure paths while maintaining the same position to be continued later. This produces the same output if you had evaluated the trie like a NFA. Can you provide a reference to an Aho-Corasick implementation that evaluates the failure links when there is no failure?Modular
I can see how to fix the algorithm. But I don't see either documentation of the problem or any fix in the versions of Aho-Corasick that I've looked at.Thain
@spinning_plate I still don't see where you got NFA from. If you mean FSM, then yes, Aho-Corasick can be viewed as a kind of FSM, just like Knut-Morris-Pratt. What's your point? I can't link to an implementation, but you can read the analysis at the wikipedia page. (In particular note this: "In general, more than one dictionary suffix link may need to be followed, as more than one dictionary entry may end at a given character in the input.")Aculeus
Well, here's an implementation: scarff.id.au/blog/2010/aho-corasick-string-matching-in-haskell (There is a link to the full source code there)Aculeus
Further explain what I mean by NFA/FSM, one way to approach this algorithm is keep a set of all 'alive' states (which always includes the empty string). At each letter, see if there is a transition from an 'alive' state to another state, if so replace it with that state. Remove the previous state, etc. Print out the state any time an accept state is alive.Modular
@Aculeus - can you explain how this implementation is different from the one I referenced? (sorry, not fluent in haskell)Modular
@spinning_plate Yes, that's what Aho-Corasick does. (There is only one 'alive' state at each point, though.) And I haven't read your link, may be it's the same.Aculeus
@Aculeus - The implemention of the NFA is different, though. In the NFA version, (going above) we get (a)->(b left),(b right)->(c left, cright) output c-> (d) output d. Every implementation of Aho I've looked at has the output of just what the OP said, not bc as well. See my psuedocode as well.Modular
@spinning_plate Well then what you saw were wrong implementations because they didn't solve the problem. See also: github.com/jmhsieh/aho-corasick/tree/… github.com/aurelian/ruby-ahocorasick/blob/… (These are slightly different because they precompute output states.)Aculeus
You're right. Thanks for the links, those let me figure out what I was doing wrong.Modular
S
3

You have to mark nodes as really_final if there is a string ended in this node. In your example such nodes are "abcd" and "bc". After it you have to calc final states for nodes: node is final if it's really_final or node by failure function is final. So, "abcd", "bc" and "abc" will be final.

In other words - node is final if some pattern is ended here or some final node is reacheable from current node walking by failure links.

Scalenus answered 22/3, 2011 at 18:22 Comment(0)
A
3

Artem's answer is correct, but perhaps not very clear. Basically what you need to do is: every time you arrive at a new node, examine the whole path starting from this node and consisting of failure links and find matches on this path. (This doesn't change your current position). In your example you would examine the paths b->b (no matches found) and c->c (the match bc is found).

Note that for efficiency reasons you need to cache the value of 'next match' at each node. That is, if you examine the path starting at node u and find a matching node v after several steps, remember the value next_match[u] = v so that next time when you have to examine this path, you could make a single jump straight to v.

Aculeus answered 22/3, 2011 at 19:36 Comment(12)
Unless I'm misundertanding, this is not the Aho-Corasick algorithm. What you're describing is essentially a NFA.Modular
@spinning_plate How is it NFA if there is only one current node at each step?Aculeus
"Every time you arrive at a new node, examine the whole path starting from this node and consisting of failure links and find matches on this path" - You're traversing up the failure paths while maintaining the same position to be continued later. This produces the same output if you had evaluated the trie like a NFA. Can you provide a reference to an Aho-Corasick implementation that evaluates the failure links when there is no failure?Modular
I can see how to fix the algorithm. But I don't see either documentation of the problem or any fix in the versions of Aho-Corasick that I've looked at.Thain
@spinning_plate I still don't see where you got NFA from. If you mean FSM, then yes, Aho-Corasick can be viewed as a kind of FSM, just like Knut-Morris-Pratt. What's your point? I can't link to an implementation, but you can read the analysis at the wikipedia page. (In particular note this: "In general, more than one dictionary suffix link may need to be followed, as more than one dictionary entry may end at a given character in the input.")Aculeus
Well, here's an implementation: scarff.id.au/blog/2010/aho-corasick-string-matching-in-haskell (There is a link to the full source code there)Aculeus
Further explain what I mean by NFA/FSM, one way to approach this algorithm is keep a set of all 'alive' states (which always includes the empty string). At each letter, see if there is a transition from an 'alive' state to another state, if so replace it with that state. Remove the previous state, etc. Print out the state any time an accept state is alive.Modular
@Aculeus - can you explain how this implementation is different from the one I referenced? (sorry, not fluent in haskell)Modular
@spinning_plate Yes, that's what Aho-Corasick does. (There is only one 'alive' state at each point, though.) And I haven't read your link, may be it's the same.Aculeus
@Aculeus - The implemention of the NFA is different, though. In the NFA version, (going above) we get (a)->(b left),(b right)->(c left, cright) output c-> (d) output d. Every implementation of Aho I've looked at has the output of just what the OP said, not bc as well. See my psuedocode as well.Modular
@spinning_plate Well then what you saw were wrong implementations because they didn't solve the problem. See also: github.com/jmhsieh/aho-corasick/tree/… github.com/aurelian/ruby-ahocorasick/blob/… (These are slightly different because they precompute output states.)Aculeus
You're right. Thanks for the links, those let me figure out what I was doing wrong.Modular
H
0

Part of setting up the AhoCorasick tree is setting up the pointers that tell you where to go in the tree if the next character is not a match. E.g. if you follow the sequence abcq in the tree as you have drawn it, you need to jump from the abc position to the bc position to see if there is a q below bc. You can use this pass to set up another set of pointers to tell it to signal a match on bcd after signalling a match on abcd and so on.

In writing it, I found the source of sgrep at http://www.cs.helsinki.fi/~jjaakkol/sgrep.html very helpful. As far as I can remember, sgrep does a LOT more than just AhoCorasick, but has an AhoCorsick implementation as part of it.

Haight answered 22/3, 2011 at 18:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.