Aho-Corasick text matching on whole words?

Asked 21/1, 2013 at 18:12 Answered 11/10, 2020 at 22:47

Solved java algorithm full-text-search string-matching aho-corasick

I'm using Aho-Corasick text matching and wonder if it could be altered to match terms instead of characters. In other words, I want the the terms to be the basis of matching rather than the characters. As an example:

Search query: "He",

Sentence: "Hello world",

Aho-Corasick will match "he" to the sentence "hello world" ending at index 2, but I would prefer to have no match. So, I mean by "terms" words rather than characters.

Rusel answered 21/1, 2013 at 18:12 Comment(1)

What do you mean by "terms?" Can you give an example? – Fervid 21/1, 2013 at 18:46

One way to do this would be to use Aho-Corasick as usual, then do a filtering step where you eliminate all false positives. For example, every time you find a match, you can confirm that the next and previous characters in the input are non-letter characters like spaces or punctuation. That way, you get the speed of the Aho-Corasick lookup, but only consider matches that appear as whole words in the text.

Hope this helps!

Fervid answered 21/1, 2013 at 19:13 Comment(1)

That's what I thought but while checking the next letter is trivial, checking the previous one in a UTF-8 encoded string is actually fairly complex because you cannot just go one byte back but have to take into account the complete variable-length encoding of that prior character. – Keefe 25/7, 2023 at 16:27

One possibility would be to include the space character in your search term, possibly after pre-processing your input to convert all sorts of white space (space, line feed, carriage return, tab...) to the same space character.

Another possibility would be to think of the characters of your alphabet, as far as Aho-Corasick is concerned, as being words. Aho-Corasick will work just as quickly (if not more quickly) with an alphabet of size 2^32 where each word seen in the input text is encoded as a single character, as it will with an alphabet of size 2^8 where a character is just a single byte, as usual.

In either case you have to make a decision about what your pre-processing does with punctuation.

Kuomintang answered 21/1, 2013 at 20:5 Comment(0)

If you use the method onlyWholewords() it should have no results for your example above. For example:

Trie trie = Trie.builder()
             .onlyWholeWords()
             .addKeyword("He")
             .build();
Collection<Emit> emits = trie.parseText("Hello World");

emits in this case will be empty.

It will only reult in whole words only that are "he".

Although beware of characters that are not [a-z A-Z]. for example if you:

"He//Is"

It would pick up the "He" and ignores the "//"

Two things to add:

if you want to assert word boundary, you can use:

onlyWholeWordsWhiteSpaceSeparated() instead of

onlyWholeWords()
If you want to "white-list" some characters, this read might be helpful:

The word characters used are the default ones modified by the ones provided and boolean flags signal where characters are turned on and off. This is useful when you just want to turn off a specific character in the set of default characters. For example:

The word characters used are the default ones modified by the ones provided and boolean flags signal where characters are turned on and off. This is useful when you just want to turn off a specific character in the set of default characters. For example:

new WholeWordMatchSet(keywords, true, ['_', '='], [false, true])

Will produce a set where letters and digits and - and = are considered word characters, but not _.

Brandabrandais answered 11/10, 2020 at 22:47 Comment(0)

Very late to the party, but another option is to insert some symbols into the trie that represent the start and end of words. Then, during the matching stage, they must match accordingly. I'm about to try that approach myself.

Apelles answered 28/10, 2019 at 0:40 Comment(0)

Recommended topics

Hot tags