Algorithm to find multiple string matches

I'm looking for suggestions for an efficient algorithm for finding all matches in a large body of text. Terms to search for will be contained in a list and can have 1000+ possibilities. The search terms may be 1 or more words.

Obviously I could make multiple passes through the text comparing against each search term. Not too efficient.

I've thought about ordering the search terms and combining common sub-segments. That way I could eliminate large numbers of terms quickly. The language is C++ and I can use Boost.

An example of search terms could be a list of Fortune 500 company names.

Ideas?

Cony answered 15/7, 2010 at 23:55 Comment(2)
What exactly do you want? Iteration over all matches, a list of the matches (i.e. the values), just the knowledge that at least one term is in the text, ...?Morbid
I need to know which of the search terms exist in the document.Cony

Don't reinvent the wheel

This problem has been intensively researched. Curiously, the best algorithms for searching ONE pattern/string do not extrapolate easily to multi-string matching.

The "grep" family implements multi-string search very efficiently. If you can use it as an external program, do it.
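
For example, GNU grep can read the whole term list from a file and scan the document for all of them in a single pass as fixed strings; something along these lines (the file names are placeholders):

    grep -oFf terms.txt document.txt    # -F: fixed strings, -f: patterns from a file, -o: print each match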

In case you really need to implement the algorithm, I think the fastest way is to reproduce what agrep does (agrep excels in multi-string matching!). Here are the source and executable files.

And here you will find a paper describing the algorithms used, the theoretical background, and a lot of information and pointers about string matching.

A note of caution: multiple-string matching has been heavily researched by people like Knuth, Boyer, Moore, Baeza-Yates, and others. If you need a really fast algorithm, don't hesitate to stand on their broad shoulders. Don't reinvent the wheel.

Rotund answered 16/7, 2010 at 5:46 Comment(7)
Thanks for the response. I asked the question to avoid reinventing something I felt was probably well researched. I'll take a look at agrep src.Cony
@Dwight: If the 'large body of text' is static, I would suggest creating a trie/dawg, and I would think this would beat agrep's algorithms, which are more suitable for cases where the search terms are fixed but the text to search varies. In any case, since you accepted this answer, I suppose your large body of text is not really static. I suggest you modify your question to add that information.Spin
@Moron The OP says that "Terms to search for will be contained in a list and can have 1000+ possibilities." So I assumed that the text is not static, because if it were, and you have 1500 terms to search, you could just build a tree in advance and perform binary searches.Rotund
@Belisarius: Just trying to make it clear for future readers :-) From Dwight's comment in response to Georg on this question, it seems like the text("document") is static, so there is scope for confusion.Spin
@Moron Maybe I misunderstood the conditions. I think the best I can do now is to agree that this answer makes sense only for a "dynamic" document.Rotund
@beli: Not your fault: The question isn't very clear. Since OP accepted this answer, I presume you interpreted the question correctly. I only wanted OP to edit the question to clarify it. In case someone with a static document comes upon this question, our comments here and the edit to question (if made) will hopefully help...Spin
The text to be searched will not be static. The match list could be preprocessed if necessary.Cony

As in the case of single patterns, there are several algorithms for multiple-pattern matching, and you will have to find the one that fits your purpose best. The paper A fast algorithm for multi-pattern searching (archived copy) does a review of most of them, including Aho-Corasick (which is kind of the multi-pattern version of the Knuth-Morris-Pratt algorithm, with linear complexity) and Commentz-Walter (a combination of Boyer-Moore and Aho-Corasick), and introduces a new one, which uses ideas from Boyer-Moore for the task of matching multiple patterns.

An alternative hash-based algorithm not mentioned in that paper is the Rabin-Karp algorithm, which has a worse worst-case complexity than the others but compensates for it in practice by keeping the per-character work cheap via hashing. Which one is better ultimately depends on your use case. You may need to implement several of them and compare them in your application if you want to choose the fastest one.
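
To make the Aho-Corasick idea concrete, here is a minimal C++ sketch (not taken from the paper; the term list and sample text are made up, and a production version would add case folding and a denser transition table):

    // Minimal Aho-Corasick sketch: build a trie of the patterns, add failure
    // links with a BFS, then scan the text once and report every match.
    #include <iostream>
    #include <map>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    struct AhoCorasick {
        struct Node {
            std::map<char, int> next;   // trie edges
            int fail = 0;               // longest proper suffix that is also a trie path
            std::vector<int> out;       // ids of patterns ending at this node
        };
        std::vector<Node> nodes;
        AhoCorasick() : nodes(1) {}     // node 0 is the root

        void add(const std::string& pat, int id) {
            int cur = 0;
            for (char c : pat) {
                auto it = nodes[cur].next.find(c);
                if (it == nodes[cur].next.end()) {
                    nodes[cur].next[c] = static_cast<int>(nodes.size());
                    nodes.emplace_back();
                    cur = static_cast<int>(nodes.size()) - 1;
                } else {
                    cur = it->second;
                }
            }
            nodes[cur].out.push_back(id);
        }

        void build() {                  // BFS over the trie to set failure links
            std::queue<int> q;
            for (auto& kv : nodes[0].next) { nodes[kv.second].fail = 0; q.push(kv.second); }
            while (!q.empty()) {
                int u = q.front(); q.pop();
                for (auto& kv : nodes[u].next) {
                    char c = kv.first;
                    int v = kv.second;
                    int f = nodes[u].fail;
                    while (f != 0 && nodes[f].next.find(c) == nodes[f].next.end()) f = nodes[f].fail;
                    auto it = nodes[f].next.find(c);
                    nodes[v].fail = (it != nodes[f].next.end()) ? it->second : 0;
                    // a pattern ending at the failure node also ends here
                    const auto& fo = nodes[nodes[v].fail].out;
                    nodes[v].out.insert(nodes[v].out.end(), fo.begin(), fo.end());
                    q.push(v);
                }
            }
        }

        // Scan the text once; report (end index, pattern id) for every occurrence.
        std::vector<std::pair<int, int>> search(const std::string& text) const {
            std::vector<std::pair<int, int>> hits;
            int cur = 0;
            for (int i = 0; i < static_cast<int>(text.size()); ++i) {
                char c = text[i];
                while (cur != 0 && nodes[cur].next.find(c) == nodes[cur].next.end()) cur = nodes[cur].fail;
                auto it = nodes[cur].next.find(c);
                cur = (it != nodes[cur].next.end()) ? it->second : 0;
                for (int id : nodes[cur].out) hits.emplace_back(i, id);
            }
            return hits;
        }
    };

    int main() {
        std::vector<std::string> terms = {"Ford Motor", "Exxon Mobil", "Apple"};  // hypothetical terms
        AhoCorasick ac;
        for (int i = 0; i < static_cast<int>(terms.size()); ++i) ac.add(terms[i], i);
        ac.build();
        std::string text = "Reports mention Exxon Mobil and Ford Motor this quarter.";
        for (const auto& hit : ac.search(text))
            std::cout << '"' << terms[hit.second] << "\" ends at index " << hit.first << "\n";
    }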

Zinnes answered 28/5, 2013 at 12:43 Comment(2)
Link to the paper is broken.Satirical
Thanks for the notice. I've replaced it with an archived version.Zinnes

Assuming that the large body of text is static English text and you need to match whole words, you can try the following (you should really clarify in your question what exactly counts as a 'match', what kind of text you are looking at, etc.).

First preprocess the whole document into a Trie or a DAWG.

A trie/DAWG has the following property:

Given a trie/DAWG and a search term of length K, you can look up the data associated with the word (or tell that there is no match) in O(K) time.

Using a DAWG could save you more space compared to a trie: tries exploit the fact that many words share a common prefix, while DAWGs exploit common suffixes as well as common prefixes.

In the trie, also maintain the list of positions at which each word occurs. For example, if the text is

That is that and so it is.

The node for the last t in that will have the list {1,3} and the node for s in is will have the list {2,7} associated.

Now when you get a single word search term, you can walk the trie and get the list of matches for that word easily.

If you get a multiple word search term, you can do the following.

Walk the trie with the first word in the search term. Get the list of matches and insert them into a hashtable H1.

Now walk the trie with the second word in the search term. Get the list of matches. For each match position x, check if x-1 exists in the hashtable H1. If so, add x to a new hashtable H2.

Walk the trie with the third word and get its list of matches. For each match position y, check if y-1 exists in H2; if so, add y to a new hashtable H3.

Continue in the same fashion.

At the end you get a list of matches for the search phrase, which gives the positions of the last word of the phrase.

You could potentially optimize the phrase-matching step by keeping each position list sorted and doing a binary search: e.g., for each key k in H2, binary search for k+1 in the sorted position list of the third search word and add k+1 to H3 if you find it, and so on.
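
Here is a rough C++ sketch of the position-list idea; a std::unordered_map stands in for the trie/DAWG purely to keep the code short (that substitution is my assumption, not part of the answer), and case folding and punctuation handling are omitted:

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    using Index = std::unordered_map<std::string, std::vector<std::size_t>>;

    // Record the position of every word in the document (1-based, as in the answer).
    Index buildIndex(const std::string& text) {
        Index idx;
        std::istringstream in(text);
        std::string word;
        std::size_t pos = 0;
        while (in >> word) idx[word].push_back(++pos);
        return idx;
    }

    // Return the positions of the last word of the phrase, as described above.
    std::vector<std::size_t> findPhrase(const Index& idx, const std::vector<std::string>& phrase) {
        if (phrase.empty()) return {};
        auto it = idx.find(phrase[0]);
        if (it == idx.end()) return {};
        std::unordered_set<std::size_t> current(it->second.begin(), it->second.end());  // H1
        for (std::size_t i = 1; i < phrase.size(); ++i) {
            auto jt = idx.find(phrase[i]);
            if (jt == idx.end()) return {};
            std::unordered_set<std::size_t> next;                                       // H2, H3, ...
            for (std::size_t x : jt->second)
                if (current.count(x - 1)) next.insert(x);  // word i directly follows word i-1
            current.swap(next);
            if (current.empty()) return {};
        }
        return std::vector<std::size_t>(current.begin(), current.end());
    }

    int main() {
        Index idx = buildIndex("That is that and so it is");
        for (std::size_t p : findPhrase(idx, {"so", "it"}))
            std::cout << "phrase ends at word position " << p << "\n";  // prints 6
    }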

Spin answered 16/7, 2010 at 1:22 Comment(0)

An optimal solution for this problem is to use a suffix tree (or a suffix array). It's essentially a trie of all suffixes of a string. For a text of length O(N), this can be built in O(N).

Then all k occurrences of a string of length m can be answered optimally in O(m + k).

Suffix trees can also be used to efficiently find e.g. the longest palindrome, the longest common substring, the longest repeated substring, etc.

This is the typical data structure to use when analyzing DNA strings which can be millions/billions of bases long.
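
A full suffix tree (e.g. via Ukkonen's construction) is too long to show here, but since the answer also mentions suffix arrays, the query idea can be sketched with a naively built one; the construction below is nowhere near the O(N) bound mentioned above, and only the O(m log N) lookup is the point. The text and pattern are made up:

    // Naive suffix array standing in for a suffix tree: sort all suffix start
    // positions, then binary search for the range of suffixes that begin with
    // the pattern. Every element of that range is an occurrence.
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        std::string text = "banana";                 // the "large body of text"
        std::vector<std::size_t> sa(text.size());
        for (std::size_t i = 0; i < sa.size(); ++i) sa[i] = i;
        std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
            return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
        });

        std::string pattern = "ana";
        auto lo = std::lower_bound(sa.begin(), sa.end(), pattern,
            [&](std::size_t suf, const std::string& p) { return text.compare(suf, p.size(), p) < 0; });
        auto hi = std::upper_bound(sa.begin(), sa.end(), pattern,
            [&](const std::string& p, std::size_t suf) { return text.compare(suf, p.size(), p) > 0; });
        for (auto it = lo; it != hi; ++it)
            std::cout << pattern << " occurs at offset " << *it << "\n";  // offsets 1 and 3, in suffix order
    }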

See also

  • Wikipedia/Suffix tree
  • Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology (Dan Gusfield).
Corporeity answered 16/7, 2010 at 7:28 Comment(2)
Do you mean a text of length N?Figuration
@JAB: yes, I could've just said length N instead of O(N), but N is O(N), so it's still correct =)Corporeity

So you have lots of search terms and want to see if any of them are in the document?

Purely algorithmically, you could sort all your possibilities in alphabetical order, join them with pipes, and use them as a regular expression, provided the regex engine, given /ant|ape/, will reuse the already-matched a when "ant" fails instead of rescanning for "ape". If not, you could "precompile" the regex and "squish" the alternatives down to their minimum overlap: i.e., in the above case /a(nt|pe)/, and so on recursively for each letter.
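
As a rough illustration of the join-with-pipes idea (std::regex is used only to make it concrete; the terms are made up, and escaping of regex metacharacters is glossed over):

    #include <iostream>
    #include <regex>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::string> terms = {"ant", "ape", "antelope"};   // hypothetical search terms
        std::string pattern;
        for (const std::string& t : terms)
            pattern += (pattern.empty() ? "" : "|") + t;               // "ant|ape|antelope"
        std::regex re(pattern);

        std::string text = "an ape saw an antelope";
        for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
            std::cout << "matched \"" << it->str() << "\" at offset " << it->position() << "\n";
        // Note: alternatives are tried left to right, so "ant" is reported inside
        // "antelope" here; this is exactly why ordering/combining the terms matters.
    }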

However, doing the above is pretty much like putting all your search strings in a 26-ary tree (26 letters, more if you also need digits). Push your strings onto the tree, using one level of depth per character of length.

You can do this with your search terms to get a hyper-fast "does this word match anything in my list of search terms" lookup if your list of search terms is large.
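
A minimal sketch of that lookup tree, assuming lowercase letters only (a real term list would need digits, spaces, and case folding; the sample terms are made up):

    #include <array>
    #include <iostream>
    #include <memory>
    #include <string>

    struct TrieNode {
        std::array<std::unique_ptr<TrieNode>, 26> child{};  // one slot per letter
        bool terminal = false;                               // a search term ends here
    };

    void insert(TrieNode& root, const std::string& term) {
        TrieNode* cur = &root;
        for (char c : term) {
            int i = c - 'a';
            if (!cur->child[i]) cur->child[i] = std::make_unique<TrieNode>();
            cur = cur->child[i].get();
        }
        cur->terminal = true;
    }

    bool contains(const TrieNode& root, const std::string& word) {
        const TrieNode* cur = &root;
        for (char c : word) {
            cur = cur->child[c - 'a'].get();
            if (!cur) return false;
        }
        return cur->terminal;
    }

    int main() {
        TrieNode root;
        insert(root, "ant");
        insert(root, "ape");
        std::cout << contains(root, "ape") << " " << contains(root, "apex") << "\n";  // prints 1 0
    }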

You could theoretically do the reverse as well -- pack your document into the tree and then use the search terms on it -- if your document is static and the search terms change a lot.

Depends on how much optimization you need...

Lycurgus answered 16/7, 2010 at 0:4 Comment(0)

Are the search terms words that you are looking for, or can they be full sentences too?

If it's only words, then I would suggest building a red-black tree from all the words, and then searching for each word in the tree.

If they could be sentences, then it could get a lot more complex... (?)
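
For the words-only case, a minimal sketch using std::set, which is typically implemented as a red-black tree under the hood (the terms, document text, and whitespace-only tokenization are all simplifying assumptions):

    #include <iostream>
    #include <set>
    #include <sstream>
    #include <string>

    int main() {
        // Hypothetical search terms (single words only).
        std::set<std::string> terms = {"apple", "exxon", "ford"};

        std::string document = "quarterly results from ford and apple";
        std::istringstream in(document);
        std::string word;
        std::set<std::string> found;
        while (in >> word)
            if (terms.count(word)) found.insert(word);   // O(log n) lookup per document word

        for (const std::string& w : found) std::cout << w << " appears in the document\n";
    }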

Wormhole answered 16/7, 2010 at 0:3 Comment(1)
No need to build your own red-black tree. A std::multimap would be sufficient.Wellbeing
