Efficient string/pattern matching in C++ (suffix array, trie, suffix tree?)

I'm looking for an efficient data structure for string/pattern matching against a really huge set of strings. I've read about tries, suffix trees and suffix arrays, but so far I couldn't find a ready-to-use implementation in C/C++ (and implementing one myself seems difficult and error-prone). I'm also still not sure whether suffix arrays are really what I'm looking for... I've tried libdivsufsort and esaxx, but couldn't figure out how to use them for my needs:

I want to match user input against a predefined set of strings that may contain wildcards (or even regular expressions). I have a huge list of predefined strings, e.g.

"WHAT IS *?" "WHAT IS XYZ?" "HOW MUCH *?" ...

Now I want to find the best-matching string (if there is one that matches at all). E.g. the user input WHAT IS XYZ? should find "WHAT IS XYZ?" rather than "WHAT IS *?", but "WHAT IS SOMETHING?" should find "WHAT IS *?" (assuming * is a wildcard for any number of characters).

Building the structure isn't time-critical (and the structure doesn't have to be super space-efficient), but the search shouldn't take too long. How can that be done easily? Any framework/library or code example is welcome.

Thanks

Darvon answered 13/11, 2012 at 16:41 Comment(6)
Have you looked at C++11's regex, or is that too slow for your purposes?Reticule
C++11 / boost::regex would be fine - but I still need some data structure to store the strings (if I got it right, I would have to run boost::regex/C++11 regex against every string - that could end up slow with too many strings, I guess)Darvon
How many patterns do you have? Thousands? Millions? What kind of wildcards are used? Just asterisk? What do you mean by the search shouldn't take too long? How long is acceptable? How many searches do you expect to be carried out in parallel? Will the pattern dictionary be updated frequently?Lichen
I need probably about ten thousand patterns. The essential wildcards are only the asterisk (matches everything from 1...n characters) and another wildcard that matches everything from 0...n characters (e.g. using a $-sign, HELLO$ should match "HELLO XYZ", but also simply "HELLO" - like the asterisk, just with another quantifier). Up to 5s would still be acceptable - but it would be best if the algorithm didn't scale linearly with the number of patterns (i.e. searching every pattern with complexity O(N)). Parallel searches are not necessary and the pattern dict will never have to be updated at runtime.Darvon
@Darvon Thanks for the details. Are the wildcards always at the end of the string, as seems to be the case in your examples?Lichen
@Lichen No, the wildcards could be anywhere in the string (i.e. also "* IS GREAT" or "I * ABOUT").Darvon

Here is a solution that, I believe, should work well if you have a very large number of patterns. For just 10k it may be overkill, and implementing it is a fair amount of work, but you may be interested nevertheless.

The basic idea is to create an inverted index that maps substrings of the patterns to pattern IDs. First, each pattern gets an ID:

1: what is *
2: where is *
3: do * need to
etc.

And then we create an inverted index. In the simplest case, we split the patterns into tokens and map each token to the list of pattern IDs it occurs in. We can be flexible in what we define as a token, but one method is to assume that every white-space separated word is one token. So here is the index:

what  -> 1
is    -> 1,2
where -> 2
do    -> 3
need  -> 3
to    -> 3

Then, when you get an input string from the user, you split that into tokens and look them up in the index. You combine all pattern IDs you get from the index. Example:

INPUT: what is something?

TOKENS:
   what      -> 1
   is        -> 1,2
   something -> n/a

You retrieve the pattern IDs for each token and put them into a temporary data structure that counts the frequency of each ID, for example a hash (e.g. a std::unordered_map<id_type,std::size_t>).

You then sort this by frequency to find out that rule 1 was found twice and rule 2 was found once.

You then apply the rules you found, in the order of frequency, to the input text. Here you use a regular expression library or something similar to generate matches. The most frequent rule has the most tokens in common with the input text, so it is likely to match well.

The overall advantage of this approach is that you don't need to apply all the rules to the input, only those that have at least one token in common with it. Even among those, you work in order of how many tokens each rule shares with the input, and once you have found a matching rule you can probably break off the rest of the matching procedure (or not - depending on whether you want all matching rules in each case, or just one that is a very good match).
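
As a rough illustration, here is a minimal, self-contained sketch of this preselection step (the helper names, the whitespace tokenization and the conversion of * to the regex ".+" are my assumptions for the example, not a fixed part of the approach):

#include <algorithm>
#include <iostream>
#include <regex>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Split a string into whitespace-separated tokens.
static std::vector<std::string> tokenize(const std::string& s) {
    std::istringstream in(s);
    std::vector<std::string> tokens;
    for (std::string t; in >> t; ) tokens.push_back(t);
    return tokens;
}

// Convert a wildcard pattern to a regex: '*' matches one or more characters,
// every other character is taken literally.
static std::regex toRegex(const std::string& pattern) {
    std::string re;
    for (char c : pattern) {
        if (c == '*') re += ".+";
        else if (std::string("\\^$.|?+()[]{}").find(c) != std::string::npos) { re += '\\'; re += c; }
        else re += c;
    }
    return std::regex(re, std::regex::icase);
}

int main() {
    std::vector<std::string> patterns = {"WHAT IS *?", "WHAT IS XYZ?", "HOW MUCH *?"};

    // Inverted index: token -> IDs of the patterns containing that token.
    std::unordered_map<std::string, std::vector<std::size_t>> index;
    for (std::size_t id = 0; id < patterns.size(); ++id)
        for (const std::string& tok : tokenize(patterns[id]))
            if (tok.find('*') == std::string::npos)      // ignore wildcard tokens
                index[tok].push_back(id);

    std::string input = "WHAT IS XYZ?";

    // Count how many tokens each pattern shares with the input.
    std::unordered_map<std::size_t, std::size_t> freq;
    for (const std::string& tok : tokenize(input)) {
        auto it = index.find(tok);
        if (it != index.end())
            for (std::size_t id : it->second) ++freq[id];
    }

    // Sort the candidate patterns by shared-token count, highest first.
    std::vector<std::pair<std::size_t, std::size_t>> candidates(freq.begin(), freq.end());
    std::sort(candidates.begin(), candidates.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });

    // Verify the candidates with a real regex match, most promising first.
    for (const auto& cand : candidates) {
        std::size_t id = cand.first;
        if (std::regex_match(input, toRegex(patterns[id]))) {
            std::cout << "matched pattern " << id << ": " << patterns[id] << "\n";
            break;
        }
    }
}

Note that for the input WHAT IS XYZ? the exact pattern shares more tokens with the input than the wildcard pattern, so it is verified first - which happens to give exactly the preference asked for in the question.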

Improvement: The above performs the rule preselection based on tokens. Instead, you could concatenate all the rules like this:

what is *||where is *||do * need to||...

Then you construct a suffix array of this concatenated string.

Then, given an input string, you match it against the suffix array to identify all substring-matches, including matches that are smaller than one token or span across multiple tokens. In the example above I assume that the wildcard symbols * and $ are included in the suffix array, although of course no part of an input string will ever match them. You can well exclude them from the suffix array or replace them with a dummy character.
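
A naive but self-contained sketch of that step (construction by plain std::sort instead of libdivsufsort, and a single prefix lookup over the suffix array; in a real implementation you would run such lookups for the relevant substrings of the input, and the names here are only illustrative):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

// Build a suffix array by plain sorting -- O(n^2 log n) in the worst case,
// which is fine for an offline build over ~10k short patterns. Use a library
// such as libdivsufsort for a fast construction on large texts.
std::vector<std::size_t> buildSuffixArray(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Return the start positions of all occurrences of 'q' in 'text',
// found by binary search over the suffix array.
std::vector<std::size_t> findAll(const std::string& text,
                                 const std::vector<std::size_t>& sa,
                                 const std::string& q) {
    auto sufLess = [&](std::size_t pos, const std::string& key) {
        return text.compare(pos, key.size(), key) < 0;   // suffix prefix < key
    };
    auto keyLess = [&](const std::string& key, std::size_t pos) {
        return text.compare(pos, key.size(), key) > 0;   // key < suffix prefix
    };
    auto lo = std::lower_bound(sa.begin(), sa.end(), q, sufLess);
    auto hi = std::upper_bound(sa.begin(), sa.end(), q, keyLess);
    return std::vector<std::size_t>(lo, hi);
}

int main() {
    std::string text = "what is *||where is *||do * need to";
    std::vector<std::size_t> sa = buildSuffixArray(text);

    for (std::size_t pos : findAll(text, sa, "is "))
        std::cout << "\"is \" occurs at offset " << pos << "\n";
}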

Once you determine the matches, you sort them by length. You also must map the match positions in the concatenated string to rule IDs. This is readily possible by maintaining an array of starting positions of rules relative to the concatenated string; there are also highly-optimised methods based on indexed bit vectors (I can elaborate on this if necessary).
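
A minimal sketch of that position-to-rule mapping, assuming the start offsets of the rules are kept in an ascending vector (the indexed-bit-vector variant is not shown, and the function name is just illustrative):

#include <algorithm>
#include <cstddef>
#include <vector>

// starts[i] is the offset of rule i in the concatenated pattern string,
// stored in ascending order.
std::size_t ruleIdForPosition(const std::vector<std::size_t>& starts, std::size_t pos) {
    // Find the first rule that starts strictly after pos; the rule containing
    // pos is the one immediately before it.
    auto it = std::upper_bound(starts.begin(), starts.end(), pos);
    return static_cast<std::size_t>(it - starts.begin()) - 1;
}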

Once you have the matching rule IDs, you do the same as in the inverted index case: Apply the matching rules, using standard regex matching (or similar).

Again, this approach is relatively complicated and makes sense only when you have a very large number of rules, and only if a token-based (or substring-based) lookup reduces the number of candidate rules significantly. From the example rules you gave I assume the latter is the case, but whether the number of rules you are dealing with (on the order of 10k) justifies this approach, I am not sure. It may make more sense if the total number of rules is in the 100ks or millions.

Lichen answered 22/11, 2012 at 15:38 Comment(1)
That's a very good approach. I will try to implement this, thank you!Darvon

Given your comment that the patterns do not need to be updated at runtime, I'm not sure you need a runtime structure at all.

I'd recommend using re2c or ragel to compile the patterns to code that will do the pattern matching.

Bounden answered 19/11, 2012 at 0:27 Comment(0)

You might want to look at flex. From the manual:

flex is a tool for generating scanners. A scanner is a program which recognizes lexical patterns in text. The flex program reads the given input files, or its standard input if no file names are given, for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code, called rules. flex generates as output a C source file, lex.yy.c by default, which defines a routine yylex(). This file can be compiled and linked with the flex runtime library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code.

Also this:

The main design goal of flex is that it generate high-performance scanners. It has been optimized for dealing well with large sets of rules.

For example, this scanner matches the three patterns in your post:

%%
"WHAT IS XYZ?"      puts("matched WHAT-IS-XYZ");
"WHAT IS ".*"?"     puts("matched WHAT-IS");
"HOW MUCH ".*"?"    puts("matched HOW-MUCH");

Flex works by generating a deterministic finite automaton (DFA). A DFA looks at each input character exactly once. There is no backtracking, even when matching wildcards. Run time is O(N), where N is the number of input characters. (More patterns will generate larger DFA tables, which will cause more cache misses, so there is some penalty for having more patterns. But that is true of any matching system I can think of.)

However, you will have to list your patterns in the proper order to match them correctly. Flex may tell you if there's a problem. For example, if you reverse the order of the WHAT-IS-XYZ and WHAT-IS patterns in the above scanner, flex will tell you:

:; flex matcher.l
matcher.l:3: warning, rule cannot be matched

If you can meet flex's requirements, flex should give you a very fast scanner.

Grist answered 22/11, 2012 at 9:10 Comment(0)

Many of the above answers won't work well in practice, or do not constitute the best known answer to your question: for instance, using scanner generators from compiler construction (re2c, lex, flex, ...) is likely to fail for large lexicons, as these tools were designed for programming languages, which typically have at most a few hundred built-in keywords.

There are two alternative approaches that I can recommend:

(1) choose a C++ trie class that maps strings to size_t IDs, which are used as indices into a second array that stores the payload data. Use a web search query such as:

c++ trie c++14 implementation fast "small footprint" site:github.com

to find suitable candidates - at the time of writing e.g. https://github.com/Tessil/hat-trie looks quite good (clean, portable, modern code, and there's a scientific paper to go with it, which adds credibility); or

(2) represent the mapping from the string lexicon to the payload data as a Finite State Transducer (FST), i.e. a finite automaton with output, as realized by FOMA, XFST or OpenFST.

The first option means you will have to test available libraries for ease of use and scalability. The second option means you will have to make yourself familiar with FSTs and existing implementation libraries and their licenses.

(The second approach is used by computational linguists to model large word-lists as well as governments to scan everything you write on the Internet, so it scales very well.)

Romanic answered 19/4, 2020 at 11:6 Comment(0)

Check out CritBit trees:

Example source code that's trivial to C++-ise if you really feel the need.

To find all matches you use the function critbit0_allprefixed

e.g.

// Find all strings that start with, or are equal to, "WHAT IS"
critbit0_allprefixed(tree, "WHAT IS", SomeCallback);

SomeCallback is called for each match.

Intercalate answered 13/11, 2012 at 17:6 Comment(4)
Can you explain how you'd use crit-bit trees for the kinds of wildcard searches the OP asked for?Lichen
Oh, so you'd use the patterns (such as WHAT IS *) as queries, and put the user input into the tree? Or am I misunderstanding this? Also, is it correct that this works only if the wildcard is at the end of the pattern?Lichen
@Lichen You walk the tree until the end of pattern. Then you walk until the leaf nodes, these are the matches. The function does it all for you, all you need to do is provide it a callback to call. You don't need wildcards, the example in my answer returns all strings in the tree that begin with, or are equal to, "WHAT IS".Intercalate
Ok, but what the OP described was a scenario where the patterns are static, and the input to match against comes from the user. So, WHAT IS * would be in the tree, and you'd do search(tree,"WHAT IS SOMETHING",callback). I can see that this can still be made to work (albeit critbit0_allprefixed wouldn't do it directly), but I can also see that this would get much more difficult if the wildcards are allowed somewhere in the middle or at the beginning of the patterns.Lichen

Have you tried a ternary search tree? Here is a C++ implementation: http://code.google.com/p/ternary-search-tree/

I do not have any experience of how slow it is to build a ternary tree but I do know that search is very fast.

[edit]

For matching a wildcard inside the tree in partialMatchSearch (disclaimer: this is just a suggestion and has not been tested in any way):

you could add * symbols to the tree and an if-clause like this at the beginning of the partialMatchSearch function:

  if ( ( *key != 0 ) && tree->splitChar == '*' )
  {
     this->partialMatchSearch( tree, key+1 );
  }

In other words, just recursively call partialMatchSearch with the same node but with the key advanced to the next character.
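
To show the idea in a self-contained way, here is a sketch using a plain std::map-based trie instead of the ternary-search-tree library above; '*' is stored as an ordinary edge and, per the question, is treated as matching one or more characters (all names here are illustrative):

#include <iostream>
#include <map>
#include <memory>
#include <string>

struct Node {
    std::map<char, std::unique_ptr<Node>> children;
    bool isPattern = false;   // true if a pattern ends at this node
};

// Insert a pattern (which may contain '*') as a plain path in the trie.
void insert(Node& root, const std::string& pattern) {
    Node* n = &root;
    for (char c : pattern) {
        auto& child = n->children[c];
        if (!child) child = std::make_unique<Node>();
        n = child.get();
    }
    n->isPattern = true;
}

// Return true if some stored pattern matches 'key'.
// A '*' edge consumes one or more characters of the key.
bool match(const Node* n, const char* key) {
    if (*key == '\0') return n->isPattern;

    // 1) Follow a literal edge for the current character.
    auto lit = n->children.find(*key);
    if (lit != n->children.end() && match(lit->second.get(), key + 1))
        return true;

    // 2) Follow a '*' edge: let it swallow one character, then try every
    //    possible remainder -- the same as recursing with key+1 repeatedly.
    auto wild = n->children.find('*');
    if (wild != n->children.end()) {
        for (const char* rest = key + 1; ; ++rest) {
            if (match(wild->second.get(), rest)) return true;
            if (*rest == '\0') break;
        }
    }
    return false;
}

int main() {
    Node root;
    insert(root, "WHAT IS *?");
    insert(root, "WHAT IS XYZ?");

    std::cout << match(&root, "WHAT IS SOMETHING?") << "\n";  // prints 1
    std::cout << match(&root, "WHAT ISN'T HERE") << "\n";     // prints 0
}

Trying the literal edge before the '*' edge makes the search prefer exact continuations; to report which pattern matched, you would store a pattern ID in the terminal nodes instead of a bool.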

Cousteau answered 20/11, 2012 at 8:21 Comment(4)
Looks great! Any idea how I could modify the 'partialMatchSearch' to allow wildcards in the tree nodes, instead of wildcards in the search term?Darvon
What kind of wildcards? Multiple characters or only single characters? For example, should the tree be of the "form" a.d and match "and", "add", or should the tree be of the "form" a*d and match "added"?Cousteau
The tree should be of "form" a*d and matches "added"Darvon
Please see my edit. I have not tested it but I think it should point you in the right direction.Cousteau

This isn't the question you're asking. You wanted something that is pre-cooked. But...

How complex does this need to be? If I were trying to do what you are asking, I would try something a bit more space intensive but much less time intensive.

I would (and have) start with a tree with the index 0 of the alphabet.

Then each child node would be a word (the dictionary).

Then each child node of that would be a potential string-segment match, exploiting the fact that, for example, "round" almost never DIRECTLY follows "square". You may say, "Put the round peg in the square hole," but the word that follows "round" is "peg". So the segment matches for "round" would be "round peg", "round cup", "round ball". I'd also strike out articles, because they usually mean nothing to the sentence. The previous paragraph would then translate to "each child node is a word".

I would add a heuristic, though, as even an advanced B-tree can get slow when you have that much data. I have seen them slow down to a minute or more when searching very large data sets.

That is assuming that you didn't want to actually use a database, which would probably be fastest of all unless you wanted to code in ASM.

Attach answered 22/11, 2012 at 12:32 Comment(0)

If space is no issue then you could do the following, off the top of my head.

Create a tree in which each node's children are the possible characters at that position in the string, so the current level is the index into the string. The first level of the tree is index level 0, or rather array index 0 in the string.

Each level is an index into the string, so the root node is index 0 and its children are index 1. Each node has N children, where N is the number of possible characters at this point in the string.

So, say the root had the set of possible characters [ "a", "b", "c" ]; it would have three children. Then say you wanted to find a possible match for the string "ab": you would recurse into the child for the route "a" and go from there.

If you reach the end of the string before getting to a leaf node, then the list of possible strings is the entire subtree of all children at your current node.

Hope that makes sense; it would look sort of like a Huffman tree, but each leaf would be a potential string to choose from.

Bennybenoit answered 13/11, 2012 at 16:52 Comment(1)
Huh, yeah, it really is; I should have read past the first sentence on the wiki then. For my description it just ends in a set of strings and not an integer.Bennybenoit

I would take Kernighan and Pike's advice and pick a reasonable algorithm and then brute-force it.

All your examples are looking for "longest prefix", which suggests a simple trie rather than a suffix-tree to me. Given you only need ~10k strings stored, you should be able to implement a char-trie in a couple of hours at most, using nested STL maps.

Unless memory is tight or performance truly is critical, this should be fine.
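
For what it's worth, a minimal sketch of such a char trie built from nested STL maps, with a longest-prefix lookup (no wildcard handling, and the names and the -1 sentinel are just illustrative choices):

#include <iostream>
#include <map>
#include <memory>
#include <string>

// A char trie built from nested std::maps; -1 means "no stored string ends here".
struct Trie {
    std::map<char, std::unique_ptr<Trie>> children;
    int patternId = -1;
};

void insert(Trie& root, const std::string& pattern, int id) {
    Trie* n = &root;
    for (char c : pattern) {
        auto& child = n->children[c];
        if (!child) child = std::make_unique<Trie>();
        n = child.get();
    }
    n->patternId = id;
}

// Walk the input and remember the last node where a stored string ended,
// i.e. the ID of the longest stored string that is a prefix of the input.
int longestPrefixMatch(const Trie& root, const std::string& input) {
    const Trie* n = &root;
    int best = n->patternId;
    for (char c : input) {
        auto it = n->children.find(c);
        if (it == n->children.end()) break;
        n = it->second.get();
        if (n->patternId != -1) best = n->patternId;
    }
    return best;  // -1 if nothing matched
}

int main() {
    Trie root;
    insert(root, "WHAT IS ", 0);
    insert(root, "WHAT IS XYZ?", 1);
    insert(root, "HOW MUCH ", 2);

    std::cout << longestPrefixMatch(root, "WHAT IS XYZ?") << "\n";       // prints 1
    std::cout << longestPrefixMatch(root, "WHAT IS SOMETHING?") << "\n"; // prints 0
}

The wildcard patterns from the question would still need extra handling on top of this, e.g. along the lines of the other answers.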

Tiercel answered 22/11, 2012 at 6:7 Comment(0)
