Here is a solution that, I believe, should work well if you have a very large number of patterns. For just 10k it may be overkill, and implementing it is a fair amount of work, but you may be interested nevertheless.
The basic idea is to create an inverted index that maps substrings of the patterns to pattern IDs. First, each pattern gets an ID:
1: what is *
2: where is *
3: do * need to
etc.
And then we create an inverted index. In the simplest case, we split the patterns into tokens and map each token to the list of pattern IDs it occurs in. We can be flexible in what we define as a token, but one method is to assume that every white-space separated word is one token. So here is the index:
what -> 1
is -> 1,2
where -> 2
do -> 3
need -> 3
to -> 3
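As a minimal sketch of what building this could look like in C++ (the type and function names are my own, not from any library), treating every whitespace-separated word as a token and skipping the wildcard symbol:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using id_type = std::size_t;

struct PatternIndex {
    std::vector<std::string> patterns;                            // ID -> pattern text
    std::unordered_map<std::string, std::vector<id_type>> index;  // token -> pattern IDs
};

PatternIndex build_index(const std::vector<std::string>& patterns) {
    PatternIndex pi;
    pi.patterns = patterns;
    for (id_type id = 0; id < patterns.size(); ++id) {
        std::istringstream tokens(patterns[id]);
        std::string token;
        while (tokens >> token) {
            if (token == "*") continue;        // wildcards are useless as index keys
            pi.index[token].push_back(id);     // (a real implementation might deduplicate IDs per token)
        }
    }
    return pi;
}
```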
Then, when you get an input string from the user, you split that into tokens and look them up in the index. You combine all pattern IDs you get from the index. Example:
INPUT: what is something?
TOKENS:
what -> 1
is -> 1,2
something -> n/a
You retrieve the pattern IDs for each token and put them into a temporary data structure that counts the frequency of each ID, for example a hash (e.g. a std::unordered_map<id_type,std::size_t>).
You then sort this by frequency to find out that rule 1 was found twice and rule 2 was found once.
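Continuing the sketch from above (again, all names are mine), the lookup and counting step might look like this:

```cpp
#include <algorithm>
#include <utility>

// Tokenize the input, look each token up in the index, count how often each
// pattern ID appears, and return the IDs ordered by that count (descending).
std::vector<std::pair<id_type, std::size_t>>
rank_candidates(const PatternIndex& pi, const std::string& input) {
    std::unordered_map<id_type, std::size_t> freq;  // pattern ID -> shared-token count
    std::istringstream tokens(input);
    std::string token;
    while (tokens >> token) {
        auto it = pi.index.find(token);
        if (it == pi.index.end()) continue;         // token occurs in no pattern
        for (id_type id : it->second) ++freq[id];
    }
    std::vector<std::pair<id_type, std::size_t>> ranked(freq.begin(), freq.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return ranked;
}
```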
You then apply the rules you found, in the order of frequency, to the input text. Here you use a regular expression library or something similar to generate matches. The most frequent rule has the most tokens in common with the input text, so it is likely to match well.
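How exactly a rule is applied depends on your rule syntax; as one possible sketch, assuming the only special symbol is the * wildcard, you could translate each candidate into a std::regex and stop at the first one that matches:

```cpp
#include <regex>

// Translate a wildcard pattern to a regex: '*' becomes ".*", everything else
// that is special in regex syntax gets escaped.
std::regex wildcard_to_regex(const std::string& pattern) {
    std::string re;
    for (char c : pattern) {
        if (c == '*')
            re += ".*";
        else if (std::string("\\^$.|?+()[]{}").find(c) != std::string::npos)
            re += std::string("\\") + c;
        else
            re += c;
    }
    return std::regex(re);
}

// Try the ranked candidates in order; return the first matching rule ID,
// or pi.patterns.size() if none matches.
id_type first_match(const PatternIndex& pi, const std::string& input,
                    const std::vector<std::pair<id_type, std::size_t>>& ranked) {
    for (const auto& candidate : ranked)
        if (std::regex_search(input, wildcard_to_regex(pi.patterns[candidate.first])))
            return candidate.first;
    return pi.patterns.size();
}
```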
The overall advantage of the approach is that you need not apply all the rules to the input, but only those that have at least one token in common with it, and even among those you proceed in the order of how many tokens each rule shares with the input. Once you have found a matching rule, you can probably break off the rest of the matching procedure (or not, depending on whether you want all matching rules in each case or just one that is a very good match).
Improvement

The above performs the rule preselection based on tokens. Instead, you could concatenate all the rules like this:
what is *||where is *||do * need to||...
Then you construct a suffix array of this concatenated string.
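As a sketch, a naive suffix-array construction looks like this; for large inputs you would use a proper O(n log n) or linear-time algorithm (e.g. SA-IS) or an existing library, but the resulting structure is the same:

```cpp
#include <numeric>

// Sort all suffix start positions of `text` by the suffixes they start.
// Naive O(n^2 log n) comparison-based construction, for illustration only.
std::vector<std::size_t> build_suffix_array(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}
```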
Then, given an input string, you match it against the suffix array to identify all substring matches, including matches that are smaller than one token or span multiple tokens. In the example above I assume that the wildcard symbols * and $ are included in the suffix array, although of course no part of an input string will ever match them. You could just as well exclude them from the suffix array or replace them with a dummy character.
Once you determine the matches, you sort them by length. You also must map the match positions in the concatenated string to rule IDs. This is readily possible by maintaining an array of starting positions of rules relative to the concatenated string; there are also highly-optimised methods based on indexed bit vectors (I can elaborate on this if necessary).
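With an array of starting positions, that mapping is a single binary search (a sketch; rule_starts is assumed to hold the start offset of each rule within the concatenated string, in ascending order):

```cpp
// Map a match position in the concatenated string to the rule containing it:
// the containing rule is the one with the largest start offset <= match_pos.
id_type rule_id_at(const std::vector<std::size_t>& rule_starts,
                   std::size_t match_pos) {
    auto it = std::upper_bound(rule_starts.begin(), rule_starts.end(), match_pos);
    return static_cast<id_type>(it - rule_starts.begin()) - 1;
}
```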
Once you have the matching rule IDs, you do the same as in the inverted index case: Apply the matching rules, using standard regex matching (or similar).
Again, this approach is relatively complicated and makes sense only when you have a very large number of rules, and only if there is a good chance that a token-based (or substring-based) lookup reduces the number of candidate rules significantly. From the example rules you gave I assume the latter is the case, but whether the number of rules you are dealing with (on the order of 10k) justifies this approach, I am not sure. It may make more sense if the total number of rules is in the 100ks or millions.