Let's say that I have a list of regular expressions (read from an external source - file, database etc). I want to check which of these regular expressions a string matches.
I can create iterate through all these regular expression and match them, but the list could be a huge one and this is a critical operation.
I can combine all these regular expressions into one (with | between them), but then the problem is that I can identify only the first matched regular expression, not all.
Another idea could be to create an automaton for all these regular expression and to mark the final states with, let's say, the indexes of the corresponding regular expression. I was looking at http://cs.au.dk/~amoeller/automaton/, a library which seems capable to work with regular expressions and automaton, but not sure it can be extended to solve my problem.
Do you have any other ideas?
To clarify some comments, I've added a code sample:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class PatternTest {
public static void main(String[] args) {
Pattern p = Pattern.compile("(a(?:b|c)a)|((?:a|b)ba)|(ab(?:a|c))");
Matcher m = p.matcher("aba");
System.out.println(m.matches());
System.out.println(m.groupCount());
for (int i = 0, n = m.groupCount(); i < n; i++) {
System.out.println(m.group(i));
}
}
}
will print out
true
3
aba
aba
null
As you can see only the first group is matched and I don't see a way to match the other two ones.
More findings - Using the automaton library above, the problem will reduce to the following: if you concatenate two or more automatons, how you can identify for a final state to which of the original automatons is corresponding?
X
) add for example(?{ $matched{X} = 1 })(?!)
. Which marks the expressionX
as matched, and then fails the match, allowing other expressions to match too. (To optimize it you can also put each expression in an atomic capturing group.) – Sororate