Combining multiple regular expressions into one automaton
Asked Answered
D

3

11

Let's say that I have a list of regular expressions (read from an external source - file, database etc). I want to check which of these regular expressions a string matches.

I can create iterate through all these regular expression and match them, but the list could be a huge one and this is a critical operation.

I can combine all these regular expressions into one (with | between them), but then the problem is that I can identify only the first matched regular expression, not all.

Another idea could be to create an automaton for all these regular expression and to mark the final states with, let's say, the indexes of the corresponding regular expression. I was looking at http://cs.au.dk/~amoeller/automaton/, a library which seems capable to work with regular expressions and automaton, but not sure it can be extended to solve my problem.

Do you have any other ideas?

To clarify some comments, I've added a code sample:

import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class PatternTest {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(a(?:b|c)a)|((?:a|b)ba)|(ab(?:a|c))");     
        Matcher m = p.matcher("aba");
        System.out.println(m.matches());
        System.out.println(m.groupCount());
        for (int i = 0, n = m.groupCount(); i < n; i++) {
            System.out.println(m.group(i));
        }
    }
}

will print out

true
3
aba
aba
null

As you can see only the first group is matched and I don't see a way to match the other two ones.

More findings - Using the automaton library above, the problem will reduce to the following: if you concatenate two or more automatons, how you can identify for a final state to which of the original automatons is corresponding?

Deraign answered 8/3, 2013 at 15:47 Comment(6)
Have you considered adding named groups to each of the |'ed expressions? You could check which ones match then.Enyo
Those sound like the options you have for Java. In Perl tho it would be easier. You can just alternate all the expressions, and at the end of every expression (called X) add for example (?{ $matched{X} = 1 })(?!). Which marks the expression X as matched, and then fails the match, allowing other expressions to match too. (To optimize it you can also put each expression in an atomic capturing group.)Sororate
@MichaelW: Yes, I considered this as well. The problem is that regexp in Java matches only the first group (named or unnamed).Deraign
@Qtax: The question was if I can do something like this in Java.Deraign
Even if this was possible it doesn't look like it would be a nicer design, and it might not even be faster than doing them separately.Altarpiece
If the regular expressions are implemented with automaton, then the algorithm complexity will depend on the depth of the automaton. Combining them will result in a complexity dependent on the max length of those regexps, unlike iterating through them which will result in the sum.Deraign
B
3

dk.brics.automaton does not directly support this, but you can generalize the representation of automata (and the relevant automata operations) to distinguish between different kinds of accept states. Start by adding an int field, for example, to the State class and use this field whenever 'accept' is set.

Bartell answered 9/3, 2013 at 15:24 Comment(0)
T
6

I implemented such a solution based on dk.brics.automaton, you can find it here. https://github.com/fulmicoton/multiregexp

Tenacious answered 20/10, 2013 at 17:14 Comment(0)
T
3

For a definitive answer (if there is one) we would need some more information, like:

  1. You say the list of regexes is huge; can you be more specific? Thousands? Millions? Billions and billions?

  2. Who wrote these regexes, and do they know what they're doing? Are the regexes thoroughly tested (for correctness and performance) before being added to the list?

  3. In your sample code you use the matches() method, which requires the regex to describe the whole string. It acts as if the regex were really
    \A(?:(a(?:b|c)a)|((?:a|b)ba)|(ab(?:a|c)))\z
    which matches "aba" but not "aaba" or "abaa". If you've used regexes in other tools or languages before coming to Java, this behavior might surprise you. Traditionally, a string has always been said to "match" a regex if the regex describes any substring within the string, even a zero-length substring. To get that behavior in Java, you have to use the Matcher's find() method.

  4. Are there any common factors you can pull out of all the regexes in the list, like minimum or maximum length, common substrings, or shared character subsets? For example, any string that matches one of your sample patterns must also match [abc]{3}. If there are, you might want to create filters based on them (maybe regex, maybe not) to run before the serious matching gets underway. (I wouldn't suggest this if you were using Perl, which is choc-a-bloc with optimizations like that already, but Java's not too proud to accept a little help. ☺)

But I feel pretty safe advising you to go with separate regexes rather than concatenating them all into one. The Frankenregex wouldn't necessarily perform better, and troubleshooting it would be a nightmare! You can pre-compile all the Pattern objects, and you can create a Matcher object ahead of time and reuse it for all the matches, like so:

m.reset(s).usePattern(p);

Here's a demo. I can't make any guarantees (you're still at the mercy of whoever wrote the regexes, for one thing), but if a solution is possible, I think this is the most promising approach.

Taoism answered 9/3, 2013 at 15:12 Comment(1)
Great answer. Perhaps because it was what I was thinking anyway, but the added demo was good and I learnt about the reset(x) functionality that I didn't consider before.Yalu
B
3

dk.brics.automaton does not directly support this, but you can generalize the representation of automata (and the relevant automata operations) to distinguish between different kinds of accept states. Start by adding an int field, for example, to the State class and use this field whenever 'accept' is set.

Bartell answered 9/3, 2013 at 15:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.