Generative regular expressions
Asked Answered
C

2

7

Typically in our work we use regular expressions in capture or match operations.

However, regular expressions can be used - manually at least - to generate legal sentences that match the regular expression. Of course, some regular expressions can match infinitely long sentences, e.g., the expression .+.

I have a problem that could be solved by using a regular expression sentence generating algorithm.

In pseudocode, it would operate something like this:

re = generate("foo(bar|baz)?", max_match = 100);  #Don't give me more than 100 results
assert re == ("foobar", "foobaz", "foo");

What algorithm would perform this for me?

Contredanse answered 17/11, 2010 at 20:13 Comment(3)
I know how to do this easily with a given search string and agiven pattern. Is that good enough? If so, tell me and I’ll show you. You are very smart to give it an upper bound, too. I can do that. But there are infinitely many strings otherwise, so I don’t know how you would do that, although Bart Miller’s “fuzz testing” might perhaps apply, wherein he generates random input to feed programs to see whether that causes them to fail spectacularly.Herr
@tchrist: I can generate random garbage quite nicely. I would like to do something just like the above example shows. My rummaging shows that the Perl module String::Random is very like Xeger, but doesn't support (|). Xeger itself just walks the automata that the regex describes. That appears to be a common case. I read that Haskell has a regexp module that generates, I'm digging on that atm.Contredanse
Couldn't find the haskell regexp module. :-/Contredanse
C
2

Microsoft has a SMT-based gratis (MSRL-licensed) "Rex" tool for this: http://research.microsoft.com/en-us/downloads/7f1d87be-f6d9-495d-a699-f12599cea030/

From the Introduction section of the "Rex: Symbolic Regular Expression Explorer" paper:

We translate (extended) regular expressions or regexes [5] into a symbolic representation of finite automata called SFAs. In an SFA, moves are labeled by formulas representing sets of characters rather than individual characters. An SFA A is translated into a set of (recursive) axioms that describe the acceptance condition for the strings accepted by A and build on the representation of strings as lists.

As the SMT solver can output all possible solutions within some size bound, this may be close to what you're looking for.

On a more statistical and less formal front, the Regexp::Genex module from CPAN can work as well: http://search.cpan.org/dist/Regexp-Genex/

You can use it with something like this:

#!/usr/bin/env perl
use Regexp::Genex ':all';
my $hits = 100;
my $re = qr/[a-z](123|456)/;
local $Regexp::Genex::DEFAULT_LEN = length $re;
my %seen;
while ((time - $^T) < 2) {
    @seen{strings($re)} = ();
    $Regexp::Genex::DEFAULT_LEN++;
}
print "$_\n" for (sort %seen)[0..$hits-1];

Adjust the time and sample size as needed. Hope this helps!

Carbrey answered 21/5, 2011 at 19:30 Comment(2)
I've just implemented another "output all possible solutions" tool at github.com/audreyt/regex-genex with the yices2 SMT solver. Might be useful as well. :-)Carbrey
The Rex project's research page seems to have moved to here: microsoft.com/en-us/research/project/…Karol
E
1

Take a look at Xeger (Google Code).

The Visual Studio Team System appears to have an inverse regex generator, too, but it doesn't look like the algorithm is open source.

Eelgrass answered 17/11, 2010 at 20:21 Comment(1)
Hmm. That's a random generator. I would like some sort of confidence that, for a finite language that the regex describes, I can enumerate all words in the language (yes, I could generate some sort of statistical measure, which might work, but I'm not good enough at advanced stats to be confident in my confidence interval...).Contredanse

© 2022 - 2024 — McMap. All rights reserved.