Create set of all possible matches for a given regex
Asked Answered
H

5

12

I'm wondering how to find a set of all matches to a given regex with a finite number of matches.

For example:

All of these example you can assume they start with ^ and end with $

`hello?` -> (hell, hello)
`[1-9][0-9]{0,3}` -> (1,2,3 ..., 9998, 9999)
`My (cat|dog) is awesome!` -> (My cat is awesome!, My dog is awesome!)
`1{1,10}` -> (1,11, ..., 111111111, 1111111111)
`1*` -> //error
`1+` -> //error
`(1|11){2}` -> (1,11,111,1111) //notice how it doesn't repeat any of the possibilities

I'd also be interested if there was a way of retrieving count a unique a solutions to the regex or if there is a way to determine if the regex has a finite solutions.

It would be nice if the algorithm could parse any regex, but a powerful enough subset of the regex would be fine.

I'm interested in a PHP solution to this problem, but other languages would also be fine.

EDIT:

I've learned in my Formal Theory class about DFA which can be used to implement regex (and other regular languages). If I could transform the regex into a DFA the solution seems fairly straight forward to me, but that transformation seems rather tricky to me.

EDIT 2:

Thanks for all the suggestions, see my post about the public github project I'm working on to "answer" this question.

Hyacinth answered 30/9, 2011 at 19:9 Comment(7)
Great question. I imagine something that could do this would be very useful for unit-testing.Interlunation
@drrcknlsn That was one of my thoughts, I was thinking of using it to generate a complete cache for a regex based routing system for a MVC.Hyacinth
You are assuming implicit anchors. It is easy to show all possible ways of matching a given string. For example, given "Hello world", the pattern /hel+o?/i matches Hello, Hell, and Hel. That is not the same as generation, though.Latinize
@Latinize All of these example you can assume they start with ^ and end with $Hyacinth
It's Java rather than PHP, but it'll get you started. #22615 Also take a look at #3166309Cutting
Which is it? language agnostic [i.e. general solutions, for every language] or php [solution can and should use php tools]. Also: are you assuming ascii or unicode? for unicode the regex ... could be problematic [too much possibilities]Bouillon
I know how to generate all possiblities given a concrete/finite input set, which I have provided in my solution. If that does you no good whatsoever, let me know.Latinize
H
0

I have begun working on a solution on Github. It can already lex most examples and give the solution set for finite regex.

It currently passes the following unit tests.

<?php

class RegexCompiler_Tests_MatchTest extends PHPUnit_Framework_TestCase
{

    function dataProviderForTestSimpleRead()
    {
        return array(
            array( "^ab$", array( "ab" ) ),
            array( "^(ab)$", array( "ab" ) ),
            array( "^(ab|ba)$", array( "ab", "ba" ) ),
            array( "^(ab|(b|c)a)$", array( "ab", "ba", "ca" ) ),
            array( "^(ab|ba){0,2}$", array( "", "ab", "ba", "abab", "abba", "baab", "baba" ) ),
            array( "^(ab|ba){1,2}$", array( "ab", "ba", "abab", "abba", "baab", "baba" ) ),
            array( "^(ab|ba){2}$", array( "abab", "abba", "baab", "baba" ) ),
            array( "^hello?$", array( "hell", "hello" ) ),
            array( "^(0|1){3}$", array( "000", "001", "010", "011", "100", "101", "110", "111" ) ),
            array( "^[1-9][0-9]{0,1}$", array_map( function( $input ) { return (string)$input; }, range( 1, 99 ) ) ),
            array( '^\n$', array( "\n" ) ),
            array( '^\r$', array( "\r" ) ),
            array( '^\t$', array( "\t" ) ),
            array( '^[\\\\\\]a\\-]$', array( "\\", "]", "a", "-" ) ), //the regex is actually '^[\\\]a\-]$' after PHP string parsing
            array( '^[\\n-\\r]$', array( chr( 10 ), chr( 11 ), chr( 12 ), chr( 13 ) ) ),
        );
    }

    /**
     * @dataProvider dataProviderForTestSimpleRead
     */

    function testSimpleRead( $regex_string, $expected_matches_array )
    {
        $lexer = new RegexCompiler_Lexer();
        $actualy_matches_array = $lexer->lex( $regex_string )->getMatches();
        sort( $actualy_matches_array );
        sort( $expected_matches_array );
        $this->assertSame( $expected_matches_array, $actualy_matches_array );
    }

}

?>

I would like to build an MatchIterator class that could handle infinite lists as well as one that would randomly generate matches from the regex. I'd also like to look into building regex from a match set as a way of optimizing lookups or compressing data.

Hyacinth answered 5/10, 2011 at 4:45 Comment(0)
F
3

The transformation from a regex to a DFA is pretty straightforward. The issue you'll run into there, though, is that the DFA generated can contain loops (e.g, for * or +), which will make it impossible to expand fully. Additionally, {n,n} can't be represented cleanly in a DFA, as a DFA has no "memory" of how many times it's looped.

What a solution to this problem will boil down to is building a function which tokenizes and parses a regular expression, then returns an array of all possible matches. Using recursion here will help you a lot.

A starting point, in pseudocode, might look like:

to GenerateSolutionsFor(regex):
    solutions = [""]
    for token in TokenizeRegex(regex):
        if token.isConstantString:
            for sol in solutions: sol.append(token.string)
        else if token.isLeftParen:
            subregex = get content until matching right paren
            subsols = GenerateSolutionsFor(subregex)
            for sol in solutions:
                for subsol in subsols:
                    sol.append(subsol)
        else if token.isVerticalBar:
            solutions.add(GenerateSolutionsFor(rest of the regex))
        else if token.isLeftBrace:
            ...
Facesaving answered 30/9, 2011 at 19:32 Comment(4)
This is pretty much what I had in mind, but I don't understand how TokenizeRegex would work. That seems to be the hardest part of the whole problem for me.Hyacinth
@KendallHopkins: given the context of the answer, the tokenizer is just a lexer built for each of the regex symbols. As long as you don't mind using regex in your regex analysis tool, any lex tool should work. It's only if you want to avoid using regex that it looks difficult. Anyway, might as well use what an actual implementation uses. See https://mcmap.net/q/475827/-regex-bnf-grammar as an example.Saccharin
A valid implementation of TokenizeRegex would just be to split the string up into its individual characters. Joining together ranges of plain characters which can be treated as a unit will boost performance (e.g, "hell", "o", "?"), but it's not crucial.Facesaving
@Kendall: Check out "Regular Expression Matching Can Be Simple and Fast". It explains how to translate (a much simpler) regex syntax into an NFA, which can be converted into a DFA; it's a great article, and might help out with thinking about it.Brana
R
2

I'm wondering how to find a set of all matches to a given regex with a finite number of matches.

Because you're only considering regular expressions denoting finite languages, you're actually considering a subset of the regular expressions over an alphabet. In particular, you're not dealing with regular expressions constructed using the Kleene star operator. This suggests a simple recursive algorithm for constructing the set of strings denoted by the regular expressions without Kleene star over an alphabet Σ.

LANG(a)     = {a} for all a ∈ Σ
LANG(x ∪ y) = LANG(x) ∪ LANG(y)
LANG(xy)    = {vw : v ∈ LANG(x) ∧ w ∈ LANG(y)}

Consider a regular expression such as a(b ∪ c)d. This is precisely the structure of your cats and dogs example. Executing the algorithm will correctly determine the language denoted by the regular expression:

LANG(a((b ∪ c)d)) = {xy : x ∈ LANG(a) ∧ y ∈ LANG((b ∪ c)d)}
                  = {xy : x ∈ {a} ∧ y ∈ {vw : v ∈ LANG(b ∪ c) ∧ w ∈ LANG{d}}}
                  = {ay : y ∈ {vw : v ∈ (LANG(b) ∪ LANG(c)) ∧ w ∈ {d}}}
                  = {ay : y ∈ {vd : v ∈ {b} ∪ {c}}
                  = {ay : y ∈ {vd : v ∈ {b,c}}}
                  = {ay : y ∈ {bd, cd}}
                  = {abd, acd}

You also ask whether there is an algorithm that determines whether a regular language is finite. The algorithm consists in constructing the deterministic finite automaton accepting the language, then determining whether the transition graph contains a walk from the start state to a final state containing a cycle. Note that the subset of regular expressions constructed without Kleene star denote finite languages. Because the union and concatenation of finite sets is finite, this follows by easy induction.

Remonstrate answered 4/10, 2011 at 19:28 Comment(2)
Good answer, but you might want to clarify what you mean by "checking for cycles". Certainly the graph of a DFA containing a cycle is necessary for the language it accepts to be infinite, but it is not sufficient.Sampan
You're right. I changed "checking for cycles" to "determining whether there is a walk from the start state to a final state containing a cycle." The latter is a sufficient condition, the former merely necessary. Thanks.Remonstrate
S
0

This probably doesn't answer all your questions / needs, but maybe it's a good starting point. I was searching for a solution for auto-generating data that matches a regexp a while ago, and i found this perl module Parse::RandGen, Parse::RandGen::RegExp, which worked quite impressivly good for my needs:

http://metacpan.org/pod/Parse::RandGen

Slain answered 4/10, 2011 at 20:17 Comment(0)
B
0

You might want to look at this Regex library, which parses a RegEx syntax (albeit a bit different from the perl standard) and can construct a DFA from it: http://www.brics.dk/automaton/

Babbie answered 5/10, 2011 at 1:4 Comment(0)
H
0

I have begun working on a solution on Github. It can already lex most examples and give the solution set for finite regex.

It currently passes the following unit tests.

<?php

class RegexCompiler_Tests_MatchTest extends PHPUnit_Framework_TestCase
{

    function dataProviderForTestSimpleRead()
    {
        return array(
            array( "^ab$", array( "ab" ) ),
            array( "^(ab)$", array( "ab" ) ),
            array( "^(ab|ba)$", array( "ab", "ba" ) ),
            array( "^(ab|(b|c)a)$", array( "ab", "ba", "ca" ) ),
            array( "^(ab|ba){0,2}$", array( "", "ab", "ba", "abab", "abba", "baab", "baba" ) ),
            array( "^(ab|ba){1,2}$", array( "ab", "ba", "abab", "abba", "baab", "baba" ) ),
            array( "^(ab|ba){2}$", array( "abab", "abba", "baab", "baba" ) ),
            array( "^hello?$", array( "hell", "hello" ) ),
            array( "^(0|1){3}$", array( "000", "001", "010", "011", "100", "101", "110", "111" ) ),
            array( "^[1-9][0-9]{0,1}$", array_map( function( $input ) { return (string)$input; }, range( 1, 99 ) ) ),
            array( '^\n$', array( "\n" ) ),
            array( '^\r$', array( "\r" ) ),
            array( '^\t$', array( "\t" ) ),
            array( '^[\\\\\\]a\\-]$', array( "\\", "]", "a", "-" ) ), //the regex is actually '^[\\\]a\-]$' after PHP string parsing
            array( '^[\\n-\\r]$', array( chr( 10 ), chr( 11 ), chr( 12 ), chr( 13 ) ) ),
        );
    }

    /**
     * @dataProvider dataProviderForTestSimpleRead
     */

    function testSimpleRead( $regex_string, $expected_matches_array )
    {
        $lexer = new RegexCompiler_Lexer();
        $actualy_matches_array = $lexer->lex( $regex_string )->getMatches();
        sort( $actualy_matches_array );
        sort( $expected_matches_array );
        $this->assertSame( $expected_matches_array, $actualy_matches_array );
    }

}

?>

I would like to build an MatchIterator class that could handle infinite lists as well as one that would randomly generate matches from the regex. I'd also like to look into building regex from a match set as a way of optimizing lookups or compressing data.

Hyacinth answered 5/10, 2011 at 4:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.