String generation with regex like criteria
Asked Answered
B

2

15

I wonder wether it is feasible to implement an optimal string generator Class meeting the following second thought requirements:


I don't feel comfortable with regular expression: I cannot come up with a starting piece of code but I just think of a naive implementation using a TList as a base class and use a filter (Regex) against "brute force" generated string.

What are the other optimal alternatives ?


  • Ordering: First by length (shortest first), then lexicographically.
  • Specification of the range of characters to be used in the generation: All printable or any possible combination of [A-Z], [a-z], numbers, special symbols, and eventually space (regex ?).
  • String Length bounded with a given Min/Max.
  • Space of search constrained with bounds: Start string an End string with possibility of filtering (regex ?)

Last Edit

To begin with, I rephrased the header using regex like instead of regex.

I am considering to revise the first requirement as it is an open door which may lead to untractable issue.

I need suggestions and help for the correct wording.

Second thought requirements edit done. Still open to suggestion for refinement.

Bohunk answered 5/4, 2012 at 17:25 Comment(12)
Not being intimately familiar with Delphi, I can only speak from a general viewpoint. To my mind, the best way to do this would be to parse a regex into a graph representing its equivalent state machine (wikipedia should be able to point you in the right direction there). From there, words can be generated by performing a depth-first traversal of said graph (keeping in mind that it is very likely cyclic, of course). The downside here is that we can't take advantage of a languages built in regex support.Actuary
+1: Very interesting. It's a truly generative approach as opposted to my highly unefficient generator/tester way. To my opinion there is no downside at all: built-in regex support are for evaluation and that mislead me in devising a solution. You can consider to migrate your comment as an answer, I find it acceptable. Thank you.Bohunk
This seems like a dumb way to use regular expressions. I think that a simplified generator expression system that generates a generator-object, that might be somewhat similar to some regex features (supporting [A-Z]. and [A-Z]* (to within a fixed limit) alone, would be sufficient.Wedgwood
@Warren P: Yes, it's very naive I must confess. I think that it's wise to consider your suggestion as a starting point given that I don't have hands on experience with Regex but later on I want to benefit from regex possibilities to its full extent.Bohunk
Regexes will just open doors you don't need opened. A simple pattern language that borrows regex syntax might have a slight advantage in readability, but as the semantics are different, -- note my comment shows that wildcards in Regexs need fixed limits to be useful in your context!Wedgwood
Wildcards do present an interesting issue for generating strings. For example, what is the first string (lexicographically) that can be generated by a*b? Consider that ab comes before b, and aab before ab, etc. Inductively, we can reason that the first string is an infinite repetition of a followed by a single b. Clearly we can't represent this, and so we have to set a fixed (i.e. arbitrary) limit on the expansion of wildcards.Actuary
Ok. My understandig of the comment thread sofar is that wilcards in regex introduce unneeded complexity which may even lead to an untractable issue. I consider to review my requirements and do an edit accordingly otherwise the post as it is will be deemed to be closed I am afraid: I need help for the correct wording. Thanks in advance.Bohunk
Should it be a subset of regex or an new pattern language regex like designed from scratch ?Bohunk
Regarding the order: As noted by jpm, the lex-order is not a well-ordering (there may be infinite decreasing subsequences). Even for well-ordered languages, the order is not in general indexable by natural numbers. The language b*a|c consists of the following elements in increasing order: a < ba < bba < bbba < ... < c, where the last element c cannot be indexed. Instead, I suggest using "shortlex order": First order by length (shortest first), then lexicographically (with strings of the same length).Opsonin
Depending on what regex flavour you would use, a finite automaton may not suffice to generate all matching strings. It suffices for real regular expressions (in the sense of computer science), but look-around and backreferences (provided in most extended regex engines) may prevent you from using finite automata.Opsonin
If you want to revise your first requirement, you first need to know the requirement. Which sets of strings are you looking for? Every regex describes a set of strings, but which of these sets are you really interested in? Once you know that and can describe it, we can try and help you with the formalities.Opsonin
I've done an edit to the requirements. Thank you for all the inputs made sofar.Bohunk
G
3

I'd do this by constructing the minimum Deterministic Finite Automaton for the language. If you are starting with a regex, this can be done automatically by Thompson's Construction followed by the Subset Construction and minimization. See this description for example.

With a DFA in hand, you can use something like this algorithm:

Let P = { < START, [""] > } be a set of pairs <State, list of strings>
for n = 0, 1, ... Max
  Let P' = {} be a new set 
  while P is not empty 
    Remove the pair <s, L> from P 
    For each transition s -- c --> t in alpha order of c
      if t is an accepting state,
          output l + c for each string l in L
      put <t, L + c> in P' (** i.e. append c to each string in L)
  end
  Set P = P'
end

Note that the step marked ** needs to be true set insertion, as duplicates can easily crop up.

This is a core algorithm. P can grow exponentially with output length, but this is just the price of tracking all possibilities for a future output string. The order/size/space constraints you mentioned can be ensured by maintaining sorted order in the lists L and by cutting off the search when resource limits are reached.

Edit

Here is a toy Java example where I've hard coded the DFA for simple binary floating point literals with optional minus sign. This uses a slightly different scheme than the pseudocode above to get strict sorted order of output and to accomodate character ranges.

import java.util.Comparator;
import java.util.TreeSet;

public class Test{

    public static class DFA {

        public static class Transition  {

            final int to;
            final char lo, hi; // Character range.

            public Transition(int to, char lo, char hi) {
                this.to = to;
                this.lo = lo;
                this.hi = hi;
            }

            public Transition(int to, char ch) {
                this(to, ch, ch);
            }
        }

        // transitions[i] is a vector of transitions from state i.
        final Transition [] [] transitions;

        // accepting[i] is true iff state i is accepting
        final boolean [] accepting;

        // Make a fresh immutable DFA.
        public DFA(Transition [] [] transitions, boolean [] accepting) {
            this.transitions = transitions;
            this.accepting = accepting;
        }

        // A pair is a DFA state number and the input string read to get there.
        private static class Pair {
            final int at;
            final String s;

            Pair(int at, String s) {
                this.at = at;
                this.s = s;
            }
        }

        // Compare pairs ignoring `at` states, since 
        // they are equal iff the strings are equal.
        private Comparator<Pair> emitOrder = new Comparator<Pair>() {
            @Override
            public int compare(Pair a, Pair b) {
                return a.s.compareTo(b.s);
            }
        };

        // Emit all strings accepted by the DFA of given max length.
        // Output is in sorted order.
        void emit(int maxLength) {
            TreeSet<Pair> pairs = new TreeSet<Pair>(emitOrder);
            pairs.add(new Pair(0, ""));
            for (int len = 0; len <= maxLength; ++len) {
                TreeSet<Pair> newPairs = new TreeSet<Pair>(emitOrder);
                while (!pairs.isEmpty()) {
                    Pair pair = pairs.pollFirst();
                    for (Transition x : transitions[pair.at]) {
                        for (char ch = x.lo; ch <= x.hi; ch++) {
                            String s = pair.s + ch;
                            if (newPairs.add(new Pair(x.to, s)) && accepting[x.to]) {
                                System.out.println(s);
                            }
                        }
                    }
                }
                pairs = newPairs;
            }
        }
    }

    // Emit with a little DFA for floating point numbers.
    public void run() {
        DFA.Transition [] [] transitions = {
            {   // From 0
                new DFA.Transition(1, '-'),
                new DFA.Transition(2, '.'),
                new DFA.Transition(3, '0', '1'),
            },
            {   // From 1
                new DFA.Transition(2, '.'),
                new DFA.Transition(3, '0', '1'),
            },
            {   // From 2
                new DFA.Transition(4, '0', '1'),
            },
            {   // From 3
                new DFA.Transition(3, '0', '1'),
                new DFA.Transition(4, '.'),
            },
            {   // From 4
                new DFA.Transition(4, '0', '1'),
            }  
        };
        boolean [] accepting = { false, false, false, true, true };
        new DFA(transitions, accepting).emit(4);
    }

    public static void main (String [] args) {
        new Test().run();
    }
}
Glabrescent answered 30/7, 2012 at 19:21 Comment(1)
I have just found this interesting PHP project on Github and can't help sharing it here: ReverseRegex.Bohunk
C
4

Old question, but no one has answered it, the bounty is still active and I already have a solution ready, so here is a possible answer:

I once wrote a little program that does this. It is however in C++/Qt (although I write almost all my programs in Delphi, that one is in C++), doesn't support Unicode and makes no order guarantees

It works as follow:

  1. All ? {} + * | () operators are expanded (to a maximal limit), so that only character classes and backreferences remain.

    e.g. [a-c]+|t*|([x-z]){2}foo\1|(a|b)(t|u) becomes [a-c]|[a-c][a-c]|[a-c][a-c][a-c]|[a-c][a-c][a-c][a-c]||t|tt|tt|ttt|ttt|([x-z][x-z])foo\1|at|au|bt|bu

    (the | in latter expression are just notation, the program keeps each alternative subregex in a list)

  2. Backreferences to multiple characters are replaced by backreferences to single characters.

    e.g. the expression above becomes [a-c]|[a-c][a-c]|[a-c][a-c][a-c]|[a-c][a-c][a-c][a-c]||t|tt|tt|ttt|ttt|([x-z])([x-z])foo\1\2|at|au|bt|bu

    Now each alternative subregex matches a fixed length string.

  3. For each of the alternatives, all combinations of picking characters from the classes are printed:

    e.g. the expression above becomes a|b|c|aa|ba|..|cc|aaa|baa|...|ccc|aaaa|...|cccc||t|tt|tt|ttt|ttt|xxfooxx|yxfooyx|...|zzfoozz|at|au|bt|bu

You probably can add shortlex order guarantees as follow:

  1. Sort the characters in the classes alphabetically

  2. Sort the alternatives obtained in step 2. above for length

    (there are exponentially many alternatives, but usually their count is negligible compared to the count of resulting strings )

  3. Sort/exchange the character classes and backreferences, so that every reference points backward

  4. Enumerate the possible strings for a single fixed-length alternative as before, but start at the last character class, instead at the first to get an alphabetically ordering.

    (this wouldn't work, if there were any backreferences pointing forward)

  5. If there are several alternatives of the same length, enumerate them in "parallel", compare their current strings and print the alphabetically lowest. (i.e. merge the already sorted lists for each alternative.)

    This can be optimized, e.g. by detecting distinct prefixes and safe character classes in the suffix which can be enumerated without affecting the ordering. (e.g. a[a-z]|b[a-z] has distinct prefixes and the [a-z] can be enumerated without any comparisons)

Clearheaded answered 25/7, 2012 at 17:35 Comment(0)
G
3

I'd do this by constructing the minimum Deterministic Finite Automaton for the language. If you are starting with a regex, this can be done automatically by Thompson's Construction followed by the Subset Construction and minimization. See this description for example.

With a DFA in hand, you can use something like this algorithm:

Let P = { < START, [""] > } be a set of pairs <State, list of strings>
for n = 0, 1, ... Max
  Let P' = {} be a new set 
  while P is not empty 
    Remove the pair <s, L> from P 
    For each transition s -- c --> t in alpha order of c
      if t is an accepting state,
          output l + c for each string l in L
      put <t, L + c> in P' (** i.e. append c to each string in L)
  end
  Set P = P'
end

Note that the step marked ** needs to be true set insertion, as duplicates can easily crop up.

This is a core algorithm. P can grow exponentially with output length, but this is just the price of tracking all possibilities for a future output string. The order/size/space constraints you mentioned can be ensured by maintaining sorted order in the lists L and by cutting off the search when resource limits are reached.

Edit

Here is a toy Java example where I've hard coded the DFA for simple binary floating point literals with optional minus sign. This uses a slightly different scheme than the pseudocode above to get strict sorted order of output and to accomodate character ranges.

import java.util.Comparator;
import java.util.TreeSet;

public class Test{

    public static class DFA {

        public static class Transition  {

            final int to;
            final char lo, hi; // Character range.

            public Transition(int to, char lo, char hi) {
                this.to = to;
                this.lo = lo;
                this.hi = hi;
            }

            public Transition(int to, char ch) {
                this(to, ch, ch);
            }
        }

        // transitions[i] is a vector of transitions from state i.
        final Transition [] [] transitions;

        // accepting[i] is true iff state i is accepting
        final boolean [] accepting;

        // Make a fresh immutable DFA.
        public DFA(Transition [] [] transitions, boolean [] accepting) {
            this.transitions = transitions;
            this.accepting = accepting;
        }

        // A pair is a DFA state number and the input string read to get there.
        private static class Pair {
            final int at;
            final String s;

            Pair(int at, String s) {
                this.at = at;
                this.s = s;
            }
        }

        // Compare pairs ignoring `at` states, since 
        // they are equal iff the strings are equal.
        private Comparator<Pair> emitOrder = new Comparator<Pair>() {
            @Override
            public int compare(Pair a, Pair b) {
                return a.s.compareTo(b.s);
            }
        };

        // Emit all strings accepted by the DFA of given max length.
        // Output is in sorted order.
        void emit(int maxLength) {
            TreeSet<Pair> pairs = new TreeSet<Pair>(emitOrder);
            pairs.add(new Pair(0, ""));
            for (int len = 0; len <= maxLength; ++len) {
                TreeSet<Pair> newPairs = new TreeSet<Pair>(emitOrder);
                while (!pairs.isEmpty()) {
                    Pair pair = pairs.pollFirst();
                    for (Transition x : transitions[pair.at]) {
                        for (char ch = x.lo; ch <= x.hi; ch++) {
                            String s = pair.s + ch;
                            if (newPairs.add(new Pair(x.to, s)) && accepting[x.to]) {
                                System.out.println(s);
                            }
                        }
                    }
                }
                pairs = newPairs;
            }
        }
    }

    // Emit with a little DFA for floating point numbers.
    public void run() {
        DFA.Transition [] [] transitions = {
            {   // From 0
                new DFA.Transition(1, '-'),
                new DFA.Transition(2, '.'),
                new DFA.Transition(3, '0', '1'),
            },
            {   // From 1
                new DFA.Transition(2, '.'),
                new DFA.Transition(3, '0', '1'),
            },
            {   // From 2
                new DFA.Transition(4, '0', '1'),
            },
            {   // From 3
                new DFA.Transition(3, '0', '1'),
                new DFA.Transition(4, '.'),
            },
            {   // From 4
                new DFA.Transition(4, '0', '1'),
            }  
        };
        boolean [] accepting = { false, false, false, true, true };
        new DFA(transitions, accepting).emit(4);
    }

    public static void main (String [] args) {
        new Test().run();
    }
}
Glabrescent answered 30/7, 2012 at 19:21 Comment(1)
I have just found this interesting PHP project on Github and can't help sharing it here: ReverseRegex.Bohunk

© 2022 - 2024 — McMap. All rights reserved.