a regular expression generator for number ranges
Asked Answered
C

9

18

I checked on the stackExchange description, and algorithm questions are one of the allowed topics. So here goes.

Given an input of a range, where begin and ending numbers have the same number of digits (say, 2, 3, or 4), I want to write code to generate a set of regular expressions which, when checked against a number in turn, tell me whether that number is in the original range.

For example: if the range is 145-387, then 146, 200, and 280 would all match one of the regular expressions generated, and 144, 390 (used to say 290), and 445 (used to say 345) would not.

I have been thinking the result would be a list of regular expressions like:

14[5-9]             // match 145-149
1[5-9]0-9]          // 150-199
2[0-9][0-9]         // 200-299
3[0-7][0-9]         // 300-379
38[0-7]             // 380-387

and then software checking the number would test to see if the 3-digit code being tested matched any of these.

So what's the best way to generate the set of expressions?

The latest (in a series) that I've come up with is to:

  1. determine the first digit at which the two range numbers differ (1145-1158, first different digit is the 3rd)
  2. for the different digits, determine if their first digits differ by more than one -- if so, the range for that gets its own regex (200-299 in our example)
  3. to get lower ranges: for each other digit: prefix by the first digit(s) from the beginning of the range, increment the digit by one, pad with 0s to the same length, and pair with a number that has 9 in the digit place and all the padded places. In our example, increment 4 to 5, pad to get 150, generate the regex to handle 150-199.
  4. to get higher ranges: for each other digit: prefix with first digit(s) from end of range, decrement digit by one, pad rest with 0s, pair with a number with 9s in all the padded 0 places and the decremented digit. In our example, the regex to handle 300-379.

Am I missing something? There are details even in the above that I'm glossing over, it seems like something that would benefit from an algorithmic sword slashing through the details. But the other things I've come up with are even messier than this.

Carboloy answered 4/11, 2015 at 1:9 Comment(13)
Regular expressions are not a panacea for every problem, and this might be one case where you would be better off just checking the range in your application code.Sailmaker
regular-expressions.info/numericranges.htmlGayla
@user3386109 yes that is what I meant. I'll edit and correct, thanks.Carboloy
@TimBiegeleisen If you'll read the post carefully, you may notice that I never said regular expressions were any kind of panacea, nor did I ask whether this was the best way to check ranges. The general problem includes a requirement for general checking against numbers, not just ranges but other things, that can be set in database values so that no code needs to change. Of course there are other ways to do it, but I'm asking about this one. "There was a student of computer science who had a problem, and he decide to solve it with regular expressions. Now he has 2 problems."Carboloy
@willywonka_dailyblah Thanks for the link; unfortunately the post does not deal with my problem, and spends most of its time on the problem of comparing strings of digits that are not equal in length (which my problem does not have). It points out one or two things that are useful, but I had already determined those. Still looking for that magic bullet!Carboloy
While I think this is doomed for failure, it may be good to define what you mean by number: positive integers only? Or are negative integers also allowed? What about floating point numbers? And, since this is computer code, hexadecimal or octal numbers?Gabriellagabrielle
Depends on what features of regular expressions are allowed. For instance, if look arounds are allowed, they may cut down the average number of steps in the algorithm that generates regex's as well as the number of regex's generatedGrange
such code is easy to write, just follow the same algorithm you'd do manually, create an expression that copes with numbers filling to closer 10 multiplier, then 100, then 1000... then each of full 1000 (or whatever biggest fits), then same thing but opposite direction. If all have same number of digits then problem is fairly easy to solve. Still, if regular expressions are the way to go, I doubt. If you deal with check format and than range problem -> use regular expressions to check format, then pars it to number and use numerical methods to check range.Winze
@sp00m that site gives me output, which I already have. Judging from the time it takes to generate the output, I gather they don't have a better algorithm than I already have. but thanks.Carboloy
Should that "345" read "435"? Because 345 is in the range [145, 387].Epistyle
I'd recommend a test-driven development approach for thisShoreline
presumably you know the allowed range in order to create the regex pattern for it - so why not just use "less than x and greater than y" logic?Kef
in case it helps, I created a node.js / javascript library that does this npmjs.com/package/to-regex-range. It will be used in the next release of micromatchRoy
B
17

Here's my solution and an algorithm with complexity O(log n) (n is the end of the range). I believe it is the simplest one here:

Basically, split your task into these steps:

  1. Gradually "weaken" the start of the range.
  2. Gradually "weaken" the end of the range.
  3. Merge those two.

By "weaken", I mean finding the end of range that can be represented by simple regex for this specific number, for example:

145 -> 149,150 -> 199,200 -> 999,1000 -> etc.

Here's a backward one, for the end of the range:

387 -> 380,379 -> 300,299 -> 0

Merging would be the process of noticing the overlap of 299->0 and 200->999 and combining those into 200->299.

In result, you would get this set of numbers (first list intact, second one inverted:

145, 149, 150, 199, 200, 299, 300, 379, 380, 387

Now, here is the funny part. Take the numbers in pairs, and convert them to ranges:

145-149, 150-199, 200-299, 300-379, 380-387

Or in regex:

14[5-9], 1[5-9][0-9], 2[0-9][0-9], 3[0-7][0-9], 38[0-7]

Here's how the code for the weakening would look like:

public static int next(int num) {
    //Convert to String for easier operations
    final char[] chars = String.valueOf(num).toCharArray();
    //Go through all digits backwards
    for (int i=chars.length-1; i>=0;i--) {
        //Skip the 0 changing it to 9. For example, for 190->199
        if (chars[i]=='0') {
            chars[i] = '9';
        } else { //If any other digit is encountered, change that to 9, for example, 195->199, or with both rules: 150->199
            chars[i] = '9';
            break;
        }
    }

    return Integer.parseInt(String.valueOf(chars));
}

//Same thing, but reversed. 387 -> 380, 379 -> 300, etc
public static int prev(int num) {
    final char[] chars = String.valueOf(num).toCharArray();
    for (int i=chars.length-1; i>=0;i--) {
        if (chars[i] == '9') {
            chars[i] = '0';
        } else {
            chars[i] = '0';
            break;
        }
    }

    return Integer.parseInt(String.valueOf(chars));
}

The rest is technical details and is easy to implement. Here's an implementation of this O(log n) algorithm: https://ideone.com/3SCvZf

Oh, and by the way, it works with other ranges too, for example for range 1-321654 the result is:

[1-9]
[1-9][0-9]
[1-9][0-9][0-9]
[1-9][0-9][0-9][0-9]
[1-9][0-9][0-9][0-9][0-9]
[1-2][0-9][0-9][0-9][0-9][0-9]
3[0-1][0-9][0-9][0-9][0-9]
320[0-9][0-9][0-9]
321[0-5][0-9][0-9]
3216[0-4][0-9]
32165[0-4]

And for 129-131 it's:

129
13[0-1]
Brainard answered 5/11, 2015 at 8:3 Comment(10)
I had finally arrived at this myself, and I agree it is the desirable solution. I had worked it out in pseudocode about to this extent, and was waiting until today so I could post actual Java for it. Thanks.Carboloy
Well, let me correct that -- now that I've studied your code further, I had arrived at this general algorithm and had worked out part of it in pseudocode. I haven't figured out yet how you're collapsing what would be multiple regular expressions in my code into single ones; I'll have to study that more. This is elegant; thanks.Carboloy
Basically, I'm moving from the beginning of the range replacing digits one by one from the end with zeroes, until the resulting number > end of range. Then I do the same thing backwards for the end of range, but now going backwards until the resulting number is < start. Then I reverse the second list and append it to first one. It's not difficult to prove that both generated sequences will have odd number of entries, and when combined - will produce full set of mini-ranges. I do not know if this explanation made it clearer, as it's very difficult to explain this non-standard algorithm.Brainard
There's a bug: 1765-1898 produces an 'extra' and incorrect regular expression of "1[8-7][0-9][0-9]". I think it's because rightBounds() uses the range start, am thinking it should use the end of the range already determined by leftBounds(). Am working up an alternative, will be back.Carboloy
You are right, in main method, the rightBounds method call should be changed to pairs.addAll(rightBounds(pairs.get(pairs.size()-1), end));. I've also updated the code at ideone link.Brainard
I don't know that that fixes it, and I can't get to ideone from work. The starting numbers 1699-1898 also give incorrect results with your original code. I'm going to try to separate out some pieces you combine so the code can step through the numbers it is generating in a way that is easier to read. Will be back, perhaps we should set up a chat room.Carboloy
Yeah, the implementation was a quick draft, here's some prettified version which seems to work well for all the cases: ideone.com/3SCvZf . It is also much easier to understand and debug (I guess).Brainard
Let us continue this discussion in chat.Carboloy
well, my version is almost there; the code for generating ranges based on strengthening the start number and that for generating ranges based on weakening the ending number work, but there is a middle range that sometimes needs to be filled in. That isn't working in general, still trying to conceptualize the three areas correctly. Chat is open.Carboloy
Great algorithm. I translated your solution in Python (3.5) here: ideone.com/QwWyAqStertor
C
6

I've finally arrived at the following. The overall idea is to start with the beginning of the range, produce a regular expression that will match from that up to but not including the next multiple of 10, then for hundreds, etc. until you have matched things up to the end of the range; then start with the end of the range and work downwards, replacing increasing numbers of digits with 0s to match against similar numbers of 9s, to match the specific end-of-range. Then generate one regular expression for the part of the range if they don't already cover it all.

Special note should be taken of bezmax's routine to convert two numbers to the regular expression that will match them - MUCH easier than dealing with strings or character arrays directly, I think.

Anyway, here it is:

package numbers;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/**
 * Has methods for generating regular expressions to match ranges of numbers.
 */
public class RangeRegexGenerator
{
  public static void main(String[] args)
  {
    RangeRegexGenerator rrg = new RangeRegexGenerator();

//    do
//    {
//      Scanner scanner = new Scanner(System.in);
//      System.out.println("enter start, <return>, then end and <return>");
//      int start = scanner.nextInt();
//      int end = scanner.nextInt();
//      System.out.println(String.format("for %d-%d", start, end));

      List<String> regexes = rrg.getRegex("0015", "0213");
      for (String s: regexes) { System.out.println(s); }
//    } 
//    while(true);
  }

  /**
   * Return a list of regular expressions that match the numbers
   * that fall within the range of the given numbers, inclusive.
   * Assumes the given strings are numbers of the the same length,
   * and 0-left-pads the resulting expressions, if necessary, to the
   * same length. 
   * @param begStr
   * @param endStr
   * @return
   */
  public List<String> getRegex(String begStr, String endStr)
  {
      int start = Integer.parseInt(begStr);
      int end   = Integer.parseInt(endStr);
      int stringLength = begStr.length();
      List<Integer> pairs = getRegexPairs(start, end);
      List<String> regexes = toRegex(pairs, stringLength);
      return regexes;
  }

  /**
   * Return a list of regular expressions that match the numbers
   * that fall within the range of the given numbers, inclusive.
   * @param beg
   * @param end
   * @return
   */
  public List<String> getRegex(int beg, int end)
  {
    List<Integer> pairs = getRegexPairs(beg, end);
    List<String> regexes = toRegex(pairs);
    return regexes;
  }

  /**
   * return the list of integers that are the paired integers
   * used to generate the regular expressions for the given
   * range. Each pair of integers in the list -- 0,1, then 2,3,
   * etc., represents a range for which a single regular expression
   * is generated.
   * @param start
   * @param end
   * @return
   */
  private List<Integer> getRegexPairs(int start, int end)
  {
      List<Integer> pairs = new ArrayList<>();

      ArrayList<Integer> leftPairs = new ArrayList<>();
      int middleStartPoint = fillLeftPairs(leftPairs, start, end);
      ArrayList<Integer> rightPairs = new ArrayList<>();
      int middleEndPoint = fillRightPairs(rightPairs, middleStartPoint, end);

      pairs.addAll(leftPairs);
      if (middleEndPoint > middleStartPoint)
      {
        pairs.add(middleStartPoint);
        pairs.add(middleEndPoint);
      }
      pairs.addAll(rightPairs);
      return pairs;
  }

  /**
   * print the given list of integer pairs - used for debugging.
   * @param list
   */
  @SuppressWarnings("unused")
  private void printPairList(List<Integer> list)
  {
    if (list.size() > 0)
    {
      System.out.print(String.format("%d-%d", list.get(0), list.get(1)));
      int i = 2;
      while (i < list.size())
      {
        System.out.print(String.format(", %d-%d", list.get(i), list.get(i + 1)));
        i = i + 2;
      }
      System.out.println();
    }
  }

  /**
   * return the regular expressions that match the ranges in the given
   * list of integers. The list is in the form firstRangeStart, firstRangeEnd, 
   * secondRangeStart, secondRangeEnd, etc.
   * @param pairs
   * @return
   */
  private List<String> toRegex(List<Integer> pairs)
  {
    return toRegex(pairs, 0);
  }

  /**
   * return the regular expressions that match the ranges in the given
   * list of integers. The list is in the form firstRangeStart, firstRangeEnd, 
   * secondRangeStart, secondRangeEnd, etc. Each regular expression is 0-left-padded,
   * if necessary, to match strings of the given width.
   * @param pairs
   * @param minWidth
   * @return
   */
  private List<String> toRegex(List<Integer> pairs, int minWidth)
  {
    List<String> list = new ArrayList<>();
    String numberWithWidth = String.format("%%0%dd", minWidth);
    for (Iterator<Integer> iterator = pairs.iterator(); iterator.hasNext();)
    {
      String start = String.format(numberWithWidth, iterator.next()); // String.valueOf(iterator.next());
      String end = String.format(numberWithWidth, iterator.next());

      list.add(toRegex(start, end));
    }
    return list;
  }

  /**
   * return a regular expression string that matches the range
   * with the given start and end strings.
   * @param start
   * @param end
   * @return
   */
  private String toRegex(String start, String end)
  {
    assert start.length() == end.length();

    StringBuilder result = new StringBuilder();

    for (int pos = 0; pos < start.length(); pos++)
    {
      if (start.charAt(pos) == end.charAt(pos))
      {
        result.append(start.charAt(pos));
      } else
      {
        result.append('[').append(start.charAt(pos)).append('-')
            .append(end.charAt(pos)).append(']');
      }
    }
    return result.toString();
  }

  /**
   * Return the integer at the end of the range that is not covered
   * by any pairs added to the list.
   * @param rightPairs
   * @param start
   * @param end
   * @return
   */
  private int fillRightPairs(List<Integer> rightPairs, int start, int end)
  {
    int firstBeginRange = end;    // the end of the range not covered by pairs
                                  // from this routine.
    int y = end;
    int x = getPreviousBeginRange(y);

    while (x >= start)
    {
      rightPairs.add(y);
      rightPairs.add(x);
      y = x - 1;
      firstBeginRange = y;
      x = getPreviousBeginRange(y);
    }
    Collections.reverse(rightPairs);
    return firstBeginRange;
  }

  /**
   * Return the integer at the start of the range that is not covered
   * by any pairs added to its list. 
   * @param leftInts
   * @param start
   * @param end
   * @return
   */
  private int fillLeftPairs(ArrayList<Integer> leftInts, int start, int end)
  {
    int x = start;
    int y = getNextLeftEndRange(x);

    while (y < end)
    {
      leftInts.add(x);
      leftInts.add(y);
      x = y + 1;
      y = getNextLeftEndRange(x);
    }
    return x;
  }

  /**
   * given a number, return the number altered such
   * that any 9s at the end of the number remain, and
   * one more 9 replaces the number before the other
   * 9s.
   * @param num
   * @return
   */
  private int getNextLeftEndRange(int num)
  {
    char[] chars = String.valueOf(num).toCharArray();
    for (int i = chars.length - 1; i >= 0; i--)
    {
      if (chars[i] == '0')
      {
        chars[i] = '9';
      } else
      {
        chars[i] = '9';
        break;
      }
    }

    return Integer.parseInt(String.valueOf(chars));
  }

  /**
   * given a number, return the number altered such that
   * any 9 at the end of the number is replaced by a 0,
   * and the number preceding any 9s is also replaced by
   * a 0.
   * @param num
   * @return
   */
  private int getPreviousBeginRange(int num)
  {
    char[] chars = String.valueOf(num).toCharArray();
    for (int i = chars.length - 1; i >= 0; i--)
    {
      if (chars[i] == '9')
      {
        chars[i] = '0';
      } else
      {
        chars[i] = '0';
        break;
      }
    }

    return Integer.parseInt(String.valueOf(chars));
  }
}

This one is correct as far as I've been able to test it; the one posted by bezmax did not quite work, though he had the right idea (that I also came up with) for an overall algorithm, and a major implementation detail or two that were helpful, so I'm leaving the 'answer' checkmark on his response.

I was a little surprised at the amount of interest this generated, though not as much as by just how complex the problem turned out to be.

Carboloy answered 5/11, 2015 at 21:4 Comment(6)
I have adapted my recursive solution in Python to match your/bezmax's implementation. The resulting alternative implementation of getRegexPairs needs less code and is easier to prove correct, I think. (see my second answer)Clarendon
for (String s: regexes) { System.out.print(s+"|"); }Higginbotham
There are many levels of complexity, almost impossible to fathom regexformat.com/scrn8/MegNrange.jpgSpecs
The set of examples in your link includes negative numbers (which are not mentioned in the original problem), and most of them have ranges where the start and end numbers in the range have different numbers of digits, which the original problem explicitly stated was NOT part of the input, and which makes the problem more difficult.Carboloy
That tool has a check box if it's floating point range or not. Integer ranges are found if the box isn't checked. Also, for floats, the output defaults to the maximum number of decimal places in either range min/max. The tecnique to accomplish this is done in stages which includes factoring. Just giving a heads up on the complexity required to do the full job..There might be other tools that can do this, but I haven't found one.Specs
So, you've linked to a tool that solves a problem more complicated than the one in the question. And are telling us how complicated it is.Carboloy
C
3

Here is a recursive solution in python, which works for an arbitrary range of positive numbers. The idea is to divide the range into three sub-ranges:

  • from start to the next multiple of 10 (if start is not already a multiple of 10)
  • from the last multiple of 10 to end (if end is not already a multiple of 10)
  • the range between these two multiples of 10 can be handled recursivle by taking off the last digit and adding the regular expression [0-9] to all generated regular expressions afterwards

The code below even optimizes ranges of single values like [1-1] to 1. The function to call is genRangeRegex (start is inclusive, end is exclusive):

def regexRangeDigits(start,stop):
  if start == stop:
    return str(start)
  return '[%d-%d]' % (start,stop)

# generate list of regular expressions for the number range [start,end[
def genRangeRegex(start, end):
  if start <= 0:
    raise ValueError('only ranges of positive numbers supported')

  print 'getting regex list for range [%d,%d[' % (start,end)
  if start >= end:
    return []

  digitsStart = str(start)
  digitsEnd   = str(end)
  lastDigitStart = start%10

  if start//10 == (end-1)//10: # integer division
    lastDigitStop = (end-1)%10
    regexAll = digitsStart[:-1] + regexRangeDigits(lastDigitStart,lastDigitStop)
    print '  regexAll   = %s' % regexAll
    return [regexAll]

  regexListStart = [] # at most one regular expression for going up to first multiple of 10
  if lastDigitStart != 0:
    regexStart = digitsStart[:-1] + regexRangeDigits(lastDigitStart,9)
    print '  regexStart = %s' % regexStart
    regexListStart.append(regexStart)

  regexListEnd = [] # at most one regular expression for going up from last multiple of 10
  lastDigitEnd = end%10
  if lastDigitEnd != 0:
    regexEnd = digitsEnd[:-1] + regexRangeDigits(0,lastDigitEnd-1)
    print '  regexEnd   = %s' % regexEnd
    regexListEnd.append(regexEnd)

  regexListMidTrunc = genRangeRegex((start+9)//10, end//10)
  regexListMid = [r+'[0-9]' for r in regexListMidTrunc]

  return regexListStart + regexListMid + regexListEnd

And here an example output how the function works:

>>> genRangeRegex(12,231)
getting regex list for range [12,231[
  regexStart = 1[2-9]
  regexEnd   = 230
getting regex list for range [2,23[
  regexStart = [2-9]
  regexEnd   = 2[0-2]
getting regex list for range [1,2[
  regexAll   = 1
['1[2-9]', '[2-9][0-9]', '1[0-9][0-9]', '2[0-2][0-9]', '230']
Clarendon answered 4/11, 2015 at 9:10 Comment(2)
Fails for genRangeRegex(129,131); (no output at all)Coeval
@Coeval when i execute the above code with Python 2.7.10 and call genRangeRegex(129,131) I get the output ['129', '130']; note that I designed my function to have an exclusive upper bound.Clarendon
C
3

You cannot cover your requirement with Character Groups only. Imagine the Range 129-131. The Pattern 1[2-3][1-9] would also match 139 which is out of range.

So in this example you need to change the last group to something else: 1[2-3](1|9). You can now find this effect as well for the tens and hundrets, leading to the problem that aapattern that basically represents each valid number as a fixed sequence of numbers is the only working solution. (if you don't want an algorithm that needs to track overflows in order to decide whether it should use [2-8] or (8,9,0,1,2))

if you anyway autogenerate the pattern - keep it simple:

128-132

can be written as (I left out the non-matching group addition ?: for better readability)

(128|129|130|131|132)

algorithm should be ovious, a for, an array, string concatenation and join.

That would already work as expected, but you can also perform some "optimization" on this if you like it more compact:

(128|129|130|131|132) <=>
1(28|29|30|31|32) <=>
1(2(8|9)|3(0|1|2))

more optimization

1(2([8-9])|3([0-2]))

Algorithms for the last steps are out there, look for factorization. An easy way would be to push all the numbers to a tree, depending on the character position:

1
  2
    8
    9
  3
    0
    1
    2

and finally iterate over the three and form the pattern 1(2(8|9)|3(0|1|2)). As a last step, replace anything of the pattern (a|(b|)*?c) with [a-c]

Same goes for 11-29:

11-29 <=>
(11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29) <=>   
(1(1|2|3|4|5|7|8|9)|2(1|2|3|4|5|7|8|9)) <=>
(1([1-9])|2([1-9]) 

as an addition you now can proceed with the factorization:

(1([1-9])|2([1-9]) <=>
(1|2)[1-9] <=>
[1-2][1-9]
Coeval answered 4/11, 2015 at 13:50 Comment(4)
As far as I understood from the requirement, 129-131 is a proper case and should produce a list of two regular expressions: "129", "13[0-1]".Brainard
@Brainard yes, it should be valid and is in the example. From the requirement i cannot derive, that 2 expressions should be generated?Coeval
yes, check this part of the original post: I have been thinking the result would be a list of regular expressions like: ..., notice the list partBrainard
@Brainard I have been thinking does not mean has to be.Coeval
C
2

[Hint: somehow the idea of applying recursion as presented in my first answer (using Python) did not reach the OP, so here it is again in Java. Note that for a recursive solution it is often easier to prove correctness.]

The key observation to use recursion is that ranges starting with a number ending in 0 and ending with a number ending in 9 are covered by digit patterns that all end in [0-9].

20-239 is covered by [2-9][0-9], 1[0-9][0-9], 2[0-3][0-9]

When taking off the last digit of start and end of the range the resulting range is covered by the same digit patterns, except for the missing trailing [0-9]:

20-239 is covered by [2-9][0-9], 1[0-9][0-9], 2[0-3][0-9]
2 -23  is covered by [2-9],      1[0-9],      2[0-3]

So when we are looking for the digit patterns that cover a range (e.g. 13-247), we split off a range before the first number ending in 0 and a range after the last number ending in 9 (note that these split off ranges can be empty), e.g.

13-247 = 13-19, 20-239, 240-247
20-247 =        20-239, 240-247
13-239 = 13-19, 20-239
20-239 =        20-239

The remaining range is handled recursively by taking off the last digits and appending [0-9] to all digit patterns of the reduced range.

When generating pairs start,end for the subranges that can be covered by one digit pattern (as done by bezmax and the OP), the subranges of the reduced range have to be "blown up" correspondingly.

The special cases when there is no number ending in 0 in the range or when there is no number ending in 9 in the range can only happen if start and end only differ at the last digit; in this case the whole range can be covered by one digit pattern.

So here is an alternative implementation of getRegexPairs based on this recursion principle:

private static List<Integer> getRegexPairs(int start, int end)
{
  List<Integer> pairs = new ArrayList<>();   
  if (start > end) return pairs; // empty range
  int firstEndingWith0 = 10*((start+9)/10); // first number ending with 0
  if (firstEndingWith0 > end) // not in range?
  {
    // start and end differ only at last digit
    pairs.add(start);
    pairs.add(end);
    return pairs;
  }

  if (start < firstEndingWith0) // start is not ending in 0
  {
    pairs.add(start);
    pairs.add(firstEndingWith0-1);
  }

  int lastEndingWith9 = 10*(end/10)-1; // last number in range ending with 9
  // all regex for the range [firstEndingWith0,lastEndingWith9] end with [0-9]
  List<Integer> pairsMiddle = getRegexPairs(firstEndingWith0/10, lastEndingWith9/10);
  for (int i=0; i<pairsMiddle.size(); i+=2)
  {
    // blow up each pair by adding all possibilities for appended digit
    pairs.add(pairsMiddle.get(i)  *10+0);
    pairs.add(pairsMiddle.get(i+1)*10+9);
  }

  if (lastEndingWith9 < end) // end is not ending in 9
  {
    pairs.add(lastEndingWith9+1);
    pairs.add(end);
  }

  return pairs;
}
Clarendon answered 8/11, 2015 at 18:8 Comment(0)
E
0

One option would be to (for a range [n, m]) generate the regexp n|n+1|...|m-1|m. However, I think you're after getting something more optimised. You can still do essentially the same, generate a FSM that matches each number using a distinct path through a state machine, then use any of the well-known FSM minimisation algorithms to generate a smaller machine, then turn that into a more condensed regular expression (since "regular expressions" without the Perl extensions are isomorphic to finite state machines).

Let's say we are looking at the range [107, 112]:

state1:
  1 -> state2
  * -> NotOK
state2:
  0 -> state2.0
  1 -> state2.1
  * -> NotOK
state2.0:
  7 -> OK
  8 -> OK
  9 -> OK
  * -> NotOK
state2.1:
  0 -> OK
  1 -> OK
  2 -> OK
  * -> NotOK

We can't really reduce this machine any further. We can see that state2.0 correspond to the RE [789] and 2.1 corresponds to [012]. We can then see that state2.0 is (0[789])|(1[012]) and the whole is 1(0[789])|(1[012]).

Further reading on DFA minimization can be found on Wikipedia (and pages linked from there).

Epistyle answered 4/11, 2015 at 13:35 Comment(0)
B
0

If you find regex pattern range between 5 and 300 which also support float; there is the best answer created by me ...

^0*(([5-9]([.][0-9]{1,2})?)|[1-9][0-9]{1}?([.][0-9]{1,2})?|[12][0-9][0-9]([.][0-9]{1,2})?|300([.]0{1,2})?)$

for 1 to 300 range

^0*([1-9][0-9]?([.][0-9]{1,2})?|[12][0-9][0-9]([.][0-9]{1,2})?|300([.]0{1,2})?)$
Bartle answered 10/1, 2018 at 9:3 Comment(0)
T
0

Bezmax's answer is close but doesn't quite solve the problem correctly. It has a few details incorrect I believe. I have fixed the issues and written the algorithm in c++. The main problem in Bezmax's algorithm is as follows:

The prev function should produce the following: 387 -> 380,379 -> 300,299 -> 100, 99->10, 9->0 Whereas Bezmax had: 387 -> 380,379 -> 300,299 -> 0

Bezmax had 299 "weakening" to 0 this could leave part of the range out in certain circumstances. Basically you want to weaken to the lowest number you can but never change the number of digits. The full solution is too much code to post here but here is the important parts. Hope this helps someone.

    // Find the next number that is advantageous for regular expressions.
    //
    // Starting at the right most decimal digit convert all zeros to nines. Upon
    // encountering the first non-zero convert it to a nine and stop. The output
    // always has the number of digits as the input.
    // examples: 100->999, 0->9, 5->9, 9->9, 14->19, 120->199, 10010->10099
    static int Next(int val)
    {
       assert(val >= 0);

       // keep track of how many nines to add to val.
       int addNines = 0;

       do {
          auto res = std::div(val, 10);
          val = res.quot;
          ++addNines;
          if (res.rem != 0) {
             break;
          }
       } while (val != 0);

       // add the nines
       for (int i = 0; i < addNines; ++i) {
          val = val * 10 + 9;
       }

       return val;
    }

    // Find the previous number that is advantageous for regular expressions.
    //
    // If the number is a single digit number convert it to zero and stop. Else...
    // Starting at the right most decimal digit convert all trailing 9's to 0's
    // unless the digit is the most significant digit - change that 9 to a 1. Upon
    // encounter with first non-nine digit convert it to a zero (or 1 if most
    // significant digit) and stop. The output always has the same number of digits
    // as the input.
    // examples: 0->0, 1->0, 29->10, 999->100, 10199->10000, 10->10, 399->100
    static int Prev(int val)
    {
       assert(val >= 0);

       // special case all single digit numbers reduce to 0
       if (val < 10) {
          return 0;
       }

       // keep track of how many zeros to add to val.
       int addZeros = 0;

       for (;;) {
          auto res = std::div(val, 10);
          val = res.quot;
          ++addZeros;
          if (res.rem != 9) {
             break;
          }

          if (val < 10) {
             val = 1;
             break;
          }
       }

       // add the zeros
       for (int i = 0; i < addZeros; ++i) {
          val *= 10;
       }

       return val;
    }

    // Create a vector of ranges that covers [start, end] that is advantageous for
    // regular expression creation. Must satisfy end>=start>=0.
    static std::vector<std::pair<int, int>> MakeRegexRangeVector(const int start,
                                                                 const int end)
    {
       assert(start <= end);
       assert(start >= 0);

       // keep track of the remaining portion of the range not yet placed into
       // the forward and reverse vectors.
       int remainingStart = start;
       int remainingEnd = end;

       std::vector<std::pair<int, int>> forward;
       while (remainingStart <= remainingEnd) {
          auto nextNum = Next(remainingStart);
          // is the next number within the range still needed.
          if (nextNum <= remainingEnd) {
             forward.emplace_back(remainingStart, nextNum);
             // increase remainingStart as portions of the numeric range are
             // transfered to the forward vector.
             remainingStart = nextNum + 1;
          } else {
             break;
          }
       }
       std::vector<std::pair<int, int>> reverse;
       while (remainingEnd >= remainingStart) {
          auto prevNum = Prev(remainingEnd);
          // is the previous number within the range still needed.
          if (prevNum >= remainingStart) {
             reverse.emplace_back(prevNum, remainingEnd);
             // reduce remainingEnd as portions of the numeric range are transfered
             // to the reverse vector.
             remainingEnd = prevNum - 1;
          } else {
             break;
          }
       }

       // is there any part of the range not accounted for in the forward and
       // reverse vectors?
       if (remainingStart <= remainingEnd) {
          // add the unaccounted for part - this is guaranteed to be expressable
          // as a single regex substring.
          forward.emplace_back(remainingStart, remainingEnd);
       }

       // Concatenate, in reverse order, the reverse vector to forward.
       forward.insert(forward.end(), reverse.rbegin(), reverse.rend());

       // Some sanity checks.
       // size must be non zero.
       assert(forward.size() > 0);

       // verify starting and ending points of the range
       assert(forward.front().first == start);
       assert(forward.back().second == end);

       return forward;
    }
Toilsome answered 21/6, 2018 at 14:13 Comment(0)
O
0

I recently had a requirement where I needed to develop a "Regular Expression Generator For Number Ranges" using Java and I have developed a full solution which is posted on IdeOne and class is called ideone.java -- Source Code on ideone.com. Code on ideone is heavily commented and it uses the same algorithm as posted by other users, so I will only highlight the changes and features added or issues fixed. I used part of solutions provided in answers by Bezmax (concept), arcy (overall code and idea of generating RegEx range as pairs) and coproc (using recursion for generating RegEx pairs, instead of method used by arcy). Thanks to all the three folks.

There are two public methods provided by Ideone.java, which implement the RegExGenerator logic - one accepts string numeric ranges and the other accepts integer number ranges.

generateRegEx(String begStr, String endStr)
generateRegEx(int beg, int end)

Both of the aforementioned public methods call generateRegExCommon method, which in turn calls the getRegExPairsRecursion method (same implementation as provided by coproc's answer), to generate a list containing numbers in pairs which represent lower and upper end of valid RegEx ranges. It then calls formatPairsToRegEx method to convert the RegEx pairs to actual RegEx ranges which may contain prefix of zeros to match the input length, if there were zeros in the input. If the input did not contain leading zeros or if integer input was used, no leading zeros would be added to the output ranges. The output is available as a list of Array of strings, where each entry/element is a valid Regular expression numeric range:

regexArray - String Array where each element is a valid regular expression range.
regexList  - List of String elements where each element is a valid regular expression range.

Figure below shows sequence diagram of a sample execution of Java code (ideone.java) with input and output at each stage. The example used has Input Range of numeric strings with leading zeros where Lower range value is "0006" and upper range is "0977". Output as shown in the figure below is:

000[6-9]
00[1-9][0-9]
0[1-8][0-9][0-9]
09[0-6][0-9]
097[0-7]

RegEx Number Range Generator

Code provides the following benefits:

  • Single comprehensive solution combining the best of all answers and algorithms.
  • Print statements which help in debugging the code and can be easily converted to any logging framework trace/debug statements, instead of System.out.
  • Fixed the problem where low range of 0 was not working with other answers.
Operculum answered 13/8, 2019 at 9:15 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.