How does this regex find triangular numbers?
Asked Answered
H

1

44

Part of a series of educational regex articles, this is a gentle introduction to the concept of nested references.

The first few triangular numbers are:

 1 = 1
 3 = 1 + 2
 6 = 1 + 2 + 3
10 = 1 + 2 + 3 + 4
15 = 1 + 2 + 3 + 4 + 5

There are many ways to check if a number is triangular. There's this interesting technique that uses regular expressions as follows:

  • Given n, we first create a string of length n filled with the same character
  • We then match this string against the pattern ^(\1.|^.)+$
    • n is triangular if and only if this pattern matches the string

Here are some snippets to show that this works in several languages:

PHP (on ideone.com)

$r = '/^(\1.|^.)+$/';

foreach (range(0,50) as $n) {
  if (preg_match($r, str_repeat('o', $n))) {
     print("$n ");
  }
}

Java (on ideone.com)

for (int n = 0; n <= 50; n++) {
    String s = new String(new char[n]);
    if (s.matches("(\\1.|^.)+")) {
        System.out.print(n + " ");
    }
}

C# (on ideone.com)

Regex r = new Regex(@"^(\1.|^.)+$");

for (int n = 0; n <= 50; n++) {
    if (r.IsMatch("".PadLeft(n))) {
       Console.Write("{0} ", n);
    }
}

So this regex seems to work, but can someone explain how?

Similar questions

Hesson answered 2/9, 2010 at 13:43 Comment(9)
This series was started by permission of some in the community (meta.stackexchange.com/questions/62695/…). If the reception is good, I plan to continue to cover other more advanced as well as more basic features of regex.Hesson
If this is supposed to be educational and for the community, why isn't it community wiki?Athwart
I think the effort is worth some reputation. Please don't force them into community wiki. Who cares if poly goes from 44k to 50k, what's the difference?Gilbertegilbertian
I look forward to the series. Note that if you are interested in a brief exploration of the origins of regular expressions, I started writing a blog series on that. I never completed it unfortunately. blogs.msdn.com/b/ericlippert/archive/tags/regular+expressionsBronny
If this is a series - how about inventing a tag, that would simplify finding all 'articles' of this series !?Cassirer
@Andreas: you can discuss the logistics of this series on meta. I wouldn't mind creating a new tag, but I'd be careful about meta-tags (e.g. [beginner] was banished -- blog.stackoverflow.com/2010/08/the-death-of-meta-tags). As of now, I plan to update the discussion on meta with links to each articles in the series.Hesson
The second part of the series is now up: #3644766Hesson
Part 3: #3665381Hesson
This question has been added to the Stack Overflow Regular Expression FAQ, under "Advanced Regex-Fu".Infarct
H
37

Explanation

Here's a schematic breakdown of the pattern:

from beginning…
|         …to end
|         |
^(\1.|^.)+$
 \______/|___match
  group 1    one-or-more times

The (…) brackets define capturing group 1, and this group is matched repeatedly with +. This subpattern is anchored with ^ and $ to see if it can match the entire string.

Group 1 tries to match this|that alternates:

  • \1., that is, what group 1 matched (self reference!), plus one of "any" character,
  • or ^., that is, just "any" one character at the beginning

Note that in group 1, we have a reference to what group 1 matched! This is a nested/self reference, and is the main idea introduced in this example. Keep in mind that when a capturing group is repeated, generally it only keeps the last capture, so the self reference in this case essentially says:

"Try to match what I matched last time, plus one more. That's what I'll match this time."

Similar to a recursion, there has to be a "base case" with self references. At the first iteration of the +, group 1 had not captured anything yet (which is NOT the same as saying that it starts off with an empty string). Hence the second alternation is introduced, as a way to "initialize" group 1, which is that it's allowed to capture one character when it's at the beginning of the string.

So as it is repeated with +, group 1 first tries to match 1 character, then 2, then 3, then 4, etc. The sum of these numbers is a triangular number.


Further explorations

Note that for simplification, we used strings that consists of the same repeating character as our input. Now that we know how this pattern works, we can see that this pattern can also match strings like "1121231234", "aababc", etc.

Note also that if we find that n is a triangular number, i.e. n = 1 + 2 + … + k, the length of the string captured by group 1 at the end will be k.

Both of these points are shown in the following C# snippet (also seen on ideone.com):

Regex r = new Regex(@"^(\1.|^.)+$");

Console.WriteLine(r.IsMatch("aababc"));     // True
Console.WriteLine(r.IsMatch("1121231234")); // True
Console.WriteLine(r.IsMatch("iLoveRegEx")); // False

for (int n = 0; n <= 50; n++) {
    Match m = r.Match("".PadLeft(n));
    if (m.Success) {
       Console.WriteLine("{0} = sum(1..{1})", n, m.Groups[1].Length);
    }
}
// 1 = sum(1..1)
// 3 = sum(1..2)
// 6 = sum(1..3)
// 10 = sum(1..4)
// 15 = sum(1..5)
// 21 = sum(1..6)
// 28 = sum(1..7)
// 36 = sum(1..8)
// 45 = sum(1..9)

Flavor notes

Not all flavors support nested references. Always familiarize yourself with the quirks of the flavor that you're working with (and consequently, it almost always helps to provide this information whenever you're asking regex-related questions).

In most flavors, the standard regex matching mechanism tries to see if a pattern can match any part of the input string (possibly, but not necessarily, the entire input). This means that you should remember to always anchor your pattern with ^ and $ whenever necessary.

Java is slightly different in that String.matches, Pattern.matches and Matcher.matches attempt to match a pattern against the entire input string. This is why the anchors can be omitted in the above snippet.

Note that in other contexts, you may need to use \A and \Z anchors instead. For example, in multiline mode, ^ and $ match the beginning and end of each line in the input.

One last thing is that in .NET regex, you CAN actually get all the intermediate captures made by a repeated capturing group. In most flavors, you can't: all intermediate captures are lost and you only get to keep the last.

Related questions


Bonus material: Using regex to find power of twos!!!

With very slight modification, you can use the same techniques presented here to find power of twos.

Here's the basic mathematical property that you want to take advantage of:

  • 1 = 1
  • 2 = (1) + 1
  • 4 = (1+2) + 1
  • 8 = (1+2+4) + 1
  • 16 = (1+2+4+8) + 1
  • 32 = (1+2+4+8+16) + 1

The solution is given below (but do try to solve it yourself first!!!!)

(see on ideone.com in PHP, Java, and C#): ^(\1\1|^.)*.$

Hesson answered 2/9, 2010 at 13:43 Comment(4)
Note also that the fact that there are regular expression libraries that can match this pattern is, ironically, evidence that so-called "regular" expressions are not regular! The formal definition of a "regular language" is, roughly speaking, "a set of strings that are exactly those matched by some matching machine that has a bounded number of internal states", but knowing "what I matched before" requires in theory a potentially unbounded number of states.Bronny
@Eric - So I guess you would call them irregular expressions?Rationale
@Eric, @Chaos I asked a question about this incongruity on the theoretical computer-science StackExchange beta.Jawbone
Formal regular expressions have onlyDruid

© 2022 - 2024 — McMap. All rights reserved.