How can I use lookbehind in a C# Regex in order to skip matches of repeated prefix patterns?

Example - I'm trying to have the expression match all the b characters following any number of a characters:

Regex expression = new Regex("(?<=a).*");

foreach (Match result in expression.Matches("aaabbbb"))
  MessageBox.Show(result.Value);

This returns aabbbb, because the lookbehind matches only a single a. How can I make it match all the a characters at the beginning?

I've tried

Regex expression = new Regex("(?<=a+).*");

and

Regex expression = new Regex("(?<=a)+.*");

with no results...

What I'm expecting is bbbb.

Vanmeter answered 1/10, 2010 at 13:39 Comment(1)
What's your expected result?Kaslik
W
8

Are you looking for a repeated capturing group?

(.)\1*

This will return two matches.

Given:

aaabbbb

This will result in:

aaa
bbbb

This:

(?<=(.))(?!\1).*

This uses the above principle: the lookbehind first captures the previous character into a backreference, and the negative lookahead then asserts that the next character is not that same character.

That matches:

bbbb
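
A minimal C# sketch of both patterns above, assuming a console application. The class and method names (BackreferenceDemo, AllMatches, FirstMatch) are illustrative and not part of the answer. Only the first match of the second pattern is printed, since a trailing `.*` can also yield an empty match at the end of the input.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Returns every match value of pattern in input, in order.
    public static string[] AllMatches(string input, string pattern) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    // Returns only the first match value (empty string if nothing matches).
    public static string FirstMatch(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    public static void Main()
    {
        // Runs of one repeated character: prints "aaa, bbbb"
        Console.WriteLine(string.Join(", ", AllMatches("aaabbbb", @"(.)\1*")));

        // First position whose previous character differs from the next one: prints "bbbb"
        Console.WriteLine(FirstMatch("aaabbbb", @"(?<=(.))(?!\1).*"));
    }
}
```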
Whop answered 1/10, 2010 at 13:43 Comment(2)
I need the lookbehind group to match all the a chars. That is, the actual match is bbbb, as the group of repeated a should be ignored.Vanmeter
@luvieere: I have made that change.Whop
V
5

I figured it out eventually:

Regex expression = new Regex("(?<=a+)[^a]+");

foreach (Match result in expression.Matches(@"aaabbbb"))
   MessageBox.Show(result.Value);

I must not allow the a characters to be matched by the non-lookbehind group. This way, the expression will only match those b repetitions that follow a repetitions.

Matching aaabbbb yields bbbb and matching aaabbbbcccbbbbaaaaaabbzzabbb results in bbbbcccbbbb, bbzz and bbb.
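
The results above can be reproduced without a UI. This is a small console sketch, assuming Console output instead of MessageBox; the names LookbehindDemo and RunsAfterA are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Collects every run of non-"a" characters that directly follows an "a".
    public static List<string> RunsAfterA(string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, "(?<=a+)[^a]+"))
            results.Add(m.Value);
        return results;
    }

    public static void Main()
    {
        // prints "bbbb"
        Console.WriteLine(string.Join(", ", RunsAfterA("aaabbbb")));

        // prints "bbbbcccbbbb, bbzz, bbb"
        Console.WriteLine(string.Join(", ", RunsAfterA("aaabbbbcccbbbbaaaaaabbzzabbb")));
    }
}
```

Because `[^a]+` can never start on an a, the simpler `(?<=a)[^a]+` should produce the same matches for these inputs; the `+` inside the lookbehind is harmless but not required.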

Vanmeter answered 1/10, 2010 at 14:50 Comment(0)
G
1

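The look-behind appears to skip only one "a" because it is zero-width: it merely asserts that an "a" precedes the match position, without consuming or capturing it. The first position where that holds is right after the first "a", and .* then matches everything from there on, including the remaining "a" characters.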
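
To see where the question's patterns actually start matching, one can print Match.Index next to the value. This is a hedged sketch; ZeroWidthDemo and Describe are hypothetical names chosen for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ZeroWidthDemo
{
    // Returns "index:value" for the first match of pattern in input.
    public static string Describe(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? $"{m.Index}:{m.Value}" : "no match";
    }

    public static void Main()
    {
        // Both lookbehinds are first satisfied at index 1, right after the first "a".
        Console.WriteLine(Describe("aaabbbb", "(?<=a).*"));   // prints "1:aabbbb"
        Console.WriteLine(Describe("aaabbbb", "(?<=a+).*"));  // prints "1:aabbbb"
    }
}
```

Quantifying the lookbehind does not move the starting position, because the engine takes the leftmost position at which the assertion holds.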

Would this pattern work for you instead? New pattern: \ba+(.+)\b. It uses a word boundary \b to anchor both ends of the word. It matches at least one "a", followed by the rest of the characters up to the closing word boundary. The remaining characters are captured in a group so you can reference them easily.

string pattern = @"\ba+(.+)\b";

foreach (Match m in Regex.Matches("aaabbbb", pattern))
{
    Console.WriteLine("Match: " + m.Value);
    Console.WriteLine("Group capture: " + m.Groups[1].Value);
}

UPDATE: If you want to skip the leading run of any repeated character and then match the rest of the string, you could do this:

string pattern = @"\b(.)(\1)*(?<Content>.+)\b";

foreach (Match m in Regex.Matches("aaabbbb", pattern))
{
    Console.WriteLine("Match: " + m.Value);
    Console.WriteLine("Group capture: " + m.Groups["Content"].Value);
}
Gowon answered 1/10, 2010 at 13:43 Comment(3)
Do it without having 'b' or 'a' in your regex.Whop
@John thanks I was fixated on the letter "a" specifically. My 2nd sample works with any duplicated character and without hardcoding it.Gowon
Alright, +1, I would argue that mine is a little more concise, but it looks like this is easier to read.Whop
