Regex optional capturing group?

After hours of searching, I decided to ask this question. Why doesn't the regular expression ^(dog).+?(cat)? work the way I expect, i.e. capture the first dog and, if there is one, the cat? What am I missing here?

dog, cat
dog, dog, cat
dog, dog, dog
Suspiration answered 28/2, 2015 at 14:4 Comment(0)

The reason you do not get the optional cat after a reluctantly quantified .+? is that the group is both optional and non-anchored. The engine is not forced to make that match: .+? is satisfied after a single character, (cat)? then matches nothing at that position, and the overall match ends before the cat is ever reached.
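
A minimal sketch of that behaviour, assuming Java's java.util.regex (the class and method names here are illustrative and not part of the original post). It runs the pattern from the question over the three sample inputs and shows where the match actually stops:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyOptionalDemo {
    // The pattern from the question: lazy .+? followed by an optional, unanchored group.
    private static final Pattern QUESTION = Pattern.compile("^(dog).+?(cat)?");

    // Returns {full match, group 1, group 2}, or null if nothing matches.
    public static String[] groups(String input) {
        Matcher m = QUESTION.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(0), m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        for (String s : new String[] {"dog, cat", "dog, dog, cat", "dog, dog, dog"}) {
            String[] g = groups(s);
            // For all three inputs the full match is "dog," and group 2 is null.
            System.out.println("'" + g[0] + "' | " + g[1] + " | " + g[2]);
        }
    }
}
```

The lazy quantifier only expands when something after it fails, and an optional group never fails, so nothing pushes .+? past the comma.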

If you anchor the cat at the end of the string, i.e. use ^(dog).+?(cat)?$, the lazy .+? is forced to expand until (cat)?$ can match, and you do get the capture:

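import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class AnchoredCatDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("^(dog).+?(cat)?$");
        for (String s : new String[] {"dog, cat", "dog, dog, cat", "dog, dog, dog"}) {
            Matcher m = p.matcher(s);
            if (m.find()) {
                // group(2) is null when the optional cat did not take part in the match
                System.out.println(m.group(1) + " " + m.group(2));
            }
        }
    }
}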

This prints (demo 1)

dog cat
dog cat
dog null

Do you happen to know how to deal with it in case there's something after cat?

You can deal with it by constructing a trickier expression that matches anything except cat, like this:

^(dog)(?:[^c]|c[^a]|ca[^t])+(cat)?

Now the cat could happen anywhere in the string without an anchor (demo 2).
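
As a rough check of this expression against the follow-up input, here is a sketch in Java. The second pattern, which uses a negative lookahead before every consumed character, is an assumed alternative and not part of the answer; all class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnanchoredCatDemo {
    // The expression from the answer: step over text that does not spell "cat".
    private static final Pattern CHAR_BY_CHAR =
            Pattern.compile("^(dog)(?:[^c]|c[^a]|ca[^t])+(cat)?");

    // Assumed alternative, not from the answer: a negative lookahead
    // checked before every consumed character.
    private static final Pattern LOOKAHEAD =
            Pattern.compile("^(dog)(?:(?!cat).)+(cat)?");

    // Returns the optional cat (group 2), or null if it was not captured.
    private static String cat(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static String charByChar(String input) {
        return cat(CHAR_BY_CHAR, input);
    }

    public static String lookahead(String input) {
        return cat(LOOKAHEAD, input);
    }

    public static void main(String[] args) {
        String[] inputs = {"dog, dog, cat, blah", "dog, dog, dog", "dog, ccat"};
        for (String s : inputs) {
            System.out.println(s + " -> " + charByChar(s) + " / " + lookahead(s));
        }
    }
}
```

Both patterns capture cat in "dog, dog, cat, blah" and leave group 2 null for "dog, dog, dog". They differ on an input such as "dog, ccat": the alternative c[^a] consumes two characters at once and swallows the c that starts cat, so the char-by-char version reports null there while the lookahead version still finds the cat.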

Rondarondeau answered 28/2, 2015 at 14:14 Comment(2)
Thanks. Do you happen to know how to deal with it in case there's something after cat? For example: dog, dog, cat, blah. I want to capture only first dog and optional cat (there can be at most one cat). - Suspiration
Would love to see this answer. - Sair

In no particular order, other options for matching such patterns are:

Method 1

With non-capturing groups:

^(?:dog(?:, |$))+(?:cat)?$

RegEx Demo 1

Or with capturing groups:

^(dog(?:, |$))+(cat)?$

RegEx Demo 2
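
One thing to keep in mind with the capturing variant is that a repeated group only keeps its last repetition. A small Java sketch (class and method names are illustrative) that also shows the anchors rejecting trailing text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Method 1 with capturing groups; the anchors make it validate the whole string.
    private static final Pattern LIST = Pattern.compile("^(dog(?:, |$))+(cat)?$");

    // Returns {group 1, group 2}, or null if the string is not a valid list.
    public static String[] groups(String input) {
        Matcher m = LIST.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        for (String s : new String[] {"dog, dog, cat", "dog, dog, dog", "dog, dog, cat, blah"}) {
            String[] g = groups(s);
            System.out.println(s + " -> "
                    + (g == null ? "no match" : "[" + g[0] + "] [" + g[1] + "]"));
        }
    }
}
```

For "dog, dog, cat" group 1 holds the last repetition, "dog, " with its separator, and group 2 holds cat. For "dog, dog, dog" group 1 is "dog" and group 2 is null. "dog, dog, cat, blah" does not match at all because of the trailing $.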


Method 2

With lookarounds,

(?<=^|, )dog|cat(?=$|,)

RegEx Demo 3

With word boundaries,

(?<=^|, )\b(?:dog|cat)\b(?=$|,)

RegEx Demo 4
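
Unlike Method 1, the lookaround expressions do not validate the whole string; each find() call returns one item. A minimal sketch of that usage with the word-boundary variant, assuming java.util.regex (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenFinderDemo {
    // Method 2 with word boundaries: each find() call returns one list item.
    private static final Pattern TOKEN =
            Pattern.compile("(?<=^|, )\\b(?:dog|cat)\\b(?=$|,)");

    // Collects every dog or cat that is delimited by the string edges or commas.
    public static List<String> tokens(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(tokens("dog, dog, cat"));
        System.out.println(tokens("dog, dog, dog"));
    }
}
```

To get "the first dog and the optional cat" from this, you would take the first element of the list and then check whether the list contains cat.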


Method 3

If the string could also contain just one cat and no dog, then

^(?:dog(?:, |$))*(?:cat)?$

would be an option too.

RegEx Demo 5

Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class RegularExpression{

    public static void main(String[] args){

        final String regex = "^(?:dog(?:, |$))*(?:cat)?$";
        final String string = "cat\n"
             + "dog, cat\n"
             + "dog, dog, cat\n"
             + "dog, dog, dog\n"
             + "dog, dog, dog, cat\n"
             + "dog, dog, dog, dog, cat\n"
             + "dog, dog, dog, dog, dog\n"
             + "dog, dog, dog, dog, dog, cat\n"
             + "dog, dog, dog, dog, dog, dog, dog, cat\n"
             + "dog, dog, dog, dog, dog, dog, dog, dog, dog\n";

        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);

        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }

    }
}

Output

Full match: cat
Full match: dog, cat
Full match: dog, dog, cat
Full match: dog, dog, dog
Full match: dog, dog, dog, cat
Full match: dog, dog, dog, dog, cat
Full match: dog, dog, dog, dog, dog
Full match: dog, dog, dog, dog, dog, cat
Full match: dog, dog, dog, dog, dog, dog, dog, cat
Full match: dog, dog, dog, dog, dog, dog, dog, dog, dog

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also watch in this link how it matches against some sample inputs.


RegEx Circuit

jex.im visualizes regular expressions:

Cristencristi answered 11/11, 2019 at 1:6 Comment(0)

@dasblinkenlight's answer is great, but here's a regexp that improves on its second part, the reply to the follow-up question:

Do you happen to know how to deal with it in case there's something after cat?

The regexp ^(dog)(.+(cat))? requires you to read capturing group 3 instead of 2 to get the optional cat, but it works just as well without the char-by-char trickery.
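
A short Java sketch of the group numbering, assuming java.util.regex (class and method names are illustrative). The greedy .+ first runs to the end of the string and then backtracks until the inner (cat) can match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyOptionalDemo {
    // The outer group is optional as a whole, so a missing cat leaves groups 2 and 3 null.
    private static final Pattern P = Pattern.compile("^(dog)(.+(cat))?");

    // Returns {group 1, group 2, group 3}, or null if nothing matches.
    public static String[] groups(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        for (String s : new String[] {"dog, dog, cat, blah", "dog, dog, dog"}) {
            String[] g = groups(s);
            System.out.println(s + " -> 1=" + g[0] + " 2=" + g[1] + " 3=" + g[2]);
        }
    }
}
```

For "dog, dog, cat, blah" group 2 is ", dog, cat" and group 3 is cat; for "dog, dog, dog" both are null.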

And here's the demo (which, again, is forked from @dasblinkenlight's demo which allowed me to tinker and find this solution, thanks again!)

Bessiebessy answered 7/11, 2016 at 17:46 Comment(1)
Also works with a non-capturing group, like ^(dog)(?:.+(cat))?, so you don't have an extra capturing group in there. - Richert

@figha's extension can be taken slightly further still, so that it does not make the unnecessary second capture.

Use ?: to make a parenthesized part of a regex non-capturing. The regex then becomes: ^(dog)(?:.+(cat))?
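
A quick way to confirm the effect in Java is to ask the matcher how many capturing groups the pattern defines. This is only a sketch; the class and method names are illustrative:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonCapturingDemo {
    // (?:...) groups without capturing, so the optional cat is group 2 again.
    private static final Pattern P = Pattern.compile("^(dog)(?:.+(cat))?");

    // Number of capturing groups defined by the pattern.
    public static int groupCount() {
        return P.matcher("").groupCount();
    }

    // Returns the optional cat (group 2), or null if it is absent or nothing matches.
    public static String cat(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println("capturing groups: " + groupCount());
        for (String s : new String[] {"dog, cat", "dog, dog, cat, blah", "dog, dog, dog"}) {
            System.out.println(s + " -> " + cat(s));
        }
    }
}
```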

Again, here's the extended demo and the regex test.

Fitter answered 9/12, 2016 at 10:31 Comment(0)
