Non-greedy Regular Expression in Java

I have the following code:

public static void createTokens(){
    String test = "test is a word word word word big small";
    Matcher mtch = Pattern.compile("test is a (\\s*.+?\\s*) word (\\s*.+?\\s*)").matcher(test);
    while (mtch.find()){
        for (int i = 1; i <= mtch.groupCount(); i++){
            System.out.println(mtch.group(i));
        }
    }
}

It produces the following output:

word
w

But I expected it to be:

word
word

Can somebody please explain why?

Allusive answered 19/1, 2012 at 18:16 Comment(0)

Because your quantifiers are non-greedy (reluctant), each group matches as little text as possible while still letting the overall pattern match.

Remove the ? after .+ in the second group, and you'll get:
word
word word big small

Matcher mtch = Pattern.compile("test is a (\\s*.+?\\s*) word (\\s*.+\\s*)").matcher(test);
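To see the difference side by side, here is a minimal sketch that runs both versions of the pattern against the string from the question and prints only the second group. The class name ReluctantVsGreedy and the helper secondGroup are made up for this illustration; only java.util.regex is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantVsGreedy {
    // Hypothetical helper: returns what group 2 captured in the first match,
    // or null if the pattern does not match at all.
    static String secondGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String test = "test is a word word word word big small";

        // Reluctant .+? in group 2: nothing follows the group, so one character is enough.
        String reluctant = secondGroup("test is a (\\s*.+?\\s*) word (\\s*.+?\\s*)", test);
        System.out.println("[" + reluctant + "]"); // prints [w]

        // Greedy .+ in group 2: it runs to the end of the input.
        String greedy = secondGroup("test is a (\\s*.+?\\s*) word (\\s*.+\\s*)", test);
        System.out.println("[" + greedy + "]"); // prints [word word big small]
    }
}
```

The first group still captures "word" in both cases, because the literal " word " after it forces the reluctant .+? to expand until that literal can match.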
Triggerhappy answered 19/1, 2012 at 18:22 Comment(2)
And now the second group is capturing too much instead of too little. Non-greediness is not the problem, and greediness is not the solution. - Greengrocery
You're correct, but IMHO the non-greediness of the second capturing group explains why it captures only "w". The first capturing group has to capture "word" because of the "word" literal following it. I don't know exactly what he's looking for, and he edited the question after I submitted my answer, so I can't supply a correct regexp. - Triggerhappy
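If the goal is what the question's expected output suggests, one whole word in each group, one option is to make the groups match runs of non-whitespace with \\S+ instead of relying on greedy or reluctant .+. This is only a sketch under that assumption about the asker's intent; the class name WholeWordGroups and the helper firstTwoGroups are invented here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeWordGroups {
    // Hypothetical helper: returns {group 1, group 2} of the first match,
    // or null if the pattern does not match.
    static String[] firstTwoGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String test = "test is a word word word word big small";

        // \S+ cannot cross a space, so each group stops at the end of one word.
        String[] groups = firstTwoGroups("test is a (\\S+) word (\\S+)", test);
        System.out.println(groups[0]); // prints word
        System.out.println(groups[1]); // prints word
    }
}
```

Because \\S+ can never consume whitespace, it does not matter here whether the quantifier is greedy or reluctant as long as something delimits the word; with nothing after the second group, the greedy form is what makes it take the full word.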

\\s* matches any number of whitespace characters, including zero, so the single character w is enough to satisfy (\\s*.+?\\s*). To make sure the group matches a word delimited by whitespace, try (\\s+.+?\\s+)

Pudding answered 19/1, 2012 at 18:23 Comment(2)
Trouble is, the regex is already consuming the space characters before and after the word, so now you're trying to consume them twice. - Greengrocery
All you would need to do is remove the spaces from the regex, like ...\\s+)word(\\s+... - Obe
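The two comments above can be checked with a short sketch. It assumes the rest of the pattern stays as in the question and only the literal spaces around the groups change; the class name WhitespaceDelimitedGroups and the helper groups are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceDelimitedGroups {
    // Hypothetical helper: returns all capturing groups of the first match,
    // or null if the pattern does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String test = "test is a word word word word big small";

        // Literal spaces kept: the space is consumed by the literal, so \s+ has nothing left.
        String[] withSpaces = groups("test is a (\\s+.+?\\s+) word (\\s+.+?\\s+)", test);
        System.out.println(withSpaces == null ? "no match" : "match"); // prints no match

        // Literal spaces removed: the groups themselves consume the surrounding whitespace.
        String[] withoutSpaces = groups("test is a(\\s+.+?\\s+)word(\\s+.+?\\s+)", test);
        System.out.println("[" + withoutSpaces[0] + "]"); // prints [ word ]
        System.out.println("[" + withoutSpaces[1] + "]"); // prints [ word ]
    }
}
```

Note that with the spaces removed, each group includes its leading and trailing space, so a call to trim() on the captured text would be needed to get the bare word.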
