Make one or zero regex operator greedy

I have two sentences as input. Let's say for example:

<span>I love my red car.</span>
<span>I love my car.</span>

Now I want to match the text inside each pair of span tags and, if it is present, the color.

If I use the following regex:

/<span>(.*?)(?P<color>red)(.*?)<\/span>/ms

Only the line with the color is matched. So I thought I would use the ? operator (zero or one) to make the color optional:

/<span>(.*?)(?P<color>red)?(.*?)<\/span>/ms

Now both lines/sentences will be matched. Sadly the color isn't matched anymore.

The question is: why? By using ".*?" before the color part, I thought I had made the regex non-greedy, so that the color part would match whenever it is present. But as described above, it doesn't.

Newmown answered 18/9, 2013 at 7:16 Comment(5)
Regex + markup go together like petrol and mules: though both are useful, they don't work well together. Use DOMDocument. - Fley
@EliasVanOotegem DOMDocument is not the point here, since the question is about parsing the string "I love my red car", which is just plain text. - Heidt
@AlmaDoMundo "I want to match every textpart inside the span-tags" => Who's to say that the snippet provided isn't part of a bigger string of markup, containing div tags? - Fley
@EliasVanOotegem I think it's irrelevant, since the question is the same regardless of whether this is in HTML or not, as long as it's "something between two somethings". - Katabatic
What does this have to do with the 'one or zero' regex operator? - Dupuis

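The first (.*?) starts by matching the empty string between > and I. Because it is lazy, the engine immediately tries the next part of the regex, (?P<color>red)?, but there is no red at that position, so the 'zero' option of ? applies and the engine moves on to the next part, which is the second (.*?). That group also starts out empty at the same position, and because it is lazy the engine tries the next part of the regex: <\/span> (I'm taking it as a whole). That fails, so the second (.*?) takes one more character and <\/span> is tried again, and so on.

So the second (.*?) ends up matching everything up to <\/span>, and the engine never needs to backtrack into the optional color group.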

Indeed, your results[1] will be empty, as will results['color'] (assuming PHP, where the key has to be quoted), and results[3] will contain I love my red car.
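To see that behaviour directly, here is a minimal sketch in Python rather than PHP (an assumption on my part; Python's re module accepts the same (?P<color>...) syntax, with the / delimiters and the \/ escape dropped). The variable names are only illustrative. Note that Python reports a skipped optional group as None, whereas PHP's preg functions would normally give an empty string.

```python
import re

# Sample input from the question: two span-wrapped sentences.
text = "<span>I love my red car.</span>\n<span>I love my car.</span>"

# The question's second pattern, with the colour group made optional.
pattern = re.compile(r"<span>(.*?)(?P<color>red)?(.*?)</span>", re.M | re.S)

# Collect (group 1, colour, group 3) for every match.
rows = [(m.group(1), m.group("color"), m.group(3)) for m in pattern.finditer(text)]

for row in rows:
    # Group 1 stays empty, the optional group is skipped, and the second
    # lazy group grows until </span> can match.
    print(row)
```

Both sentences are matched, but the whole text lands in the third group and the colour is never captured, even for the sentence that contains red.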

Hmm, one workaround is to use OR, as NickC mentioned in his answer. Another option is to use a negative lookahead that checks, character by character, that the color does not start at the current position:

<span>((?:(?!\bred\b).)*(?<colour>\bred\b)?.*)<\/span>

regex101 demo

As a side note, I would advise using word boundaries (\b) so that you don't match things like reduce or jarred.
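A rough sketch of the lookahead version, again in Python as an assumption (Python's re needs the (?P<colour>...) spelling rather than (?<colour>...)). The third test sentence is made up to exercise the word boundaries. I apply the pattern line by line without the s flag, since the trailing .* is greedy and could otherwise run past the first closing tag.

```python
import re

# Tempered dot: consume characters only while the whole word "red"
# does not start at the current position, then optionally capture it.
pattern = re.compile(r"<span>((?:(?!\bred\b).)*(?P<colour>\bred\b)?.*)</span>")

lines = [
    "<span>I love my red car.</span>",
    "<span>I love my car.</span>",
    "<span>I reduced my jarred stock.</span>",  # hypothetical: no standalone "red"
]

results = {}
for line in lines:
    m = pattern.search(line)
    # Map the full sentence (group 1) to the captured colour, if any.
    results[m.group(1)] = m.group("colour")

for sentence, colour in results.items():
    print(sentence, "->", colour)
```

Only the first sentence yields a colour; reduced and jarred are rejected by the \b checks on either side of red.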

Parricide answered 18/9, 2013 at 7:35 Comment(2)
Thank you for the explanation on why it doesn't work! I do like my solution as it doesn't require double entry of the possible values :) - Katabatic
@NickC I was about to post something a bit like yours and then you posted your answer before I could; I just didn't want to have the same regex ^^; - Parricide

This should work:

/<span>(.*?(?P<color>red).*?|.*?)<\/span>/ms

Your original expression was pretty good. I modified it slightly to make a new outer group match the whole sentence. I used that new outer group to create an "or" condition to match "anything", in case the color is not present.
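For anyone who wants to try the alternation outside PHP, here is a small sketch in Python (my assumption; the pattern is the same apart from the dropped / delimiters and \/ escape, and the variable names are illustrative):

```python
import re

text = "<span>I love my red car.</span>\n<span>I love my car.</span>"

# Outer group: either "anything, red, anything" or, failing that, "anything".
pattern = re.compile(r"<span>(.*?(?P<color>red).*?|.*?)</span>", re.M | re.S)

# One (sentence, colour) pair per matched span.
matches = [(m.group(1), m.group("color")) for m in pattern.finditer(text)]

for sentence, colour in matches:
    print(sentence, "->", colour)
```

One caveat worth checking against your real input: with the s flag the first alternative can search past a closing tag, so if a sentence without red comes before one with red, the two spans may be swallowed as a single match. The sample input has the red sentence first, so it does not show this.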

Abbreviated output:

Array
    [0] => Array
            [0] => <span>I love my red car.</span>
            [1] => <span>I love my car.</span>

    [1] => Array
            [0] => I love my red car.
            [1] => I love my car.

    [color] => Array
            [0] => red
            [1] => 
Katabatic answered 18/9, 2013 at 7:30 Comment(0)
