Question marks in regular expressions

G

5

86

I'm reading the regular expressions reference and I'm wondering about the ? and ?? characters. Could you explain their usefulness with some examples? I don't understand them well enough.

thank you

Grader asked 7/4, 2011 at 15:26 Comment(2)

What is your target programming language for using regexes? Regexes behave a little differently across languages. – Chronicle 7/4, 2011 at 15:37

I used regex in python, C#, php, perl, visual basic, grep. – Grader 7/4, 2011 at 17:41

C

58

The key difference between ? and ?? concerns their laziness. ?? is lazy, ? is not.

Let's say you want to search for the word "car" in a body of text, but you don't want to be restricted to just the singular "car"; you also want to match against the plural "cars".

Here's an example sentence:

I own three cars.

Now, if I wanted to match the word "car" and I only wanted to get the string "car" in return, I would use the lazy ?? like so:

cars??

This says, "look for the word car or cars; if you find either, return car and nothing more".

Now, if I wanted to match against the same words ("car" or "cars") and I wanted to get the whole match in return, I'd use the non-lazy ? like so:

cars?

This says, "look for the word car or cars, and return either car or cars, whatever you find".

In the world of computer programming, lazy generally means "evaluating only as much as is needed". So the lazy ?? only returns as much as is needed to make a match; since the "s" in "cars" is optional, don't return it. On the flip side, non-lazy (sometimes called greedy) operations evaluate as much as possible, hence the ? returns all of the match, including the optional "s".

Personally, I find myself using ? as a way of making other regular expression operators lazy (like the * and + operators) more often than I use it for simple character optionality, but YMMV.
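
To illustrate that last point, here is a minimal sketch in Python (my choice of language, using the standard re module; the sample string is made up for illustration). It shows how appending ? to + changes it from greedy to lazy:

```python
import re

# Hypothetical sample input, not taken from the question.
text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ takes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.+?>", text).group()

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```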

See it in Code

Here's the above implemented in Clojure as an example:

(re-find #"cars??" "I own three cars.")
;=> "car"

(re-find #"cars?" "I own three cars.")
;=> "cars"

Here, re-find is a function that takes a regular expression (such as #"cars??") as its first argument and returns the first match it finds in its second argument, "I own three cars."
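
For comparison, a rough Python equivalent of the same two searches might look like this. It assumes the standard re module; the variable names are only illustrative:

```python
import re

sentence = "I own three cars."

# Lazy ??: the optional "s" is skipped, because the match succeeds without it.
lazy_match = re.search(r"cars??", sentence).group()

# Greedy ?: the optional "s" is taken, because it is available.
greedy_match = re.search(r"cars?", sentence).group()

print(lazy_match)    # car
print(greedy_match)  # cars
```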

Chronicle answered 7/4, 2011 at 15:49 Comment(6)

Your cars?? example is correct, but it returns the same results as if you had simply used car. You might need a different example to demonstrate the usefulness of ??. – Ganof 7/4, 2011 at 16:27

@Justin, true, but yours has the same problem. – Seriocomic 7/4, 2011 at 17:16

@Matthew Flaschen - The third input string in my answer produces identical results when you leave out the s??, but the others do not. That's how it differs from leaving the optional element out of the pattern: by making the same pattern work for all three input strings. – Ganof 7/4, 2011 at 19:0

@Chronicle Hi , what if the character that I want to check for zero or one occurrence is ? itself ? – Mejia 21/7, 2018 at 10:3

@VaradBhatnagar You would need to escape the ? character in your regular expression. As an example in Clojure, if you wanted to match the string foo?, you could use (re-find #"foo\?" "foo?") where \? escapes the question mark in the regular expression so that it is treated literally, rather than as a regular expression operator. – Chronicle 13/8, 2018 at 20:46

@Chronicle - I don't think I understand the part where the program knows that s is somehow an optional character in cars??. What if I applied that for another language and searched for either bil or biler (the latter being plural of the first word)? Would the program then look for bil and biler, or would it look for bile (where e is too many letters) and biler, or is s somehow a special character coupled with the first ? (i.e. s? asks if there's an s or no s)? Or maybe it asks whether any of the letters in biler is present, and so b would also return true? – Jonathonjonati 14/12, 2020 at 23:59

G

95

This is an excellent question, and it took me a while to see the point of the lazy ?? quantifier myself.

? - Optional (greedy) quantifier

The usefulness of ? is easy enough to understand. If you wanted to find both http and https, you could use a pattern like this:

https?

This pattern will match both inputs, because it makes the s optional.
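
As a quick check, here is a small Python sketch (Python's re module and the example.com URLs are my assumptions, used only for illustration):

```python
import re

pattern = re.compile(r"https?")

# Hypothetical sample inputs.
urls = ["http://example.com", "https://example.com"]

# match() anchors at the start of the string; group() returns the matched text.
schemes = [pattern.match(url).group() for url in urls]

print(schemes)  # ['http', 'https']
```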

?? - Optional (lazy) quantifier

?? is more subtle. It usually does the same thing ? does. It doesn't change the true/false result when you ask: "Does this input satisfy this regex?" Instead, it's relevant to the question: "Which part of this input matches this regex, and which parts belong in which groups?" If an input could satisfy the pattern in more than one way, the engine will decide how to group it based on ? vs. ?? (or * vs. *?, or + vs. +?).

Say you have a set of inputs that you want to validate and parse. Here's an (admittedly silly) example:

Input:       
http123
https456
httpsomething

Expected result:
Pass/Fail  Group 1   Group 2
Pass       http      123
Pass       https     456
Pass       http      something

You try the first thing that comes to mind, which is this:

^(http)([a-z\d]+)$

Pass/Fail  Group 1   Group 2    Grouped correctly?
Pass       http      123        Yes
Pass       http      s456       No
Pass       http      something  Yes

They all pass, but you can't use the second set of results because you only wanted 456 in Group 2.

Fine, let's try again. Let's say Group 2 can be letters or numbers, but not both:

(https?)([a-z]+|\d+)

Pass/Fail  Group 1   Group 2   Grouped correctly?
Pass       http      123       Yes
Pass       https     456       Yes
Pass       https     omething  No

Now the second input is fine, but the third one is grouped wrong because ? is greedy by default (the + is too, but the ? came first). When deciding whether the s is part of https? or [a-z]+|\d+, if the result is a pass either way, the regex engine will always pick the one on the left. So Group 2 loses s because Group 1 sucked it up.

To fix this, you make one tiny change:

(https??)([a-z]+|\d+)$

Pass/Fail  Group 1   Group 2    Grouped correctly?
Pass       http      123        Yes
Pass       https     456        Yes
Pass       http      something  Yes

Essentially, this means: "Match https if you have to, but see if this still passes when Group 1 is just http." The engine realizes that the s could work as part of [a-z]+|\d+, so it prefers to put it into Group 2.
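
If you want to reproduce the three tables yourself, a sketch along these lines should do it. It uses Python's re module; the names inputs, patterns and groups_for are just illustrative helpers, not part of any library:

```python
import re

inputs = ["http123", "https456", "httpsomething"]

# The three patterns tried above, in order.
patterns = {
    "plain":  r"^(http)([a-z\d]+)$",
    "greedy": r"(https?)([a-z]+|\d+)",
    "lazy":   r"(https??)([a-z]+|\d+)$",
}

def groups_for(pattern):
    """Return (Group 1, Group 2) for every input string."""
    return [re.search(pattern, s).groups() for s in inputs]

results = {name: groups_for(p) for name, p in patterns.items()}

for name, rows in results.items():
    print(name, rows)
# plain  [('http', '123'), ('http', 's456'), ('http', 'something')]
# greedy [('http', '123'), ('https', '456'), ('https', 'omething')]
# lazy   [('http', '123'), ('https', '456'), ('http', 'something')]
```

Only the lazy version groups all three inputs the way the expected-result table asks for.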

Ganof answered 7/4, 2011 at 15:49 Comment(10)

In all your cases, https??([a-z]+|\d+) and http([a-z]+|\d+) (no s before capture at all) give the same matches and captures. So I don't see how this is a meaningful example. – Seriocomic 7/4, 2011 at 17:12

Your answer is excellent too. Actually I had problem only with ?? :-) and was looking what is different in opposite to ? . – Grader 7/4, 2011 at 17:53

@Matthew http([a-z]+|\d+) won't match https(456). That's the difference. – Grader 7/4, 2011 at 17:58

@xralf, no. They both match with exactly the same match and capture: With ??, Without. – Seriocomic 7/4, 2011 at 18:2

@Matthew Flaschen - They work the same for that input. http([a-z]+|\d+)$ will not match https456. https??([a-z]+|\d+)$ will, and still have the expected results for https456. That's the difference. – Ganof 7/4, 2011 at 19:1

@Grader - Note that I missed the $ before; it's necessary to capture properly from https456. See edit. – Ganof 7/4, 2011 at 19:4

@Justin, yep, the demonstration is valid with the end anchor. – Seriocomic 7/4, 2011 at 20:23

@JustinMorgan - Would htt, ht and h also return true with https?? – Jonathonjonati 15/12, 2020 at 0:4

@Jonathonjonati - No, because only the s is marked optional. The http part is required. – Ganof 6/4, 2021 at 14:40

This answer is exactly like the "The Man Song" for explaining how two question marks work. But this explanation is fantastic, because it needs this explanation to understand it!! youtube.com/watch?v=t7Y0I91rubg – Auditor 9/5 at 16:54

P

36

Some Other Uses of Question Marks in Regular Expressions

Apart from what's explained in the other answers, there are three more uses of question marks in regular expressions.

Negative Lookahead

A negative lookahead is used when you want to match something that is not followed by something else. The construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point: x(?!x2) matches x only when it is not immediately followed by x2.

Example
- Consider the word There.
- By default, the RegEx e finds the first e, which is the third letter of There.
```
There
  ^
```
- However, if you don't want an e that is immediately followed by r, you can use the RegEx e(?!r). Now the result is:
```
There
    ^
```
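
The same thing as a small Python sketch (assuming the standard re module; start() gives the zero-based index of the match):

```python
import re

word = "There"

# Plain pattern: the first "e" wins (index 2).
plain = re.search(r"e", word)

# Negative lookahead: skip any "e" that is immediately followed by "r".
negative = re.search(r"e(?!r)", word)

print(plain.start())     # 2
print(negative.start())  # 4
```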
Positive Lookahead

Positive lookahead works the same way, except that the following text must be present rather than absent. q(?=u) matches a q that is immediately followed by a u, without making the u part of the match. The positive lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.

Example
- Consider the word getting.
- By default, the RegEx t finds the first t, which is the third letter of getting.
```
getting
  ^
```
- However, if you want the t that is immediately followed by i, you can use the RegEx t(?=i). Now the result is:
```
getting
   ^
```
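
Again as a Python sketch (assuming the standard re module). Note that the i is checked but does not become part of the match:

```python
import re

word = "getting"

# Plain pattern: the first "t" wins (index 2).
plain = re.search(r"t", word)

# Positive lookahead: only a "t" that is immediately followed by "i".
positive = re.search(r"t(?=i)", word)

print(plain.start())     # 2
print(positive.start())  # 3
print(positive.group())  # t
```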
Non-Capturing Groups

Whenever you place part of a regular expression in parentheses (), it creates a numbered capturing group, which stores the part of the string matched by the part of the regular expression inside the parentheses.

If you do not need the group to capture its match, you can turn it into a non-capturing group by putting ?: right after the opening parenthesis:
```
(?:Value)
```
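
A short Python sketch of the difference (assuming the standard re module; the sample string and the =(\d+) part of the pattern are made up for illustration):

```python
import re

# Hypothetical sample input.
text = "Set Value=42"

# Capturing group: "Value" is stored as group 1, the digits as group 2.
capturing = re.search(r"(Value)=(\d+)", text)

# Non-capturing group: "Value" is still matched, but only the digits are stored.
non_capturing = re.search(r"(?:Value)=(\d+)", text)

print(capturing.groups())      # ('Value', '42')
print(non_capturing.groups())  # ('42',)
```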
