Which would be better: a non-greedy regex or a negated character class?

I need to match @anything_here@ from the string @anything_here@dhhhd@shdjhjs@, so I used the following regex:

^@.*?@

or

^@[^@]*@

Both work, but I would like to know which is the better solution: a regex with non-greedy repetition or a regex with a negated character class?
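For reference, here is a minimal sketch of the two patterns run against the sample string. It assumes Python's `re` module purely for illustration (the question does not name a regex flavor); the variable names are made up for this example.

```python
import re

# Sample string from the question.
text = "@anything_here@dhhhd@shdjhjs@"

# Lazy dot: expand one character at a time until the next @ is found.
lazy = re.match(r"^@.*?@", text)

# Negated character class: greedily consume everything that is not @.
negated = re.match(r"^@[^@]*@", text)

print(lazy.group(0))     # @anything_here@
print(negated.group(0))  # @anything_here@
```

Both patterns return the same text here; the difference lies in how the engine gets there and in how the patterns behave on other inputs.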

Vudimir answered 21/12, 2016 at 18:14 Comment(1)
It is clear the ^@[^@]*@ option is much better. - Limburg

Negated character classes should usually be preferred over lazy matching where possible.

If the regex is successful, ^@[^@]*@ can match the content between @s in a single step, while ^@.*?@ needs to expand for each character between @s.

When the match fails (for example, when there is no closing @), some regex engines (PCRE, for example) optimize internally and treat [^@]* as the possessive [^@]*+, because there is a clear-cut boundary between @ and non-@ characters. The engine then matches to the end of the string, notices the missing @, and fails immediately without backtracking. .*? expands character by character as usual.

When used in larger patterns, [^@]* will also never expand past the closing @, whereas lazy matching can. For example, ^@[^@]*a[^@]*@ won't match @bbbb@a@, while ^@.*?a.*?@ will.
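The @bbbb@a@ case above can be checked with a short sketch. Python's `re` module is assumed here only as a convenient backtracking engine; other flavors should behave the same way for these two patterns.

```python
import re

text = "@bbbb@a@"

# [^@]* can never cross an @, so the required "a" must sit
# between the first and second @. It does not, so there is no match.
negated = re.match(r"^@[^@]*a[^@]*@", text)

# The lazy dot is allowed to expand across the second @ in order to
# reach an "a", so the pattern matches more than intended.
lazy = re.match(r"^@.*?a.*?@", text)

print(negated)        # None
print(lazy.group(0))  # @bbbb@a@
```

The lazy version is lazy only about how much it takes, not about what it is allowed to take, which is why it leaks past the delimiter.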

Note that [^@] will also match newlines, while . doesn't (in most regex engines, unless single-line/dotall mode is enabled). If that is not wanted, add the newline character to the negated class, e.g. [^@\n].
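A small sketch of the newline difference, assuming Python's `re` module, where the dot does not match `\n` unless `re.DOTALL` is set. As the comment below points out, other flavors may treat the dot differently, so this illustrates one flavor only.

```python
import re

# The content between the @s spans two lines.
text = "@line one\nline two@rest"

# [^@] matches the newline, so the match spans both lines.
negated = re.match(r"^@[^@]*@", text)

# By default the dot stops at the newline, so this fails.
lazy_default = re.match(r"^@.*?@", text)

# With DOTALL the dot matches newlines as well.
lazy_dotall = re.match(r"^@.*?@", text, re.DOTALL)

# Adding \n to the negated class restores the single-line behavior.
negated_single_line = re.match(r"^@[^@\n]*@", text)

print(repr(negated.group(0)))      # '@line one\nline two@'
print(lazy_default)                # None
print(repr(lazy_dotall.group(0)))  # '@line one\nline two@'
print(negated_single_line)         # None
```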

Tormentor answered 21/12, 2016 at 18:23 Comment(1)
"Note that [^@] will also match newlines, while . doesn't" is not true without specifying the regex flavor. In the POSIX, TRE and Tcl (Henry Spencer's regex library) flavors, a dot matches line break characters by default. - Limburg

It is clear the ^@[^@]*@ option is much better.

The negated character class is quantified greedily, which means the regex engine immediately grabs zero or more chars other than @, as many as possible. See this regex demo and its matching steps:

[image: matching steps for ^@[^@]*@]

When you use a lazy dot matching pattern, the engine matches @, then tries to match the trailing @ (skipping the .*?). It does not find an @ at index 1, so the .*? matches the a char. The .*? pattern keeps expanding this way, once for each char other than @, up to the first @.
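The expansion described above can be modeled with a few lines of code. This is a simplified, hand-written model of how a backtracking engine handles ^@.*?@, not the internals of any real engine; the function name `lazy_expansions` is hypothetical, and the input is assumed to contain no newlines.

```python
import re


def lazy_expansions(text):
    """Count how often .*? has to expand by one char in ^@.*?@.

    Simplified model: after the leading @, try the trailing @ at each
    position; on failure, let .*? consume one more char and try again.
    Returns None if the pattern cannot match.
    """
    if not text.startswith("@"):
        return None
    expansions = 0
    pos = 1
    while pos < len(text):
        if text[pos] == "@":  # the trailing @ finally matches
            return expansions
        expansions += 1       # .*? expands by one char
        pos += 1
    return None               # ran off the end: no closing @


text = "@anything_here@dhhhd@shdjhjs@"
print(lazy_expansions(text))  # 13

# Cross-check: the count equals the length of what .*? ends up matching.
print(len(re.match(r"^@(.*?)@", text).group(1)))  # 13
```

In this model the lazy pattern needs one expansion per character between the @s, whereas the greedy negated class would consume the same run in a single quantifier step.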

See the lazy dot matching pattern demo; here are the matching steps:

[image: matching steps for ^@.*?@]

Ceylon answered 21/12, 2016 at 18:17 Comment(2)
Tip: The process diagrams can be found under TOOLS > regex debugger. - Ammonate
Exactly what I was going to ask. - Mor
