Regex Non-Greedy (Lazy)

I'm attempting to non-greedily parse out TD tags. I'm starting with something like this:

<TD>stuff<TD align="right">More stuff<TD align="right>Other stuff<TD>things<TD>more things

I'm using the below as my regex:

Regex.Split(tempS, @"\<TD[.\s]*?\>");

The records return as below:

""
"stuff<TD align="right">More stuff<TD align="right>Other stuff"
"things"
"more things"

Why is it not splitting that first full result (the one starting with "stuff")? How can I adjust the regex to split on all instances of the TD tag with or without parameters?

Package answered 12/12, 2012 at 16:28 Comment(4)
Please see #1732848Synonym
Inside a character class, . is just a literal dot: [.] does not mean 'any character'. You may have more success with [^>]*, but it would break on a > in an attribute (which is one of the reasons why we often look at parsers rather than regexes to manipulate HTML & XML).Gerhardine
@Gerhardine The HTML here is pretty static. There isn't much variation and I know the regex that would work for it. I didn't go the route of parsers because of that. Is there a way to make the . character mean 'any character' including whitespace?Package
I don't know the C# modifier that makes the dot match everything (in PCRE it would be /s). However, [^>]*> is functionally equivalent to (.|\s)*?>, and probably easier on the regex engine.Gerhardine
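
To make the point from the comments concrete, here is a minimal sketch of why the original pattern only splits on bare <TD> tags. The class and method names are hypothetical, and the second test string is made up purely to show what [.\s] does accept.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotInClassDemo
{
    // The pattern from the question. \< and \> are simply escaped literal < and >.
    public const string Original = @"\<TD[.\s]*?\>";

    // True when the original pattern finds a match somewhere in the given text.
    public static bool Matches(string tag) => Regex.IsMatch(tag, Original);

    public static void Main()
    {
        // Inside [...], the dot is a literal dot, so only dots and whitespace
        // are allowed between "<TD" and ">".
        Console.WriteLine(Matches("<TD>"));                 // True
        Console.WriteLine(Matches("<TD . .>"));             // True
        Console.WriteLine(Matches("<TD align=\"right\">")); // False
    }
}
```

Because letters, quotes and = are not in [.\s], any tag with attributes is skipped, which is why the "stuff" record in the question was never split.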

The regex you want is <TD[^>]*>:

<     # Match opening tag
TD    # Followed by TD
[^>]* # Followed by anything not a > (zero or more)
>     # Closing tag

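Note: outside a character class, . matches any character except a newline (spaces and tabs included). Inside a character class it loses that meaning: [.] matches only a literal dot, so [.\s]*? matches nothing but dots and whitespace, which is why tags with attributes were skipped. If you want a lazy match instead, use .*?.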
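
As a rough sketch, the commented layout above can be passed to .NET almost verbatim by using RegexOptions.IgnorePatternWhitespace, which ignores unescaped whitespace and # comments in the pattern. The class and member names here are hypothetical; the input is the string from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TdSplitDemo
{
    // Sample input from the question (including its unbalanced quote).
    public const string Input =
        "<TD>stuff<TD align=\"right\">More stuff<TD align=\"right>Other stuff<TD>things<TD>more things";

    // Same pattern as <TD[^>]*>, written with comments.
    public const string VerbosePattern = @"
        <      # Match opening tag
        TD     # Followed by TD
        [^>]*  # Followed by anything not a > (zero or more)
        >      # Closing tag";

    public static string[] SplitOnTd(string input) =>
        Regex.Split(input, VerbosePattern, RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        // Prints "", "stuff", "More stuff", "Other stuff", "things", "more things",
        // one per line. The leading empty string is there because the input
        // starts with a tag.
        foreach (string part in SplitOnTd(Input))
            Console.WriteLine($"\"{part}\"");
    }
}
```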

Indignation answered 12/12, 2012 at 16:36 Comment(1)
By default, . does not match new line but \s does.Blase
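
A small sketch of that difference, assuming a hypothetical tag whose attributes wrap onto a second line (the question's input has no such tag). In .NET, RegexOptions.Singleline is the option that lets . match a newline, the counterpart of /s in PCRE. The class name is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // Hypothetical input: the attribute list starts on a new line.
    public const string Input = "<TD\n align=\"right\">More stuff";

    // Default behaviour: . stops at the newline, so no tag is found.
    public static bool DefaultDot() => Regex.IsMatch(Input, "<TD.*?>");

    // With Singleline, . also matches \n.
    public static bool SinglelineDot() =>
        Regex.IsMatch(Input, "<TD.*?>", RegexOptions.Singleline);

    // A negated class matches \n without any option.
    public static bool NegatedClass() => Regex.IsMatch(Input, "<TD[^>]*>");

    public static void Main()
    {
        Console.WriteLine(DefaultDot());    // False
        Console.WriteLine(SinglelineDot()); // True
        Console.WriteLine(NegatedClass());  // True
    }
}
```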

For a non-greedy match, try <TD.*?>

Gigantic answered 12/12, 2012 at 16:47 Comment(1)
@Hambone Because the ? after the quantifier * tells the regex engine to stop consuming characters as soon as the expression that follows the ? (here, >) can match. That is the difference between greedy and non-greedy *.Electricity

From https://regex101.com/

  • * Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  • *? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
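
The two definitions can be seen side by side in a short sketch. The input is a shortened, hypothetical version of the question's string, and the class name is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyLazyDemo
{
    // Two tags on one line, so greedy and lazy matching give different results.
    public const string Input = "<TD>stuff<TD align=\"right\">More stuff";

    // Greedy: .* runs to the end of the line, then gives back until a > fits.
    public static string Greedy() => Regex.Match(Input, "<TD.*>").Value;

    // Lazy: .*? starts with zero characters and expands only until a > fits.
    public static string Lazy() => Regex.Match(Input, "<TD.*?>").Value;

    public static void Main()
    {
        Console.WriteLine(Greedy()); // <TD>stuff<TD align="right">
        Console.WriteLine(Lazy());   // <TD>
    }
}
```

The greedy version swallows everything up to the last > on the line, which is why it is unsuitable as a split delimiter here.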
Tera answered 11/6, 2018 at 6:12 Comment(0)