Using regex to match string between two strings while excluding strings
R

5

5

Following on from a previous question in which I asked:

How can I use a regular expression to match text that is between two strings, where those two strings are themselves enclosed by two other strings, with any amount of text between the inner and outer enclosing strings?

I got this answer:

/outer-start.*?inner-start(.*?)inner-end.*?outer-end/

I would now like to know how to exclude certain strings from the text between the outer enclosing strings and the inner enclosing strings.

For example, if I have this text:

outer-start some text inner-start text-that-i-want inner-end some more text outer-end

I would like 'some text' and 'some more text' not to contain the word 'unwanted'.

In other words, this is OK:

outer-start some wanted text inner-start text-that-i-want inner-end some more wanted text outer-end

But this is not OK:

outer-start some unwanted text inner-start text-that-i-want inner-end some more unwanted text outer-end

Put another way, the parts of the pattern in the previous answer that match between the outer and inner delimiters should not match text containing the word 'unwanted'.

Is this easy to match using regexes?

Revolver answered 2/1, 2010 at 22:53 Comment(1)
What exactly are you trying to do? - Woolley
C
6

Replace the first and last (but not the middle) .*? with (?:(?!unwanted).)*?. (Where (?:...) is a non-capturing group, and (?!...) is a negative lookahead.)

However, this quickly runs into corner cases and caveats in any real (rather than example) use. If you ask about what you're really doing, with real examples (even simplified ones) instead of made-up ones, you'll likely get better answers.
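As a rough illustration of the replacement described in this answer, here is a minimal Python sketch using the standard `re` module. The choice of Python and the variable names (`PATTERN`, `ok`, `bad`) are assumptions made for the example; the two sample strings are the ones from the question.

```python
import re

# The first and last ".*?" are replaced by "(?:(?!unwanted).)*?": each
# character in the outer gaps is consumed only if the text at that
# position does not start the word "unwanted".
PATTERN = re.compile(
    r"outer-start(?:(?!unwanted).)*?inner-start(.*?)inner-end"
    r"(?:(?!unwanted).)*?outer-end"
)

ok = ("outer-start some wanted text inner-start text-that-i-want "
      "inner-end some more wanted text outer-end")
bad = ("outer-start some unwanted text inner-start text-that-i-want "
       "inner-end some more unwanted text outer-end")

match = PATTERN.search(ok)
# Group 1 includes the surrounding spaces, so strip them for display.
print(match.group(1).strip() if match else None)
print(PATTERN.search(bad))  # no match is expected for this string
```

The middle `(.*?)` is left untouched, so with several blocks in one input it may still stretch across block boundaries, which is one of the corner cases the answer warns about.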

Clausen answered 2/1, 2010 at 23:3 Comment(1)
That's a better solution than mine. - Lampert
A
1

You can replace .*? with

 ([^u]|u[^n]|un[^w]|unw[^a]|unwa[^n]|unwan[^t]|unwant[^e]|unwante[^d])*?

This is a solution in "pure" regex; the language you are using might allow you to use some more elegant construct.

Alamein answered 2/1, 2010 at 23:2 Comment(0)
L
1

You can't easily do that with plain regexes, but some systems such as Perl have extensions that make it easier. One way is to use a negative look-ahead assertion:

/outer-start(?:u(?!nwanted)|[^u])*?inner-start(.*?)inner-end.*?outer-end/

The key is to split up the "unwanted" into ("u" not followed by "nwanted") or (not "u"). That allows the pattern to advance, but will still find and reject all "unwanted" strings.
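A small Python sketch can show how this split behaves. The pattern is copied unchanged from this answer; the three sample strings and the variable names are made up for illustration. Note that, as written, the pattern only guards the text before inner-start: the final `.*?` is not restricted, so the same substitution would presumably be needed there as well.

```python
import re

# Pattern from the answer, unchanged: "u" is allowed unless it is
# followed by "nwanted"; any character other than "u" is always allowed.
pattern = re.compile(
    r"outer-start(?:u(?!nwanted)|[^u])*?inner-start(.*?)inner-end.*?outer-end"
)

# Hypothetical sample strings, invented for this illustration.
harmless_u = ("outer-start an unusual text inner-start text-that-i-want "
              "inner-end more text outer-end")
bad_before = ("outer-start some unwanted text inner-start text-that-i-want "
              "inner-end more text outer-end")
bad_after = ("outer-start some text inner-start text-that-i-want "
             "inner-end more unwanted text outer-end")

for sample in (harmless_u, bad_before, bad_after):
    # True when the pattern finds a match somewhere in the sample.
    print(bool(pattern.search(sample)))
```

The first sample matches because its "u" characters are not followed by "nwanted", the second is rejected, and the third still matches because the tail after inner-end is unguarded.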

People may start hating your code if you do much of this though. ;)

Lampert answered 2/1, 2010 at 23:5 Comment(0)
O
0

Tola, resurrecting this question because it had a fairly simple regex solution that wasn't mentioned. This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

The idea is to build an alternation (a series of |) where the left sides match what we don't want in order to get it out of the way... then the last side of the | matches what we do want, and captures it to Group 1. If Group 1 is set, you retrieve it and you have a match.

So what do we not want?

First, we want to eliminate the whole outer block if there is unwanted between outer-start and inner-start. You can do it with:

outer-start(?:(?!inner-start).)*?unwanted.*?outer-end

This will be to the left of the first |. It matches a whole outer block.

Second, we want to eliminate the whole outer block if there is unwanted between inner-end and outer-end. You can do it with:

outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end

This will be the middle |. It looks a bit complicated because we want to make sure that the "lazy" *? does not jump over the end of a block into a different block.

Third, we match and capture what we want. This is:

inner-start\s*(text-that-i-want)\s*inner-end

So the whole regex, in free-spacing mode, is:

(?xs)
outer-start(?:(?!inner-start).)*?unwanted.*?outer-end # don't want this
| # OR (also don't want that)
outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
| # OR capture what we want
inner-start\s*(text-that-i-want)\s*inner-end

On this demo, look at the Group 1 captures on the right: It contains what we want, and only for the right block.
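The Group 1 approach can also be tried outside a demo page. Below is a minimal Python sketch, assuming the standard `re` module with `re.VERBOSE` and `re.DOTALL` standing in for the inline `(?xs)`. The three sample blocks and all variable names are hypothetical, chosen so that only the middle block is free of 'unwanted'.

```python
import re

# Same three-way alternation as above. The first two branches consume
# whole blocks we do not want; only the last branch sets Group 1.
PATTERN = re.compile(
    r"""
    outer-start(?:(?!inner-start).)*?unwanted.*?outer-end   # unwanted before inner-start
    |
    outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
    |
    inner-start\s*(text-that-i-want)\s*inner-end            # what we want, in Group 1
    """,
    re.VERBOSE | re.DOTALL,
)

# Hypothetical input: a bad block, a good block, then another bad block.
text = "\n".join([
    "outer-start some unwanted text inner-start text-that-i-want inner-end some more text outer-end",
    "outer-start some wanted text inner-start text-that-i-want inner-end some more wanted text outer-end",
    "outer-start some text inner-start text-that-i-want inner-end some more unwanted text outer-end",
])

matches = list(PATTERN.finditer(text))
# Keep only the matches where the capturing branch was the one that fired.
captures = [m.group(1) for m in matches if m.group(1) is not None]
print(len(matches), captures)
```

In this sketch the regex matches three times, but Group 1 is set only for the middle block, which is why the calling code has to check the group rather than the overall match.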

In Perl and PCRE (used for instance in PHP), you don't even have to look at Group 1: you can force the regex to skip the two blocks we don't want. The regex becomes:

(?xs)
(?: # non-capture group: the things we don't want
outer-start(?:(?!inner-start).)*?unwanted.*?outer-end # don't want this
| # OR (also don't want that)
outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
)
(*SKIP)(*F) # we don't want this, so fail and skip
| # OR capture what we want
inner-start\s*\Ktext-that-i-want(?=\s*inner-end)

See demo: it directly matches what you want.

The technique is explained in full detail in the question and article below.

Reference

Orthography answered 25/6, 2014 at 23:30 Comment(0)
D
-1

Try replacing the last .*? with: (?!(.*unwanted text.*))

Did it work?

Decalcify answered 2/1, 2010 at 23:1 Comment(1)
If you're unsure (and even if you think you're sure), you should test your pattern locally (or on a site like codepad.org), which is why regex questions need good examples (both passing and failing). - Clausen
