Regex - Get string between two words that doesn't contain word

I've been looking around and could not make this happen. I am not a total noob.

I need to get the text delimited by (and including) START and END, where the text in between doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.

Example string:

abcSTARTabcSTARTabcENDabc

The expected result:

STARTabcEND

Not good:

STARTabcSTARTabcEND
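To make the problem concrete, here is a minimal Python sketch. Both the choice of Python and the naive pattern START.*?END are assumptions for illustration; the question names neither. It shows how a plain lazy match produces the unwanted result, because the engine starts at the leftmost START.

```python
import re

text = "abcSTARTabcSTARTabcENDabc"

# Hypothetical naive attempt: lazy match from a START to the nearest END.
# Matching begins at the leftmost START, so the inner START is swallowed.
naive = re.search(r"START.*?END", text).group(0)
print(naive)  # STARTabcSTARTabcEND
```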

I can't use backward search stuff. I am testing my regex here: www.regextester.com

Thanks for any advice.

Gradely answered 7/9, 2011 at 11:33 Comment(4)

What if the text is abcSTARTabcENDabcSTARTabcENDabc? Do you want both matches? – Interposition 8/9, 2011 at 7:5

didn't think about that ... anyway, I can find second match if needed. – Gradely 5/10, 2011 at 11:54

Better to do that in a single regex. I've added an answer. – Interposition 5/10, 2011 at 13:31

You can test your regex at rubular.com – Utrillo 14/12, 2011 at 7:57

The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.
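As a quick check, the pattern can be tried on the string from the question. This is only a sketch in Python (an assumption, the question does not name a language), and the helper name first_match is made up for illustration.

```python
import re

# The lookaround-free pattern, copied verbatim from the answer.
PEDESTRIAN = r"START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END"

def first_match(text):
    """Return the first full match of PEDESTRIAN in text, or None (hypothetical helper)."""
    m = re.search(PEDESTRIAN, text)
    return m.group(0) if m else None

# The string from the question: the match starts at the second START.
print(first_match("abcSTARTabcSTARTabcENDabc"))  # STARTabcEND

# A partial delimiter (STAR) directly before END is still accepted.
print(first_match("STARTSTAREND"))  # STARTSTAREND
```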

Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks exceeding the torture threshold in most cultures, though.
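The difference between the two patterns shows up on input with more than one END. A small Python sketch (Python is an assumption; the test string and the variable names are invented for illustration):

```python
import re

# Both patterns copied verbatim from the answer.
GREEDY = r"START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END"
SHORTEST = (r"START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*"
            r"(S(T(AR?)?)?|EN?)?)END")

text = "STARTabcENDxyzEND"  # illustrative input with two END delimiters

greedy_match = re.search(GREEDY, text).group(0)
shortest_match = re.search(SHORTEST, text).group(0)

print(greedy_match)    # STARTabcENDxyzEND (runs to the last END)
print(shortest_match)  # STARTabcEND (stops at the first END)
```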

Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this by adding S to [^ST] and by adding S* before the non-optional S to allow for arbitrary repetitions of S.

Bal answered 7/9, 2011 at 11:50 Comment(9)

Nice solution (if no lookaheads possible) +1 – Saponify 7/9, 2011 at 12:0

This is what I was looking for, thanks. Indeed ... pedestrian :) but it works. I was hoping that there might be an easier way that I am missing. Sorry for not posting back earlier. – Gradely 5/10, 2011 at 11:45

What is the last part for? Why do you need (S(T(AR?)?)?)? – Jipijapa 31/5, 2017 at 7:54

Okay! I get it... you need ...(S(T(AR?)?)?)?... because otherwise, you have to consume characters after S, ST, STA and STAR... This is freaking genius. – Jipijapa 31/5, 2017 at 8:18

Not sure what you mean by that. A substring of START is allowed before the END delimiter and up through there we have been preventing these substrings from matching. – Bal 31/5, 2017 at 9:25

I don't understand the answer. My question was why did you need to have this part (S(T(AR?)?)?)? but I think the reason is that otherwise you won't match something like STARTSTAREND. The (S(T(AR?)?)?)? lets you cleanly consume any substring of STAR that comes directly before END. – Jipijapa 1/6, 2017 at 15:49

Yes, exactly. Earlier in the match, we allow STAR if it is followed by something which isn't T, but just before the end delimiter we also allow it to be followed by nothing. (Using "consume" in this context is a bit weird, IMHO, though.) – Bal 1/6, 2017 at 16:40

Thanks for prodding me, I think I found a bug, though it's not directly related to this. I'll try to fix it tomorrow. – Bal 1/6, 2017 at 16:41

See also #406730 – Bal 1/6 at 5:11

Try this

START(?!.*START).*?END

See it here online on Regexr

(?!.*START) is a negative lookahead. It ensures that the word "START" does not appear later in the string.

.*? is a non-greedy match of all characters up to the next "END". It's needed because the negative lookahead only looks ahead and does not consume anything (it is a zero-length assertion).

Update:

Thinking about it a bit more: the solution above matches up to the first "END". If that is not wanted (since only START is excluded from the content), use the greedy version

START(?!.*START).*END

This will match up to the last "END".
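For illustration, here is how the two variants behave in Python (an assumption; the answer itself links to an online tester). The second input string is invented to show the lazy/greedy difference.

```python
import re

LAZY = r"START(?!.*START).*?END"
GREEDY = r"START(?!.*START).*END"

# String from the question: the lookahead rejects the first START.
question_text = "abcSTARTabcSTARTabcENDabc"
lazy_on_question = re.search(LAZY, question_text).group(0)
print(lazy_on_question)  # STARTabcEND

# Illustrative input with two ENDs after a single START.
two_ends = "abcSTARTabcENDxyzENDabc"
lazy_match = re.search(LAZY, two_ends).group(0)
greedy_match = re.search(GREEDY, two_ends).group(0)
print(lazy_match)    # STARTabcEND
print(greedy_match)  # STARTabcENDxyzEND
```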

Saponify answered 7/9, 2011 at 11:39 Comment(4)

+1 for good answer with simple explanations of all the operators – Floorboard 7/9, 2011 at 14:8

This will fail if there is more than one START...END pair in the string. (Or more precisely, it will only find the last START...END pair in the string.) – Interposition 5/10, 2011 at 13:32

To clarify Tim's comment: your regexp will NOT match where you expect it to if there is ANY second occurrence of START, be it before or after END (e.g. abcSTARTabcENDxyzSTART will not match) – Stasiastasis 23/1, 2015 at 20:37

Yeah, it simply asks if there is any occurrence of start in the future and if so, will not match. This is not the wanted (described) behavior. – Jipijapa 1/6, 2017 at 15:31
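The limitation described in these comments can be reproduced with a short Python sketch (Python is an assumption here; the two input strings are the ones quoted in the comments on this page):

```python
import re

PATTERN = r"START(?!.*START).*?END"

# Two START...END pairs: only the last pair is found, because the
# lookahead rejects any START that has another START after it.
pairs = re.findall(PATTERN, "abcSTARTabcENDabcSTARTabcENDabc")
print(pairs)  # ['STARTabcEND'] (one match, not two)

# A stray START after the END prevents any match at all.
stray = re.search(PATTERN, "abcSTARTabcENDxyzSTART")
print(stray)  # None
```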

START(?:(?!START).)*END

will work with any number of START...END pairs. To demonstrate in Python:

>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']

If you only care for the content between START and END, use this:

(?<=START)(?:(?!START).)*(?=END)

See it here:

>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']

Interposition answered 5/10, 2011 at 13:27 Comment(1)

Yup, this will do it. +1 (Although you may want to mention/use the s dot-matches-all flag.) – Romonaromonda 5/10, 2011 at 15:18
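A sketch of what the dot-matches-all suggestion changes, using Python's re.DOTALL flag (the multi-line input string is invented for illustration):

```python
import re

PATTERN = r"START(?:(?!START).)*END"
text = "abcSTARTde\nfENDghi"  # illustrative input with a newline between the delimiters

# By default "." does not match a newline, so nothing is found.
without_flag = re.findall(PATTERN, text)
print(without_flag)  # []

# With re.DOTALL the dot also matches "\n".
with_flag = re.findall(PATTERN, text, re.DOTALL)
print(with_flag)  # ['STARTde\nfEND']
```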