Regex - Get string between two words that doesn't contain word
I've been looking around but could not get this to work. I am not a complete beginner with regex.

I need to get text delimited by (including) START and END that doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.

Example string:

abcSTARTabcSTARTabcENDabc

The expected result:

STARTabcEND

Not good:

STARTabcSTARTabcEND
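
To make the problem concrete, here is a minimal sketch in Python (chosen only for illustration; the question does not name a language). It shows that a plain lazy match between the two delimiters still starts at the first START and so produces the unwanted result.

```python
import re

text = "abcSTARTabcSTARTabcENDabc"

# Naive attempt: lazy match from START to the nearest END.
# The engine begins at the leftmost START, so the inner START is swallowed.
naive = re.search(r"START.*?END", text).group(0)
print(naive)
```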

I can't use backward search stuff. I am testing my regex here: www.regextester.com

Thanks for any advice.

Gradely answered 7/9, 2011 at 11:33 Comment(4)
What if the text is abcSTARTabcENDabcSTARTabcENDabc? Do you want both matches?Interposition
didn't think about that ... anyway, I can find second match if needed.Gradely
Better to do that in a single regex. I've added an answer.Interposition
You can test your regex at rubular.comUtrillo
B
4

The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.

Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks exceeding the torture threshold in most cultures, though.

Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this by adding S to [^ST] and adding S* before the non-optional S to allow for arbitrary repetitions of S otherwise.
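
As a rough check of the first pattern, here is a small Python sketch (Python and the helper name `first_match` are my own choices, not part of the answer). It runs the lookahead-free regex against the question's example and against the STARTSTAREND case discussed in the comments.

```python
import re

# First pattern from the answer, copied verbatim; it uses no lookaheads.
PEDESTRIAN = r"START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END"

def first_match(text):
    """Return the first full match of PEDESTRIAN in text, or None."""
    m = re.search(PEDESTRIAN, text)
    return m.group(0) if m else None

# The question's example: the match begins at the second START.
print(first_match("abcSTARTabcSTARTabcENDabc"))

# A partial prefix of START directly before END is accepted by the tail group.
print(first_match("STARTSTAREND"))

# Caveat: a partial prefix directly followed by START still slips through,
# because ST[^A] accepts the S that begins the inner START.
print(first_match("STARTxSTSTARTyEND"))
```

The last call is a caveat rather than a success: the sketch suggests that inputs such as STSTART inside the body are not yet excluded, so the character classes after ST, STA and STAR would probably need the same S treatment as the bug fix above.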

Bal answered 7/9, 2011 at 11:50 Comment(9)
Nice solution (if no lookaheads possible) +1Saponify
This is what I was looking for, thanks. Indeed ... pedestrian :) but it works. I was hoping that there might be an easier way that I am missing. Sorry for not posting back earlier.Gradely
What is the last part for? Why do you need (S(T(AR?)?)?)?Jipijapa
Okay! I get it... you need ...(S(T(AR?)?)?)?... because otherwise, you have to consume characters after S, ST, STA and STAR... This is freaking genius.Jipijapa
Not sure what you mean by that. A substring of START is allowed before the END delimiter and up through there we have been preventing these substrings from matching.Bal
I don't understand the answer. My question was why did you need to have this part (S(T(AR?)?)?)? but I think the reason is that otherwise you won't match something like STARTSTAREND. The (S(T(AR?)?)?)? lets you cleanly consume any substring of STAR that comes directly before END.Jipijapa
Yes, exactly. Earlier in the match, we allow STAR if it is followed by something which isn't T, but just before the end delimiter we also allow it to be followed by nothing. (Using "consume" in this context is a bit weird, IMHO, though.)Bal
Thanks for prodding me, I think I found a bug, though it's not directly related to this. I'll try to fix it tomorrow.Bal
See also #406730Bal
S
9

Try this

START(?!.*START).*?END

See it here online on Regexr

(?!.*START) is a negative lookahead. It ensures that the word "START" does not occur anywhere after the current position.

.*? is a non-greedy match of all characters up to the next "END". It's needed because the negative lookahead only looks ahead and does not consume anything (it is a zero-length assertion).

Update:

Having thought about it a bit more: the solution above matches up to the first "END". If this is not wanted (because you are only excluding START from the content), then use the greedy version

START(?!.*START).*END

This will match up to the last "END".
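
To see what the lookahead does in practice, here is a minimal Python sketch (Python is an assumption on my part; the answer itself is flavor-neutral). It applies the non-greedy pattern to the question's example and to two inputs with more than one START.

```python
import re

pattern = r"START(?!.*START).*?END"

# The question's example: only the last START has no START after it.
single = re.findall(pattern, "abcSTARTabcSTARTabcENDabc")

# Two well-formed pairs: the first pair is rejected because a START follows it.
pairs = re.findall(pattern, "abcSTARTabcENDabcSTARTabcENDabc")

# A trailing, unmatched START prevents any match at all.
trailing = re.findall(pattern, "abcSTARTabcENDxyzSTART")

print(single, pairs, trailing)
```

The lookahead rejects every START that has another START anywhere after it, so at most the last START in the string can begin a match.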

Saponify answered 7/9, 2011 at 11:39 Comment(4)
+1 for good answer with simple explanations of all the operatorsFloorboard
This will fail if there is more than one START...END pair in the string. (Or more precisely, it will only find the last START...END pair in the string.)Interposition
To clarify Tim's comment: your regexp will NOT match where you expect it to if there is ANY second occurrence of START, be it before or after END (e.g. abcSTARTabcENDxyzSTART will not match)Stasiastasis
Yeah, it simply asks if there is any occurrence of start in the future and if so, will not match. This is not the wanted (described) behavior.Jipijapa
I
7
START(?:(?!START).)*END

will work with any number of START...END pairs. To demonstrate in Python:

>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']

If you only care for the content between START and END, use this:

(?<=START)(?:(?!START).)*(?=END)

See it here:

>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']
Interposition answered 5/10, 2011 at 13:27 Comment(1)
Yup, This will do it. +1 (Although you may want to mention/use the s dot-matches-all flag.)Romonaromonda
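
A short sketch of the dot-matches-all point, assuming Python's `re.DOTALL` as the equivalent of the s flag (the input string is made up for illustration): without the flag, `.` stops at a newline, so a START...END pair that spans lines is not found.

```python
import re

pattern = r"START(?:(?!START).)*END"
multiline = "abcSTART\ndef\nENDghi"

# By default "." does not match "\n", so the pair spanning lines is missed.
without_flag = re.findall(pattern, multiline)

# With re.DOTALL, "." also matches newlines and the pair is found.
with_flag = re.findall(pattern, multiline, flags=re.DOTALL)

print(without_flag, with_flag)
```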
T
3

May I suggest a possible improvement on the solution of Tim Pietzcker? It seems to me that START(?:(?!START).)*?END is better in order to only catch a START immediately followed by an END without any START or END in between. I am using .NET and Tim's solution would match also something like START END END. At least in my personal case this is not wanted.
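
The difference can be reproduced with a small sketch. It uses Python rather than .NET (an assumption for illustration; the tempered-dot construct behaves the same way in both as far as I know) and compares the greedy and lazy quantifiers on an input with two END tokens.

```python
import re

text = "START END END"

# Greedy: the repeated group runs as far as it can, so the match ends at the last END.
greedy = re.findall(r"START(?:(?!START).)*END", text)

# Lazy: the repeated group stops as early as it can, so the match ends at the first END.
lazy = re.findall(r"START(?:(?!START).)*?END", text)

print(greedy, lazy)
```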

Territerrible answered 4/6, 2014 at 8:5 Comment(0)
F
0

[EDIT: I have left this post up for the information on capture groups, but the main solution I originally gave was not correct. As pointed out in the comments, (?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END) does not work: I forgot that the character matched by each negated class is consumed, so it can no longer be part of END, and the pattern fails on something such as STARTSTAEND. You would need something such as ...|STA(?![^R])| to avoid that, at which point the lookahead approach is clearly the better choice. The following shows the proper way to use the capture groups.]

The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.

That way, if you are using it to do search and replace, you can use something like BEGIN$1FINISH as the replacement. So, if you started with:

abcSTARTdefSTARTghiENDjkl

you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:

abcSTARTdefBEGINghiFINISHjkl

which would allow you to change your START/END tokens only when paired properly.

Each (x) is a group. I have written (?:x) for every group except the middle one, which marks those groups as non-capturing; the middle group is the only one left without ?:, so it is the only one captured. You could also capture the START/END tokens if you wanted to move them around or what-have-you.
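
The search-and-replace described above can be sketched in Python (my choice for illustration; the answer points to Java). Note that Python's `re.sub` writes the backreference as `\1` rather than `$1`.

```python
import re

text = "abcSTARTdefSTARTghiENDjkl"

# Group 1 captures the text between the last START and the END after it;
# the (?:...) groups around the delimiters are non-capturing.
pattern = r"(?:START)((?!.*START).*)(?:END)"

inner = re.search(pattern, text).group(1)

# Swap the paired delimiters while keeping the captured inner text.
replaced = re.sub(pattern, r"BEGIN\1FINISH", text)

print(inner, replaced)
```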

See the Java regex documentation for full details on Java regexes.

Floorboard answered 7/9, 2011 at 12:11 Comment(2)
You fail on the pattern STARTSTAEND.Bal
@Bal sigh, yes, indeed and I would need to ignore those characters with ?! which kinda defeats the whole purpose. thank you for pointing it out.Floorboard
