REGEXP: capture group NOT followed by
C

3

5

I need to match the following statements:

Hi there John
Hi there John Doe (jdo)

Without matching these:

Hi there John Doe is here
Hi there John is here

So I figured that this regexp would work:

^Hi there (.*)(?! is here)$

But it does not, and I am not sure why. I believe this may be caused by the capturing group (.*), so I thought that making the * operator lazy might solve the problem, but no. This regexp doesn't work either:

^Hi there (.*?)(?! is here)$

Can anyone point me in the direction of a solution?

Solution

To match a sentence without " is here" at the end (like Hi there John Doe (the second)), use this (credit to @Thorbear):

^Hi there (.*$)(?<! is here)

And for a sentence that contains some data in the middle (like Hi there John Doe (the second) is here, with John Doe (the second) being the desired data), simple grouping suffices:

^Hi there (.*?) is here$
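A quick way to check what that group captures, sketched in Python's re module. Python is an assumption here (the thread does not fix a regex flavor), and the variable names are only illustrative:

```python
import re

# Pattern from the post: lazily capture whatever sits between
# the fixed prefix "Hi there " and the fixed suffix " is here".
pattern = re.compile(r"^Hi there (.*?) is here$")

match = pattern.match("Hi there John Doe (the second) is here")
captured = match.group(1) if match else None
print(captured)

# A sentence without the suffix does not match at all.
no_suffix = pattern.match("Hi there John")
print(no_suffix)
```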

.

           ╔══════════════════════════════════════════╗
           ║▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒║
           ║▒▒▒Everyone, thank you for your replies▒▒▒║
           ║▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒║
           ╚══════════════════════════════════════════╝
Collen answered 1/8, 2012 at 14:21 Comment(3)
Will "is here" necessarily be at the end of a line, or do you want to prevent it from occurring anywhere?Ceremony
FYI, the capturing group is not relevant. The regex will match exactly the same without it, it just won't capture anything.Novella
What I want to do is write two regexps. The first should match a sentence without "is here" exactly at the end of the sentence. The solution to that is either what Thorbear wrote, ^Hi there (.*)(?<! is here)$, or what @Ceremony wrote, ^Hi there ((?! is here).)*$, but for my usage the first version is more appropriate. The second thing I want to do is find sentences that have a structure like Hi there James is here, and the solution to that is simply ^Hi there (.*) is here$. Thank you all for replying!Collen
W
4

The .* will find a match regardless of whether it is greedy or lazy, because at the end of the line there is, naturally, no " is here" following.

A solution to this could be to use a lookbehind instead (checking, from the end of the line, whether the preceding characters match " is here").

^Hi there (.*)(?<! is here)$
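As a rough check, here is that lookbehind pattern run against the four sample lines from the question. This is a sketch in Python's re module, which is an assumption (the thread does not name a flavor); Python accepts this lookbehind because it is fixed-width:

```python
import re

# Lookbehind version: after consuming the rest of the line, assert that
# the 8 characters just before the end are not " is here".
lookbehind = re.compile(r"^Hi there (.*)(?<! is here)$")

samples = [
    "Hi there John",
    "Hi there John Doe (jdo)",
    "Hi there John Doe is here",
    "Hi there John is here",
]

# Map each sample line to whether the pattern matched it.
results = {s: bool(lookbehind.match(s)) for s in samples}
for line, ok in results.items():
    print(f"{line!r}: {ok}")
```

The first two lines should match and the last two should not, which is the behavior asked for in the question.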

Edit

As suggested by Alan Moore, changing the pattern further to ^Hi there (.*$)(?<! is here) will improve its performance, because the capturing group will then gobble up the rest of the string before the lookbehind is attempted, thus saving you from unnecessary backtracking.

Worship answered 1/8, 2012 at 14:39 Comment(8)
Good point. You should be aware, however, that some regex tools won't allow testing this if they are written in Javascript as it has issues with lookbehind.Remount
+1, but I'd change the position of the anchor: (.*$)(?<! is here).Novella
@AlanMoore Any specific reasoning for why you would do that?Worship
This raises a good point. This assumes "is here" must occur at the end of a string, whereas I assumed it was not supposed to occur anywhere (allowing other characters to follow "is here").Ceremony
@BlackVegetable: You would want to use a Java-based tester anyway, because even flavors that support lookbehind tend to support it differently.Novella
@AlanMoore Good to think about. Thank you for the comment.Remount
@Thorbear: Truthfully? Because that's how I was writing it when your answer appeared. :D I try to structure my regexes to reflect what's really happening inside them, the better to avoid needless backtracking. The (.*$) says to me, "consume (and capture) the rest of the string, and whatever happens with the lookbehind, don't back off." It's probably just me being anal-retentive in this case, but it doesn't hurt anything.Novella
@AlanMoore Hehe, and I was truthfully just acting on a hunch that lookbehind might work, without giving much regard to the rest of the pattern. As such, I was also hoping you were going to provide some knowledge that your suggestion would improve performance. Since you didn't, I decided to conduct a small test, and found that your pattern provides better performance if the string doesn't match (as one would expect with backtracking); see the test code here: pastebin.com/fJQeYL2R Worship
C
3

It's not entirely clear from your example if you want to prevent " is here" from occurring anywhere or just at the end of a line. If it should not occur anywhere, try this:

^Hi there ((?! is here).)*$

Before each character, it checks to see that the next characters are not " is here".
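To illustrate the difference from an end-of-line check, here is a small sketch in Python's re module (Python is an assumption, and the "today" sample string is made up for the example). One thing to be aware of: because the capturing group sits inside the repetition, it only holds the last character it matched, not the whole name.

```python
import re

# The lookahead runs before every single character, so " is here"
# is rejected wherever it occurs, not only at the end of the line.
tempered = re.compile(r"^Hi there ((?! is here).)*$")

plain = tempered.match("Hi there John")
at_end = tempered.match("Hi there John is here")
in_middle = tempered.match("Hi there John is here today")  # hypothetical sample

print(bool(plain), bool(at_end), bool(in_middle))

# A repeated capturing group keeps only its final iteration.
last_char = plain.group(1)
print(last_char)
```

If the full name is needed, wrapping the whole repetition in an outer group, as in ((?:(?! is here).)*), is one way to capture it.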

Alternatively, if you only want to exclude it if it occurs at the very end of a line, you could use a negative lookbehind as Thorbear suggested:

^Hi there (.*)(?<! is here)$ 

As for why your expression matched all of the input lines: .* matched everything, and the lookahead (?! is here) right before $ would always succeed, because " is here" can never occur after the end of a line (nothing will be there).
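This can be confirmed with a short sketch that runs the original pattern from the question over all four sample lines. Python's re module is assumed here purely for illustration:

```python
import re

# Original attempt from the question: the negative lookahead is tested
# at the very end of the line, where nothing can follow.
original = re.compile(r"^Hi there (.*)(?! is here)$")

samples = [
    "Hi there John",
    "Hi there John Doe (jdo)",
    "Hi there John Doe is here",
    "Hi there John is here",
]

# Collect every sample the pattern accepts.
matched = [s for s in samples if original.match(s)]
print(len(matched), "of", len(samples), "lines matched")
```

All four lines match, including the two that were supposed to be rejected.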

Ceremony answered 1/8, 2012 at 14:35 Comment(0)
T
1

You don't need to solve your problem with regex, you merely need to use regex to find out if the non-intended regex matches. Of course, if you already know this and are simply looking to learn about lookaheads/lookbehinds, you can discard the rest of this answer.

If you take the regex you don't want your input strings to match:

badregex = (Hi there (.*)(is here))

This will give you a match for

Hi there John is here

So you can just put the logic at application level, where it should be (logic in regexes is a bad bad thing). A bit of pseudocode (I cba write out Java right now, but you get the idea)

// reject the unwanted form first
if (badregex.exactMatch(your_str)) {
   discardString();
   return;
}
// only then handle the wanted form
if (goodregex.exactMatch(your_str)) {
   doStuff(your_str);
}
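The same idea as a runnable sketch, in Python rather than Java. The names BAD, GOOD and extract_name are hypothetical, and the "good" pattern is an assumption, since the answer never spells it out; the "bad" pattern is a slightly simplified form of the one above:

```python
import re

# Assumed patterns: BAD is the unwanted form, GOOD is the permissive one.
BAD = re.compile(r"Hi there (.*) is here")
GOOD = re.compile(r"Hi there (.*)")


def extract_name(text):
    """Return the captured name, or None if the line should be discarded."""
    # Reject the unwanted form first, at application level.
    if BAD.fullmatch(text):
        return None
    # Only then try the permissive pattern.
    match = GOOD.fullmatch(text)
    return match.group(1) if match else None


print(extract_name("Hi there John Doe (jdo)"))
print(extract_name("Hi there John is here"))
```

Both patterns stay simple, and the "must not end with" rule lives in ordinary control flow instead of in a lookaround.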
Thirzia answered 1/8, 2012 at 14:39 Comment(5)
I get the impression the asker wasn't worried so much about this particular application as much as understanding regex though.Remount
Well, regex manuals will explain regex much better than I could ever hope to. I on the other hand, am trying to stop the OP from falling into the "put-regex-together-with-logic" trap that I've fallen into multiple times.Thirzia
Ah, fair enough. I wasn't trying to discount your answer, and you make a good point!Remount
and your comments weren't interpreted that way either :)Thirzia
@Arnab Datta, you have a point, but what I really use this for is Cucumber-based acceptance tests, so I am forced to use regexps, and I don't mind it at all. Regular expressions are complicated, but without any doubt they are a very powerful tool! Thanks for the idea though!Collen
