Regular Expressions and negating a whole character group [duplicate]

I'm attempting something that I feel should be fairly obvious, but it's not. I'm trying to match a string which does NOT contain a specific sequence of characters. I've tried using [^ab], [^(ab)], etc. The goal is to match strings containing no 'a's or 'b's, or only 'a's, or only 'b's, or 'ba', but not to match 'ab'. It's true that the examples I gave won't match 'ab', but they also won't match 'a' alone, and I need them to. Is there some simple way to do this?

Mclain answered 10/6, 2009 at 18:4 Comment(1)

@finnw maybe he was referring to it in the context of https://mcmap.net/q/110104/-not-group-in-regex/3186555? – Celloidin 21/4, 2016 at 1:24

Use negative lookahead (cf. Regexr.com explanation):

^(?!.*ab).*$
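A minimal sketch of how this pattern behaves on the inputs from the question, assuming Python's re module (the thread itself is not tied to any language, and the sample list is only illustrative):

```python
import re

# Matches a whole string only when "ab" appears nowhere in it.
NO_AB = re.compile(r"^(?!.*ab).*$")

# Sample inputs chosen for illustration; they are not from the original answer.
samples = ["", "a", "b", "ba", "bbaa", "ab", "cab", "xaby"]

for s in samples:
    # bool(...) is True when the string is accepted, i.e. it has no "ab".
    print(repr(s), bool(NO_AB.match(s)))
```

The lookahead runs once at the start of the string and scans ahead for "ab"; only if that scan fails does .* consume the text.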

UPDATE: In the comments below, I stated that this approach is slower than the one given in Peter's answer. I've run some tests since then, and found that it's actually slightly faster. However, the reason to prefer this technique over the other is not speed, but simplicity.

The other technique, described here as a tempered greedy token, is suitable for more complex problems, like matching delimited text where the delimiters consist of multiple characters (like HTML, as Luke commented below). For the problem described in the question, it's overkill.

For anyone who's interested, I tested with a large chunk of Lorem Ipsum text, counting the number of lines that don't contain the word "quo". These are the regexes I used:

(?m)^(?!.*\bquo\b).+$

(?m)^(?:(?!\bquo\b).)+$

Whether I search for matches in the whole text, or break it up into lines and match them individually, the anchored lookahead consistently outperforms the floating one.
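For anyone who wants to repeat a comparison like this, here is a rough sketch in Python. The sample text is a small stand-in I made up (the original test used a large Lorem Ipsum chunk that isn't shown), and the timings will vary by engine and input, so treat it as a starting point rather than a reproduction of the result above.

```python
import re
import timeit

# Hypothetical stand-in text; the original Lorem Ipsum sample is not available.
text = "\n".join([
    "lorem ipsum dolor sit amet",
    "quo vadis",
    "status quo ante",
    "quote of the day",
    "consectetur adipiscing elit",
] * 1000)

anchored = re.compile(r"(?m)^(?!.*\bquo\b).+$")    # one lookahead per line
tempered = re.compile(r"(?m)^(?:(?!\bquo\b).)+$")  # one lookahead per character

# Both patterns should select the same lines: those without the word "quo".
anchored_count = len(anchored.findall(text))
tempered_count = len(tempered.findall(text))
print(anchored_count, tempered_count)

# Time each pattern over the whole text; absolute numbers are machine-dependent.
for name, rx in (("anchored", anchored), ("tempered", tempered)):
    seconds = timeit.timeit(lambda: rx.findall(text), number=10)
    print(name, round(seconds, 4))
```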

Bonus answered 10/6, 2009 at 18:10 Comment(15)

I believe this is more efficient: (?:(?!ab).)* – Ida 10/6, 2009 at 18:12

Also wants to use start/end markers to enforce the check on the whole string. – Spiritualize 10/6, 2009 at 18:15

@Blixit: yes, it is. But it's also harder to read, especially for regex newbies. The one I posted will be efficient enough for most applications. – Bonus 10/6, 2009 at 18:24

@Peter: I was fixing that as you posted the comment. Anchors aren't necessary in all cases (e.g., when using Java's matches() method), but they don't hurt anything either. – Bonus 10/6, 2009 at 18:27

Don't write code aimed at newbies! If code is hard to read, leave comments/documentation so they can learn, instead of using lesser code that keeps them ignorant. – Spiritualize 10/6, 2009 at 18:29

If I had thought there would be a noticeable difference between the two approaches, I wouldn't have hesitated to recommend the faster one. On the other hand, regexes are so opaque (if not cryptic), I think it's worthwhile to break the knowledge into smaller, more manageable chunks whenever possible. – Bonus 10/6, 2009 at 18:56

In my case the second one worked, and the first one didn't. I was trying to match certain <td> .. </td> elements that contained windows somewhere between the start and end tags and not match the TD elements that didn't. I used <td(?:(?!</td>).)+</td> to find the whole TD element where <td(?!.*</td>).*</td> wouldn't work. Final regex was <td(?:(?!</td>).)+windows.*?</td> . For a good example of "breaking the knowledge into smaller chunks" see below where the explanation of the regex characters used is included in the answer. – Cade 25/3, 2013 at 0:48

@Luke: That is a very different problem. You're searching for a substring that starts with AAA and ends with BBB, that does not contain any other instances of AAA or BBB, but does contain CCC (where AAA, BBB and CCC are arbitrary multi-character sequences). This question is about matching a whole string that does not contain AAA. Peter's approach works here too, but this approach is just as valid, and a little more intuitive. – Bonus 25/3, 2013 at 15:23

If you break my problem down, before worrying about the windows part, I first needed to find one complete <td>...</td> tag, which involved, as you say, finding the middle section "that does not contain any instances of aaa or bbb" (this part pretty closely matches the question). I was just adding my experience to try to help anyone else who arrives at this page. – Cade 25/3, 2013 at 22:0

You just saved my life. I have been banging my head against a wall for an hour trying to match simple out-of-place Javascript inline comments (<code> // <comment>) except I wanted to ignore matches where the comment was at the start of the line; I only wanted to move the comment up to its own line. Thanks to you I got it complete in short order. – Derangement 19/4, 2016 at 16:51

if you are planning to not have any "ab" s at all, (?:(?:(?!.*ab).*).)* is going to be best – Celloidin 20/4, 2016 at 8:57

why is first .* needed in ^(?!.*ab).*$ ? In my use case it seems to work without it as well (I use it as (?!\d+).+ ) – Rhinoscopy 15/7, 2021 at 13:22
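A short sketch of the difference, assuming Python's re module: without the leading .* the lookahead only inspects the text at the position where it sits (here, the start of the string), so an "ab" further along goes unnoticed.

```python
import re

with_dotstar = re.compile(r"^(?!.*ab).*$")    # rejects "ab" anywhere in the string
without_dotstar = re.compile(r"^(?!ab).*$")   # rejects "ab" only at the very start

for s in ["abx", "xab", "xyz"]:
    # Compare the two verdicts side by side for each input.
    print(repr(s), bool(with_dotstar.match(s)), bool(without_dotstar.match(s)))
```

So the shorter form can appear to work when the forbidden text only ever occurs at the start of the input.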

A faster implementation would remove the negative lookahead: ^(?:[^a]*a[^b])*[^a]*$. Lookahead is expensive. Note this is hard to generalize. Read as "match not 'a' until an 'a' is seen, then match not 'b'; do logic repeatedly then finally match any string of not 'a' until the end" – Ajay 24/8, 2021 at 4:10
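A caution on the lookahead-free pattern above: as far as I can tell, as posted it rejects a string ending in 'a' (such as "a" alone) and accepts "aab". The sketch below, assuming Python's re module, brute-forces short strings to check this. The ADJUSTED pattern is my own assumption, not something from this thread; it is one possible lookahead-free alternative, which may also be useful for engines without lookaround.

```python
import re
from itertools import product

AS_POSTED = re.compile(r"^(?:[^a]*a[^b])*[^a]*$")
# Assumption (not from the thread): non-'a' characters, or a run of 'a's
# followed by something other than 'a' or 'b', then optional trailing 'a's.
ADJUSTED = re.compile(r"^(?:[^a]|a+[^ab])*a*$")

def disagreements(pattern, alphabet="abc", max_len=6):
    """Return the strings where the pattern disagrees with '"ab" not in s'."""
    bad = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            expected = "ab" not in s
            if bool(pattern.match(s)) != expected:
                bad.append(s)
    return bad

print(disagreements(AS_POSTED)[:5])  # a few inputs the posted pattern gets wrong
print(disagreements(ADJUSTED))       # inputs the adjusted pattern gets wrong
```

The check only covers short strings over a three-letter alphabet, so it is evidence rather than proof.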

In Rust look-aheads are not supported, is there an alternative? – Hirz 30/8, 2021 at 8:20

You can use a look-ahead in a middle part within your regex, to exclude a character group in a sub section of the string to match. This isn't evident from all the answers. – Tableau 5/11, 2023 at 17:51

Using a character class such as [^ab] will match a single character that is not within the set of characters. (With the ^ being the negating part).

To match a string which does not contain the multi-character sequence ab, you want to use a negative lookahead:

^(?:(?!ab).)+$

And the above expression dissected in regex comment mode is:

(?x)    # enable regex comment mode
^       # match start of line/string
(?:     # begin non-capturing group
  (?!   # begin negative lookahead
    ab  # literal text sequence ab
  )     # end negative lookahead
  .     # any single character
)       # end non-capturing group
+       # repeat previous match one or more times
$       # match end of line/string
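As a sketch of how the commented form can be used directly, here is the same expression in Python, assuming the re.VERBOSE flag in place of the inline (?x) (other engines spell comment mode differently):

```python
import re

NO_AB = re.compile(
    r"""
    ^       # match start of string
    (?:     # begin non-capturing group
      (?!   # begin negative lookahead
        ab  # literal text sequence ab
      )     # end negative lookahead
      .     # any single character
    )       # end non-capturing group
    +       # repeat previous group one or more times
    $       # match end of string
    """,
    re.VERBOSE,  # whitespace and # comments inside the pattern are ignored
)

for s in ["a", "ba", "a b", "xab", ""]:
    print(repr(s), bool(NO_AB.match(s)))
```

Because of the +, this form requires at least one character, so unlike ^(?!.*ab).*$ it does not match the empty string.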

Spiritualize answered 10/6, 2009 at 18:11 Comment(6)

Dissecting the regex was very helpful for me. Thank you. – Gast 2/11, 2016 at 19:9

..and for replacing it, probably just ^((?!ab).+)$. – Svetlanasvoboda 30/11, 2017 at 23:16

A small note. The . from the "any single character" only matches within the same line. If you need to do this with a multi-line string, you may need to replace it with (.|\n) – Heliotropin 26/2, 2020 at 2:36
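A sketch of the multi-line case in Python: here the re.DOTALL flag is an alternative to rewriting . as (.|\n). Other engines have their own equivalent, so the flag name is specific to this example.

```python
import re

PATTERN = r"^(?:(?!ab).)+$"

default = re.compile(PATTERN)            # . stops at a newline
dotall = re.compile(PATTERN, re.DOTALL)  # . also matches a newline

clean = "first line\nsecond line"   # no "ab" anywhere
dirty = "first line\ncab"           # "ab" on the second line

print(bool(default.match(clean)))   # the default form cannot get past the newline
print(bool(dotall.match(clean)))
print(bool(dotall.match(dirty)))
```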

Thanks for that - very informative. Having played with it, I think it's worth noting that the 'ab' that you've described as, and in your example is, a "literal text sequence" can in fact be a complex regular expression. So if you have a regex that matches some pattern in strings, then wrap that regex inside '^(?:(?!' and ').)+$', the resulting regex will match strings that do not contain a match for the original regex. – Apelles 9/9, 2022 at 11:43
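The wrapping idea in the comment above can be sketched as a small helper. The function name not_containing and the phone-number pattern are hypothetical, chosen only for illustration, and the sketch assumes Python's re module.

```python
import re

def not_containing(inner: str) -> re.Pattern:
    """Build a regex matching non-empty strings with no match for `inner`.

    Hypothetical helper: it wraps `inner` in '^(?:(?!' and ').)+$' as the
    comment describes. `inner` is used as-is, so it must be a valid regex.
    """
    return re.compile("^(?:(?!" + inner + ").)+$")

# Illustrative inner pattern: three digits, a hyphen, four digits.
no_phone = not_containing(r"\d{3}-\d{4}")

for s in ["call me later", "call 555-1234", "555-12"]:
    print(repr(s), bool(no_phone.match(s)))
```

An alternation inside `inner` stays scoped to the lookahead's own parentheses, so it does not need extra grouping.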

The Debug feature of RegexBuddy does a great job of illustrating how the negative lookahead works. – Myrilla 5/6, 2023 at 17:41

this fails, if the string contains whitespace, check this demo. – Schoolboy 26/1 at 21:40
