Regex: Matching by exclusion, without look-ahead - is it possible?

Asked 21/1, 2009 at 16:50 Answered 10/5, 2018 at 15:18

In some regex flavors, [negative] zero-width assertions (look-ahead/look-behind) are not supported.

This makes it extremely difficult (impossible?) to state an exclusion. For example "every line that does not have "foo" on it", like this:

^((?!foo).)*$

Can the same thing be achieved without using look-around at all (complexity and performance concerns set aside for the moment)?

Closing answered 21/1, 2009 at 16:50 Comment(1)

Click the "regex-negation" tag to see some similar questions. – Lilybel 9/2, 2010 at 12:59

UPDATE: It fails "with two ff before oo" as @Ciantic pointed out in the comments.

^(f(o[^o]|[^o])|[^f])*$

NOTE: It is much much easier just to negate a match on the client side instead of using the above regex.

The regex assumes that each line ends with a newline char if it is not then see C++'s and grep's regexs.

Sample programs in Perl, Python, C++, and grep all give the same output.

perl

#!/usr/bin/perl -wn
print if /^(f(o[^o]|[^o])|[^f])*$/;

python

#!/usr/bin/env python
import fileinput, re, sys
from itertools import ifilter

re_not_foo = re.compile(r"^(f(o[^o]|[^o])|[^f])*$")
for line in ifilter(re_not_foo.match, fileinput.input()):
    sys.stdout.write(line)

c++

#include <iostream>
#include <string>
#include <boost/regex.hpp>

int main()
{
  boost::regex re("^(f(o([^o]|$)|([^o]|$))|[^f])*$");
  //NOTE: "|$"s are there due to `getline()` strips newline char

  std::string line;
  while (std::getline(std::cin, line)) 
    if (boost::regex_match(line, re))
      std::cout << line << std::endl;
}

grep

$ grep "^\(f\(o\([^o]\|$\)\|\([^o]\|$\)\)\|[^f]\)*$" in.txt

Sample file:

foo
'foo'
abdfoode
abdfode
abdfde
abcde
f

fo
foo
fooo
ofooa
ofo
ofoo

Output:

abdfode
abdfde
abcde
f

fo
ofo

Jolty answered 21/1, 2009 at 17:16 Comment(10)

That's it. This is so simple that I'm ashamed not having thought of it myself. Thank you very much. – Closing 21/1, 2009 at 18:15

It's clear that post processing in the program, negating the match is the preferred method. Sometimes you don't have a choice, and even if you have, it is good to know your alternatives. – Closing 21/1, 2009 at 18:26

This regular expression is not correct. It does not match f, fo or barf. But this one does: ^(f(o([^o]|$)|[^o]|$)|[^f])*$ – Metropolitan 7/2, 2010 at 20:4

@Gumbo: See regexps used for C++ and grep (hint: they are the same). – Jolty 8/2, 2010 at 14:36

@J.F. Sebastian: Ah, you’re right. I wonder why he didn’t change the other ones. – Metropolitan 8/2, 2010 at 14:38

@Gumbo: I didn't change the other ones because the text in the answer clearly states: The regex assumes that each line _ends with a newline char_ if it is not then see C++'s and grep's regexes. Notice: "ends with a newline" and "see C++'s regexp [otherwise]". – Jolty 27/10, 2011 at 21:44

Answer does not seem to work with somethingffoosomething with two ff before oo. – Muskellunge 24/10, 2017 at 11:31

@Ciantic: you are right e.g., both perl and grep solutions shouldn't print anything (because both input lines have 'foo'). It just confirms the note at the top of the answer. – Jolty 24/10, 2017 at 11:41

There is however a way to do this, if one could just figure out how! I try to find a negation for two chars first. – Muskellunge 24/10, 2017 at 11:46

nice answer, but the fact that foo has 2 similar characters doesn't make the Q&A generic. It would be better with abc – Grendel 20/12, 2017 at 8:45

Came across this Question and took the fact that there wasn't a fully-working regex as a personal challenge. I believe I've managed to create a regex that does work for all inputs - provided you can use atomic grouping/possessive quantifiers.

Of course, I'm not sure if there are any flavours that allow atomic grouping but not lookaround, but the Question asked if it's possible in regex to state an exclusion without lookaround, and it is technically possible:

\A(?:$|[^f]++|f++(?:[^o]|$)|(?:f++o)*+(?:[^o]|$))*\Z

Explanation:

\A                         #Start of string
(?:                        #Non-capturing group
    $                      #Consume end-of-line. We're not in foo-mode.
    |[^f]++                #Consume every non-'f'. We're not in foo-mode.
    |f++(?:[^o]|$)          #Enter foo-mode with an 'f'. Consume all 'f's, but only exit foo-mode if 'o' is not the next character. Thus, 'f' is valid but 'fo' is invalid.
    |(?:f++o)*+(?:[^o]|$)  #Enter foo-mode with an 'f'. Consume all 'f's, followed by a single 'o'. Repeat, since '(f+o)*' by itself cannot contain 'foo'. Only exit foo-mode if 'o' is not the next character following (f+o). Thus, 'fo' is valid but 'foo' is invalid.
)*                         #Repeat the non-capturing group
\Z                         #End of string. Note that this regex only works in flavours that can match $\Z

If, for whatever reason, you can use atomic grouping but not possessive quantifiers nor lookaround, you can use:

\A(?:$|(?>[^f]+)|(?>f+)(?:[^o]|$)|(?>(?:(?>f+)o)*)(?:[^o]|$))*\Z

As others point out, though, it's probably more practical to just negate a match through other means.

Pentheas answered 10/5, 2018 at 15:18 Comment(5)

It's actually possible to express ^(?!.*foo) without any extended regex features :D The solution in this case is: ^([^f]|(f+o)*f+([^fo]|o([^fo]|$)|$))*$. We can even quite elegantly extend that to any arbitrary substring "foo"... I'll post a detailed write-up about this soon! – Thier 9/9, 2018 at 14:50

Reading these is made infinitely more complicated by the double 'oo' could someone who understands this please create one that does not contain that. – Worrell 7/5, 2021 at 17:43

@Worrell You want a non-lookaround regex to find, say, lines not containing "sna"? \A(?:$|[^s]++|s++(?:[^n]|$)|(?:s++n)*+(?:[^a]|$))*\Z – Pentheas 7/5, 2021 at 17:56

Is there regex builder that would take a alphanumeric string and search lines NOT containing it (not using lookaround)? What about adding additional words to the search? – Felisha 15/11, 2022 at 14:9

@JonGrah I'd suggest making a new Question to ask that. The actual practical advice would likely be the same, though - if you can't use lookaround, then just write non-regex code to do it. – Pentheas 16/11, 2022 at 17:54

I stumbled across this question looking for my own regex exclusion solution, where I am trying to exclude a sequence within my regex.

My initial reaction to this situation: For example "every line that does not have "foo" on it" was simply to use the -v invert sense of matching option in grep.

grep -v foo

this returns all lines in a file that don't match 'foo'

It's so simple I have the strong feeling I've just misread your question....

Edelweiss answered 6/8, 2009 at 17:5 Comment(4)

grep -v foo searches for "foo" and negates the result, and the OP said he wanted the regex itself to do the work. But suppose the requirement were "contains 'foo' and doesn't contain 'bar'", and you could only perform one regex match? Simply negating the result wouldn't be an option then. – Industrial 7/8, 2009 at 4:1

@Alan: True, but why the (seemingly) arbitrary one-regexp-match limit? If we're not limited to one match, then we can pipeline: grep foo <file> | grep -v bar. I bring this up because I could not figure out the examples above and make them work in Emacs, but was able to do just this on the command line. – Benyamin 14/10, 2011 at 23:48

@ZacharyYoung: Of course grep -v or equivalent is the best way to go, if available. But the OP was talking about the hypothetical situation where you can't invert the match and you can't use lookaheads. Fortunately, situations like that are extremely rare in the real world. ;) – Industrial 15/10, 2011 at 6:6

@AlanMoore extremely rare you say? Here's one, and to think it's been 7 years since the question was asked. – Synopsis 14/3, 2016 at 11:43

You can usually look for foo and invert the result of the regex match from the client code.

For a simple example, let's say you want to validate that a string contains only certain characters.

You could write that like this:

^[A-Za-z0-9.$-]*$

and accept a true result as valid, or like this:

[^A-Za-z0-9.$-]

and accept a false result as valid.

Of course, this isn't always an option: sometimes you just have to put the expression in a config file or pass it to another program, for example. But it's worth remembering. Your specific problem, for example, the expression is much simpler if you can use negation like this.

Jinx answered 21/1, 2009 at 16:57 Comment(1)

I know that post-processing would solve the problem... That's what I was trying to avoid, I was looking for a vanilla regex that does the right thing. Additionally, I am looking for something that disallows a specific sequence of characters, not an unordered set. – Closing 21/1, 2009 at 17:3

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags