A Regex that will never be matched by anything
P

30

167

What does a regex look like that will never be matched by any string, ever?

Edit: Why do I want this? Firstly, because I find it interesting to think of such an expression, and secondly, because I need it for a script.

In that script I define a dictionary as Dictionary<string, Regex>. This contains, as you see, a string and an expression.

Based on that dictionary I create methods that all use it as their only reference for how they should do their work. One of them matches the regexes against a parsed logfile.

If an expression matches, the value it returns is added to another Dictionary<string, long>. To catch any log messages that are not matched by an expression in the dictionary, I created a new group called "unknown".

Everything that didn't match anything else is added to this group. But to prevent the "unknown" expression from accidentally matching a log message, I had to create an expression that is certain never to match, no matter what string I give it.
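A minimal Python sketch of the setup described above. The `PATTERNS` dictionary, its group names, and the `classify` helper are hypothetical and only illustrate the idea; the original script used `Dictionary<string, Regex>`. The "unknown" entry uses a never-matching pattern, so it only receives lines that fall through every other group.

```python
import re
from collections import Counter

# Hypothetical group names and patterns, for illustration only.
NEVER = re.compile(r"(?!)")  # empty negative lookahead: fails at every position
PATTERNS = {
    "error": re.compile(r"\bERROR\b"),
    "warning": re.compile(r"\bWARN(?:ING)?\b"),
    "unknown": NEVER,  # must never match a log line by accident
}

def classify(lines):
    """Count lines per group; lines matched by no pattern go to 'unknown'."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break
        else:
            # No pattern matched, including the never-matching one.
            counts["unknown"] += 1
    return counts
```

Because the "unknown" pattern can never match, it can stay in the same dictionary as the real patterns without any special-casing in the loop.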

Pontonier answered 12/11, 2009 at 15:46 Comment(19)
Note that it is very hard to prove a negative.Jacqulynjactation
I don't think it should have been closed, but I don't want to vote to reopen without some background info.Premillenarian
Well, as you see: "Not a real question"... don't know more either as it was a real question that I actually needed for a project...Pontonier
@ApoY2k if you specify the use case, it might get reopened.Paint
I'll try to, still it's quite specific...Pontonier
Regex actually isn't the same in Python / JavaScript / PHP / whatever, so in what language are you trying to use it?Carcajou
also, post the language in the title and tags, so that you can be helped out easier.Carcajou
Interesting. Where would you use such a regex?Untutored
yoda, as mentioned above, I'm using Python and Javascript, but don't want to limit the responses only to those languages. I'd be happy to examine solutions in other syntaxes, and believe I could port the solution easily enough. Yes, different languages aren't identical, but they're pretty darn close in this area, 9 times out of 10.Musing
Charlie, the most recent use case was where I'm building a regex programmatically, with groups like (foo|bar|baz) built from external inputs. I want all groups to be present, but some may have no external input and should thus never match. If I do nothing, I'll have empty groups of () which, at least in Python, match between every character. I want all groups to be present so a regex.sub() call with a callback routine performing the replacement can be simplified, both for readability and for performance in a loop.Musing
this has been asked #1723682Paint
Why the complexity tag? I cannot see how it applies here.Padauk
I'll note here for the record that many of the comments above, and answers to this question, were originally from #1845578 which is one I asked. Marc Gravell merged them, which I think makes many of these responses kind of bizarre without the precise original context, to the point some comments don't appear to make sense. (Probably also steals away potential future rep points, too.) I would suggest that questions with such detailed backgrounds could never be "exact duplicates". Whatever...Musing
here's another reason to use this, specific to perl: it's something to put on one side of the conditional regex construct if ones of the results you want is "don't match at all". e.g. s/(?(?{ defined $ENV{FOO} })foo|(*F))/bar/g, "substitute bar for foo if $FOO, otherwise do nothing"Corot
This question has been added to the Stack Overflow Regular Expressions FAQ, under "Advanced Regex-Fu".Remake
@CharlieSalts: I need this because I have a class that captures all input lines starting with the line matching regex 1, ending with the one matching regex 2. If regex 2 is impossible-to-match, this allows them to get all lines after the start-line. Sort of like the {min,max} and {min,} regex quantifiers.Remake
"Note that it is very hard to prove a negative" -- this is widely believed yet utterly and obviously false ... as we've known at least since Euclid proved that there's no greatest prime. And any proof of P is a proof of the negation of (not P). What is true is that it's difficult to prove an empirical universal, positive or negative, e.g., "all ravens are black" or "no raven is white". Algorithms are analytical, not empirical, so this is a particularly bad misapplication of the bogus rule. e.g., a proof that the pattern 'a' doesn't match any string that starts with 'b' is not "very hard".Mitsukomitt
" ^.{0}$ " I believe that this would probably be the only expression actually ever capable of never matching anything. Due to the simple fact that if it does actually have the possibility of matching anything, than it also means that there is nothing being matched.Toaster
Even better and simpler than that one. ` ^{0} ` No invalid syntax errors and even if you added any character to the end it wouldn't matter. ` ^{0}.* ` with or without anchoring $ at the end.Toaster
B
79

This is actually quite simple, although it depends on the implementation / flags*:

$a

Will match a character a after the end of the string. Good luck.

WARNING:
This expression is expensive -- it will scan the entire line, find the end-of-line anchor, and only then not find the a and return a negative match. (See comment below for more detail.)


* Originally I did not give much thought to multiline-mode regexps, where $ also matches the end of a line. In fact, it would match the empty string right before the newline, so an ordinary character like a can never appear after $.
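A quick check of the footnote in Python (an assumption on my part; the answer itself does not name an engine). The sample strings are made up for illustration, and `ever_matches` is a hypothetical helper.

```python
import re

samples = ["", "a", "$a", "a\na", "line one\nanother line\n"]

def ever_matches(pattern, flags=0):
    """Return True if the pattern matches anywhere in any sample string."""
    regex = re.compile(pattern, flags)
    return any(regex.search(s) for s in samples)

# In Python's re, $ is always an anchor, so "$a" asks for an 'a' after a
# position that is followed only by a newline or by nothing at all.
default_hit = ever_matches(r"$a")
multiline_hit = ever_matches(r"$a", re.MULTILINE)
print(default_hit, multiline_hit)
```

Other engines may treat `$` differently (see the POSIX BRE comment below), so this only speaks for Python's `re`.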

Backfill answered 12/11, 2009 at 15:46 Comment(11)
This expression is expensive -- it will scan the entire line, find the end-of-line anchor, and only then not find the "a" and return a negative match. I see it take ~480ms to scan a ~275k line file. The converse "a^" takes about the same time, even if it might seem more efficient. On the other hand, a negative lookahead need not scan anything: "(?!x)x" (anything not followed by an x also followed by an x, i.e. nothing) takes about 30ms, or less than 7% of the time. (Measured with gnu time and egrep.)Berbera
In Perl that will match the current value of $a. Its Perl equivalent $(?:a) is also very slow: perl -Mre=debug -e'$_=a x 50; /$(?:a)/'.Utilitarian
@Berbera , please see my answer regarding timing, as I found the exact opposite measured with timeit and python3.Lossa
It's not shocking that six years and a major version of Python might change things.Berbera
Here's a JavaScript comparison of some methods covered here: jsperf.com/regex-that-never-matchesNatalyanataniel
In POSIX BRE syntax, $a will match the literal text $a, because $ is invalid as an anchor in that pattern.Thwart
What about a^? It shouldn't match anything either and it is in the beginning of the string.Laburnum
@VladimirKondenko IIRC It will still scan the string looking for as, but ^o^ would work, I guess.Lanciform
I tried using "a^", but this does not work: That string matches itself! Apparently a ^ that's not at the beginning of a regular expression matches a normal ^ character. Instead, I found this to work: "^(?!x)x"Shanel
An anchor isn't an anchor if it isn't being used as an anchor. Inside a regular expression, the pattern ^.*$ specifies that the characters [$^] are being used for their "special purposes" of matching the empty positions on their designated sides. Notice how $.*^ won't completely disregard and overpower the syntax rules, anchoring themselves flipped inside out and dumping core. If the ^ isn't the first character of the regular expression, nor the first character of a negated range such as [^a-z], then it has no special meaning anywhere else. Likewise with $, but as the last character of the regex.Toaster
echo '$a$a$a$a$a$a' | grep '$a' Am I doing something wrong maybe?Toaster
L
87

Leverage negative lookahead:

>>> import re
>>> x=r'(?!x)x'
>>> r=re.compile(x)
>>> r.match('')
>>> r.match('x')
>>> r.match('y')

This RE is a contradiction in terms and therefore will never match anything.

NOTE:
In Python, re.match() implicitly adds a beginning-of-string anchor (\A) to the start of the regular expression. This anchor is important for performance: without it, the entire string will be scanned. Those not using Python will want to add the anchor explicitly:

\A(?!x)x
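One use mentioned in the comments is building groups like (foo|bar|baz) from external inputs, where a group with no inputs should still be present but never match. A minimal sketch of that idea, assuming a hypothetical `build_group` helper:

```python
import re

NEVER_MATCH = r"(?!x)x"  # the contradiction from the answer above

def build_group(words):
    """Return a capturing group for the given words, or one that never matches."""
    if not words:
        return "(" + NEVER_MATCH + ")"
    return "(" + "|".join(re.escape(w) for w in words) + ")"

# Both groups are always present, even though the second has no inputs.
pattern = re.compile(build_group(["foo", "bar"]) + "|" + build_group([]))
match = pattern.search("a bar here")
```

Here `match.group(1)` holds the matched word and `match.group(2)` stays `None`, so a callback passed to `sub()` can rely on a fixed number of groups.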
Lost answered 4/12, 2009 at 5:46 Comment(7)
@Chris, yep -- also, (?=x)(?!x) and so on (concatenations of contradictory lookaheads, and same for lookbehinds), and many of those also work for arbitrary values of x (lookbehinds need xs that match strings of fixed-length).Lost
Appears to work well. But what about just (?!) instead? Since () will always match, wouldn't (?!) be guaranteed never to match?Musing
@Peter, yes, if Python accepts that syntax (and recent releases appear to), then it would be self-contradictory as well. Another idea (not quite as elegant, but the more ideas you get the likelier you are to find one working across all RE engines of interest): r'a\bc', looking for a word-boundary immediately surrounded by letters on both sides (variant: nonword characters on both sides).Lost
(?!) does seem to work with Python, but not with Javascript (in Firefox 3.5). Modifying it to (?!()) works for both. I'm not sure I'd want to rely on it in JS though. Performance could also be a consideration... some of these may be relatively slow, if they test at every character, while others may short-circuit the test.Musing
Interestingly, my original with a simple literal that I "know" won't appear in my input turns out to be fastest, in Python. With a 5MB input string, and using this in a sub() operation, (?!x)x takes 21% longer, (?!()) is 16%, and ($^) 6% longer. May be significant in some cases, though not in mine.Musing
That can be quite slow perl -Mre=debug -e'$_=x x 8; /(?!x)x/'. You can make it faster by anchoring it at the beginning \A(?!x)x or at the end (?!x)x\z. perl -Mre=debug -e'$_=x x 8; /(?!x)x\z/; /\A(?!x)x/'Utilitarian
@Brad Gilbert, It turns out that Python's re.match() implicitly adds the \A to the beginning of the regexp. The rest of us, as you note, need to add the \A explicitly.Maternal
T
71

One that was missed:

^\b$

It can't match because the empty string doesn't contain a word boundary. Tested in Python 2.5.
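A small check of this pattern in a current Python 3, under the default flags and under `re.MULTILINE`. The sample strings are arbitrary, and a handful of samples is of course not a proof.

```python
import re

samples = ["", "a", "\n", "a\n\nb", "word\n", " "]
flag_sets = {"default": 0, "multiline": re.MULTILINE}

results = {}
for name, flags in flag_sets.items():
    regex = re.compile(r"^\b$", flags)
    # Collect every sample the pattern matches somewhere.
    results[name] = [s for s in samples if regex.search(s)]

print(results)
```

The reasoning is that a position which is both a line start and a line end sits between newlines or string edges, none of which are word characters, so `\b` has nothing to hold on to.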

Trichromatic answered 20/2, 2010 at 17:29 Comment(3)
This is the best answer. It doesn't use lookaheads, doesn't break under some regex implementations, doesn't use a specific character (e.g. 'a'), and fails in a maximum of 3 processing steps (according to regex101.com) without scanning the whole input string. This is also easy to understand at a glance.Dowable
This actually fails in Emacs in certain conditions (if there is a blank line at the start or end of the buffer), however \`\b\' works, which is substituting the Emacs syntax for "beginning/end of text" (as opposed to "beginning/end of line").Thwart
\A\b\Z should be more performant in the case where MULTILINE flag is being usedFides
P
37

look around:

(?=a)b

For regex newbies: The positive look ahead (?=a) makes sure that the next character is a, but doesn't change the search location (or include the 'a' in the matched string). Now that next character is confirmed to be a, the remaining part of the regex (b) matches only if the next character is b. Thus, this regex matches only if a character is both a and b at the same time.

Paint answered 12/11, 2009 at 15:46 Comment(1)
🆎... your move.Larner
I
33

a\bc, where \b is a zero-width expression that matches word boundary.

It can't appear in the middle of a word, which we force it to.

Intensive answered 12/11, 2009 at 15:46 Comment(1)
If your use-case allows you to anchor the pattern to the beginning of the string, then that enhancement will prevent the regexp engine from searching for and testing every instance of an a in the text.Thwart
A
24

$.

.^

$.^

(?!)
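The comments on this answer note that the first three patterns depend on the flags in effect. A small Python sketch of that caveat (the sample string is arbitrary): with `re.DOTALL` and `re.MULTILINE` set, `.` can consume a newline and the anchors match on either side of it, while `(?!)` still fails.

```python
import re

text = "first\nsecond"
flags = re.DOTALL | re.MULTILINE

outcome = {}
for pattern in (r"$.", r".^", r"$.^", r"(?!)"):
    plain = re.search(pattern, text)
    flagged = re.search(pattern, text, flags)
    # Record (matched with default flags, matched with DOTALL|MULTILINE).
    outcome[pattern] = (plain is not None, flagged is not None)

for pattern, (plain_hit, flagged_hit) in outcome.items():
    print(f"{pattern!r}: default={plain_hit}, DOTALL|MULTILINE={flagged_hit}")
```

If the flags are not under your control, `(?!)` is the safer choice among these four in Python.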

Argil answered 4/12, 2009 at 5:52 Comment(6)
Cute! My subconscious steered me away from ideas like the first three, as they're "illegal"... conceptually, but obviously not to the regex. I don't recognize the (!) one... will have to look that one up.Musing
Okay then, I like the (?!) answer... effectively what Alex suggested. Note that in stackoverflow.com/questions/1723182 (pointed out by Amarghosh above) someone claims "some flavours" of regex would consider that a syntax error. Python likes it fine though. Note that your other suggestions would all fail with re.DOTALL|re.MULTILINE modes in Python.Musing
Has this been tested? I would have assumed that ^ only has special meaning as the first character of a regexp, and $ only has special meaning at the end of a regexp, unless the regular expression is a multi-line expression.Ptosis
Actually in Perl /$./ means something entirely different. It means match the current value of $. (input line number). Even /$(.)/ could match something if you wrote use re '/s'; before it. (perl -E'say "\n" =~ /$(.)/s || 0')Utilitarian
In POSIX BRE syntax, ^ and $ are only special at the beginning and end (respectively) of the pattern, so none of $. or .^ or $.^ would work. (?!) is a Perl/PCRE feature, I believe.Thwart
How about $^ ?Dropkick
D
16
\B\b

\b matches word boundaries - the position between a letter and a non-letter (or the string boundary).
\B is its complement - it matches the position between two letters or between non-letters.

Together they cannot match any position.
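The complement argument can be checked position by position. A short Python sketch, assuming a hypothetical `positions_matching_both` helper and an arbitrary sample string:

```python
import re

BOUNDARY = re.compile(r"\b")
NON_BOUNDARY = re.compile(r"\B")

def positions_matching_both(text):
    """Return the positions where both \\b and \\B match."""
    return [
        pos
        for pos in range(len(text) + 1)
        if BOUNDARY.match(text, pos) and NON_BOUNDARY.match(text, pos)
    ]

sample = "some words, and -- punctuation!"
both = positions_matching_both(sample)
never = re.search(r"\B\b", sample)
```

Since no position satisfies both assertions, the combined pattern has nowhere to match; as the comments below point out, the engine may still test every position unless the pattern is anchored.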


Disinter answered 31/1, 2011 at 11:19 Comment(5)
This seems like an excellent solution, provided it's anchored to a specific point (the beginning of the text would seem sensible). If you don't do that then it's a terrible solution, because every non-word boundary in the text will be tested to see if it is followed by a word boundary! So the sensible version would be something like ^\B\b. In languages where "beginning of text" and "beginning of line" have different syntax, you would want to use the "beginning of text" syntax, otherwise you'll be testing every line. (e.g. in Emacs this would be \`\B\b or "\\`\\B\\b".)Thwart
That said, I've now noted that the stated purpose of this question is to obtain a regexp for use in a group, in which case ^ is problematic in certain regexp syntax (e.g. POSIX BRE) where ^ is only an anchor when it's the first character of the pattern, and otherwise matches a literal ^ character.Thwart
@Thwart - I think you're overthinking it :) - this is a non-practical question, where the goal was to find an interesting answer - not an efficient answer. That said, the pattern can be rejected in linear time (with the size of the target string), so it isn't bad for a regex - most patterns here are the same, and even ^ might be linear if it isn't optimized.Disinter
Re: optimisations, I'm willing to ignore a regexp engine which hopes to find "the beginning of the text" at any other position :)Thwart
Also, it's not such an impractical Q&A -- the sole reason I ended up here was to see if anyone could suggest a more efficient solution to my own for the practical purpose of configuring a particular Emacs variable which required a regexp value, but which I wanted to effectively disable.Thwart
U
13

Maximal matching

a++a

At least one a followed by any number of a's, without backtracking. Then try to match one more a.

or Independent sub expression

This is equivalent to putting a+ in an independent sub expression, followed by another a.

(?>a+)a
Utilitarian answered 12/11, 2009 at 15:46 Comment(0)
D
11

How about $^ or maybe (?!)
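As the comments on this answer point out, $^ is not safe everywhere. A short Python check (sample strings are arbitrary) of two such cases: the empty string, and an empty line in `re.MULTILINE` mode.

```python
import re

# In the empty string, position 0 is both the end and the start.
empty_match = re.search(r"$^", "")

# Under the default flags a non-empty string offers no such position...
default_match = re.search(r"$^", "a\n\nb")

# ...but with MULTILINE the empty line between the two newlines qualifies.
multiline_match = re.search(r"$^", "a\n\nb", re.MULTILINE)

print(empty_match, default_match, multiline_match)
```

So in Python this pattern only works as a never-matching regex when the input is known to be non-empty and MULTILINE is off.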

Dyne answered 12/11, 2009 at 15:46 Comment(4)
A line break will be matched by this expression in the mode where ^ matches the begin and $ the end of a line.Banner
Maybe he meant (?!) - a negative lookahead for an empty string. But some regex flavors will treat that as a syntax error, too.Satiety
An empty string matches the first, at least in JavaScript.Tiffa
In POSIX BRE syntax, $^ will match those literal characters, because the characters are invalid as anchors (i.e. the very reason you used the pattern causes it to not do what you wanted.)Thwart
W
11

Perl 5.10 supports special control words called "verbs", which are enclosed in a (*...) sequence. (Compare with the (?...) special sequence.) Among them is the (*FAIL) verb, which makes the regular expression fail immediately.

Note that verbs were also implemented in PCRE shortly afterwards, so you can use them in PHP or other languages that use the PCRE library too. (You cannot in Python or Ruby, however; they use their own engines.)

Wehrle answered 5/12, 2009 at 6:53 Comment(1)
The docs for that at perldoc.perl.org/perlre.html#%28%2AFAIL%29-%28%2AF%29 say "This pattern matches nothing and always fails. It is equivalent to (?!), but easier to read. In fact, (?!) gets optimised into (*FAIL) internally." Interesting, as (?!) is my favourite "pure" answer so far (even though it doesn't work in Javascript). Thanks.Musing
P
7

This seems to work:

$.
Prosperous answered 12/11, 2009 at 15:46 Comment(4)
That’s similar to Ferdinand Beyer’s example.Banner
And it will match in dot-matches-newlines mode.Snapper
In Perl that will actually match against the current input line number $.. In that case you have to resort to $(.) or more equivalently $(?:.).Utilitarian
In POSIX BRE syntax, $. will match a literal $ followed by any character, because $ is invalid as an anchor in that pattern.Thwart
P
5

Empty regex

The best regex to never match anything is an empty regex. But I'm not sure all regex engines will accept that.

Impossible regex

The other solution is to create an impossible regex. I found that $-^ only takes two steps to compute regardless of the size of your text (https://regex101.com/r/yjcs1Z/1).

For reference:

  • $^ and $. take 36 steps to compute -> O(1)
  • \b\B takes 1507 steps on my sample, and the count increases with the number of characters in your string -> O(n)


Preface answered 12/11, 2009 at 15:46 Comment(0)
P
5

So many good answers!

Similar to @nivk's answer, I would like to share a performance comparison in Perl of different variants of never-matching regexes.

  1. Input: pseudo-random ascii strings (25,000 different lines, length 8-16):

Regex speed:

Total for   \A(?!x)x: 69.675450 s, 1435225 lines/s
Total for       a\bc: 71.164469 s, 1405195 lines/s
Total for    (?>a+)a: 71.218324 s, 1404133 lines/s
Total for       a++a: 71.331362 s, 1401907 lines/s
Total for         $a: 72.567302 s, 1378031 lines/s
Total for     (?=a)b: 72.842308 s, 1372828 lines/s
Total for     (?!x)x: 72.948911 s, 1370822 lines/s
Total for       ^\b$: 79.417197 s, 1259173 lines/s
Total for         $.: 88.727839 s, 1127041 lines/s
Total for       (?!): 111.272815 s, 898692 lines/s
Total for         .^: 115.298849 s, 867311 lines/s
Total for    (*FAIL): 350.409864 s, 285380 lines/s
  2. Input: /usr/share/dict/words (100,000 English words).

Regex speed:

Total for   \A(?!x)x: 128.336729 s, 1564805 lines/s
Total for     (?!x)x: 132.138544 s, 1519783 lines/s
Total for       a++a: 133.144501 s, 1508301 lines/s
Total for    (?>a+)a: 133.394062 s, 1505479 lines/s
Total for       a\bc: 134.643127 s, 1491513 lines/s
Total for     (?=a)b: 137.877110 s, 1456528 lines/s
Total for         $a: 152.215523 s, 1319326 lines/s
Total for       ^\b$: 153.727954 s, 1306346 lines/s
Total for         $.: 170.780654 s, 1175906 lines/s
Total for       (?!): 209.800379 s, 957205 lines/s
Total for         .^: 217.943800 s, 921439 lines/s
Total for    (*FAIL): 661.598302 s, 303540 lines/s

(Ubuntu on Intel i5-3320M, Linux kernel 4.13, Perl 5.26)

Proptosis answered 12/11, 2009 at 15:46 Comment(1)
Here's a JavaScript comparison of some methods covered here: jsperf.com/regex-that-never-matchesNatalyanataniel
A
5

This won't work for Python, and many other languages, but in a Javascript regex, [] is a valid character class that can't be matched. So the following should fail immediately, no matter what the input:

var noMatch = /^[]/;

I like it better than /$a/ because to me, it clearly communicates its intent. And as for when you would ever need it, I needed it because I needed a fallback for a dynamically compiled pattern based on user input. When the pattern is invalid, I need to replace it with a pattern that matches nothing. Simplified, it looks like this:

try {
    var matchPattern = new RegExp(someUserInput);
}
catch (e) {
    matchPattern = noMatch;
}
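The same fallback idea, sketched in Python for comparison. Python's `re` rejects `[]`, so this sketch substitutes `(?!)` as the never-matching pattern; `compile_or_never` is a hypothetical helper name.

```python
import re

NO_MATCH = re.compile(r"(?!)")  # stands in for /^[]/, which Python does not accept

def compile_or_never(user_input):
    """Compile user input, falling back to a pattern that matches nothing."""
    try:
        return re.compile(user_input)
    except re.error:
        # Invalid pattern: use the never-matching fallback instead.
        return NO_MATCH

good = compile_or_never(r"\d+")
bad = compile_or_never(r"(unclosed")
```

Callers can then use the returned object without checking whether the user's pattern was valid.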
Aoudad answered 12/11, 2009 at 15:46 Comment(0)
J
5

The fastest will be:

r = re.compile(r'a^')
r.match('whatever')

'a' can be any non-special character ('x', 'y'). Knio's implementation might be a bit more pure, but this one will be faster for all strings that don't start with the character you choose in place of 'a', because in those cases it fails after the first character rather than after the second.

Jumper answered 4/12, 2009 at 21:35 Comment(5)
Indeed, (.^) would be roughly 10% slower than (\x00^) in my case.Musing
I'm accepting this, since using any value other than \n as the character is guaranteed never to match, and I see it as slightly more readable (given that relatively few people are regex experts) than the (?!x)x option, though I voted that one up too. In my case, for either option I would need a comment to explain it, so I think I'll just adjust my original attempt to '\x00NEVERMATCHES^'. I get the no-match guarantee of this answer, with my original self-documenting-ness. Thanks to all for answers!Musing
Does this actually work and if so, who decided to break with Unix? In Unix regexps, ^ is special only as the first character and similarly with $. With any Unix tool, that regexp is going to match anything containing the literal string a^.Favour
Heh, that's a good attack. I never tested against that literal string.Jumper
Oh if that breaks Unix regexps, then you'll love >^.Dowable
R
4

Python won't accept it, but Perl will:

perl -ne 'print if /(w\1w)/'

This regex should (theoretically) try to match an infinite (even) number of ws, because the first group (the ()s) recurses into itself. Perl doesn't seem to be issuing any warnings, even under use strict; use warnings;, so I assume it's at least valid, and my (minimal) testing fails to match anything, so I submit it for your critique.

Roguery answered 4/12, 2009 at 5:48 Comment(3)
Theory is always nice, but in practice I think I'd be worried about regular expressions whose descriptions included the word "infinite"!Musing
perl -Mre=debug -e'"www wwww wwwww wwwwww" =~ /(w\1w)/'Utilitarian
@BradGilbert - Running that here (5.10, a bit out of date) produces "regex failed", as the OP requested. Does it match on your system?Roguery
K
4

[^\d\D] or (?=a)b or a$a or a^a

Krucik answered 19/12, 2009 at 20:16 Comment(2)
Thanks. Note that (?!x)x was the first answer given, listed above.Musing
Yes, it seemed I scanned the other answerers too quickly.Krucik
K
3

As others have mentioned, it depends on the regular expression engine, and of course a performance benchmark depends on many things, including the device.

But as a reference for performance in ECMAScript (Java/Javascript) or PCRE (PHP), the ranking from best to worst is:

  1. [] | ^[] (Fastest) [Just ECMAScript]
  2. $^ (non-multi-line flag) (Fast)
  3. [^\S\s] | ^[^\S\s] | ^[^\W\w] | ^[^\D\d] (Fast)
  4. .^ (non-multi-line flag) (Fast)
  5. (?!\x00)\x00 | ^(?!\x00)\x00 | (?!\0)\0 (Fast)
  6. (?!a)a
  7. (?!) (Slow)
  8. (?=b)a (Slow)
  9. Other examples like \b\B etc... (Slowest)

A live try for Javascript (Not so accurate)

Note: ^ = \A (PCRE) = at start (non-multi-line)

Kerril answered 12/11, 2009 at 15:46 Comment(3)
Very interesting thanks. Please also add my one: \A[^\w\W]Fides
@Fides you're welcome. About your one, \A (PCRE) = ^ (PCRE | ECMAScript) also [\W\w] = [\S\s] = [\D\d]. So i just mention it, but not adding in the benchmark example.Kerril
Thanks @Kerril I think \A is better because it won't match multiple times in multiline modeFides
R
3

All the examples involving a boundary matcher follow the same recipe. Recipe:

  1. Take any of the boundary matchers: ^,$,\b,\A,\Z,\z

  2. Do the opposite of what they are meant for

Examples:

^ and \A are meant for the beginning, so don't use them at the beginning

^ --> .^
\A --> .\A

\b matches a word boundary so use it in between

\b --> .\b.

$, \Z and \z are meant for the end, so don't use them at the end

$ --> $.
\Z --> \Z.
\z --> \z.

Others use lookahead and lookbehind, which work by the same analogy. Give a positive or negative lookahead followed by something opposite:

(?=x)[^x]
(?!x)x

Or give a positive or negative lookbehind following something opposite:

[^x](?<=x)
x(?<!x)
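The four lookaround recipes above can be tried in Python, which supports fixed-width lookbehind. This is only a sanity check over a few arbitrary sample strings, not a proof.

```python
import re

contradictions = [
    r"(?=x)[^x]",   # next char must be x, then consume a non-x
    r"(?!x)x",      # next char must not be x, then consume an x
    r"[^x](?<=x)",  # consume a non-x, then require the previous char to be x
    r"x(?<!x)",     # consume an x, then require the previous char not to be x
]

samples = ["", "x", "xx", "xyx", "yxy", "no such letter"]

# Every (pattern, sample) pair where a match was found.
matched = [
    (pattern, s)
    for pattern in contradictions
    for s in samples
    if re.search(pattern, s)
]
print(matched)
```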

There could be more such patterns and more such analogies.

Roscoeroscommon answered 12/11, 2009 at 15:46 Comment(0)
L
3

After seeing some of these great answers, @arantius's comment (regarding timing $x vs x^ vs (?!x)x) on the currently accepted answer made me want to time some of the solutions given so far.

Using @arantius's 275k line standard, I ran the following tests in Python (v3.5.2, IPython 6.2.1).

TL;DR: 'x^' and 'x\by' are the fastest by a factor of at least ~16, and contrary to @arantius's finding, (?!x)x was among the slowest (~37 times slower). So the speed question is certainly implementation dependent. Test it yourself on your intended system before committing if speed is important to you.

UPDATE: There is apparently a large discrepancy between timing 'x^' and 'a^'. Please see this question for more info, and the previous edit for the slower timings with a instead of x.

In [1]: import re

In [2]: with open('/tmp/longfile.txt') as f:
   ...:     longfile = f.read()
   ...:     

In [3]: len(re.findall('\n',longfile))
Out[3]: 275000

In [4]: len(longfile)
Out[4]: 24733175

In [5]: for regex in ('x^','.^','$x','$.','$x^','$.^','$^','(?!x)x','(?!)','(?=x)y','(?=x)(?!x)',r'x\by',r'x\bx',r'^\b$'
    ...: ,r'\B\b',r'\ZNEVERMATCH\A',r'\Z\A'):
    ...:     print('-'*72)
    ...:     print(regex)
    ...:     %timeit re.search(regex,longfile)
    ...:     
------------------------------------------------------------------------
x^
6.98 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
------------------------------------------------------------------------
.^
155 ms ± 960 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$x
111 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$.
111 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$x^
112 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$.^
113 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$^
111 ms ± 839 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
(?!x)x
257 ms ± 5.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
------------------------------------------------------------------------
(?!)
203 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
(?=x)y
204 ms ± 4.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
------------------------------------------------------------------------
(?=x)(?!x)
210 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
------------------------------------------------------------------------
x\by
7.41 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
------------------------------------------------------------------------
x\bx
7.42 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
------------------------------------------------------------------------
^\b$
108 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
\B\b
387 ms ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
------------------------------------------------------------------------
\ZNEVERMATCH\A
112 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
\Z\A
112 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The first time I ran this, I forgot to raw the last 3 expressions, so '\b' was interpreted as '\x08', the backspace character. However, to my surprise, 'a\x08c' was faster than the previous fastest result! To be fair, it will still match that text, but I thought it was still worth noting because I'm not sure why it's faster.

In [6]: for regex in ('x\by','x\bx','^\b$','\B\b'):
    ...:     print('-'*72)
    ...:     print(regex, repr(regex))
    ...:     %timeit re.search(regex,longfile)
    ...:     print(re.search(regex,longfile))
    ...:     
------------------------------------------------------------------------
y 'x\x08y'
5.32 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
None
------------------------------------------------------------------------
x 'x\x08x'
5.34 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
None
------------------------------------------------------------------------
$ '^\x08$'
122 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
None
------------------------------------------------------------------------
\ '\\B\x08'
300 ms ± 4.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
None
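
The raw versus non-raw difference described above can be checked directly. This is only a minimal sketch using the standard `re` module; the sample strings are made up for illustration and are not taken from the test file.

```python
import re

plain = 'x\by'    # '\b' is the backspace escape in a normal string literal
raw = r'x\by'     # the regex engine sees backslash + 'b', i.e. a word boundary

print(repr(plain), len(plain))  # 'x\x08y' 3
print(repr(raw), len(raw))      # 'x\\by' 4

# The non-raw pattern is an ordinary three-character literal, so it can match.
print(re.search(plain, 'x\x08y') is not None)  # True

# The raw pattern needs a word boundary between two word characters,
# which cannot exist, so it never matches.
print(re.search(raw, 'xy x y x\x08y'))  # None
```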

My test file was created using a formula for " ...Readable Contents And No Duplicate Lines" (on Ubuntu 16.04):

$ ruby -e 'a=STDIN.readlines;275000.times do;b=[];rand(20).times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > /tmp/longfile.txt

$ head -n5 /tmp/longfile.txt 
unavailable speedometer's garbling Zambia subcontracted fullbacks Belmont mantra's
pizzicatos carotids bitch Hernandez renovate leopard Knuth coarsen
Ramada flu occupies drippings peaces siroccos Bartók upside twiggier configurable perpetuates tapering pint paralyzed
vibraphone stoppered weirdest dispute clergy's getup perusal fork
nighties resurgence chafe
Lossa answered 12/11, 2009 at 15:46 Comment(1)
\B\b is horribly flawed performance-wise (as is every pattern which isn't anchored to a position, but this pattern is particularly bad). Try benchmarking ^\B\b instead.Thwart
L
3
(*FAIL)

or

(*F)

With PCRE and Perl you can use this backtracking control verb that forces the pattern to fail immediately.

Loney answered 12/11, 2009 at 15:46 Comment(0)
A
3

I believe that

\Z RE FAILS! \A

covers even the cases where the regular expression includes flags like MULTILINE, DOTALL etc.

>>> import re
>>> x=re.compile(r"\Z RE FAILS! \A")
>>> x.match('')
>>> x.match(' RE FAILS! ')
>>>

I believe (but I haven't benchmarked it) that whatever the length (> 0) of the string between \Z and \A, the time-to-failure should be constant.
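
The claim about flags can be spot-checked with a short script. This is a sketch, not a proof: the sample strings and the helper name `matches_anything` are made up for illustration, and it only covers the flags that Python's `re` accepts for str patterns.

```python
import itertools
import re

PATTERN = r"\Z RE FAILS! \A"

# Flags valid for str patterns (re.LOCALE is bytes-only, so it is left out).
FLAGS = [re.IGNORECASE, re.MULTILINE, re.DOTALL, re.VERBOSE, re.ASCII]

# Hypothetical sample inputs, including the literal text between the anchors.
SAMPLES = ["", " RE FAILS! ", "REFAILS!", "a\nb\n", "\n RE FAILS! \n"]

def matches_anything(pattern, samples=SAMPLES):
    """Return True if the pattern matches any sample under any flag combination."""
    for r in range(len(FLAGS) + 1):
        for combo in itertools.combinations(FLAGS, r):
            flags = 0
            for f in combo:
                flags |= f
            compiled = re.compile(pattern, flags)
            if any(compiled.search(s) for s in samples):
                return True
    return False

print(matches_anything(PATTERN))  # False
```

A check like this can only show that no counterexample was found among the samples; the actual guarantee comes from \Z and \A being unaffected by MULTILINE.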

Anomaly answered 10/12, 2009 at 18:18 Comment(0)
R
1

Here's a slight improvement on Knio's answer:

r"\A(?!)"

What this does: (?!) means "fail the match if an empty string exists at the current position in the string to be matched." In regex logic, there are empty strings everywhere in the string to be matched: before the first character, in between every pair of characters, and after the last character. Therefore (?!) always fails to match.

Adding \A improves the speed of the overall failure to match by preventing the regex engine from trying (?!) at every possible position within the string to be matched. This version will always fail to match in constant time, versus O(length of string) time for Knio's version. (Of course this is a non-issue if you're using re.match, but you might need it with re.search instead...)
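
A rough way to compare the two variants with re.search is sketched below. The input text is hypothetical, and the timings are printed rather than asserted because they depend on the regex engine and its version.

```python
import re
import timeit

UNANCHORED = re.compile(r"(?!)")
ANCHORED = re.compile(r"\A(?!)")

# Hypothetical input; any long string will do.
text = "some log line\n" * 20_000

# Neither pattern matches, with search or with match.
print(UNANCHORED.search(text), ANCHORED.search(text))  # None None
print(UNANCHORED.match(""), ANCHORED.match(""))        # None None

# Rough timing of the failure for each variant.
for name, pattern in (("(?!)", UNANCHORED), (r"\A(?!)", ANCHORED)):
    seconds = timeit.timeit(lambda: pattern.search(text), number=10)
    print(f"{name:8} {seconds:.4f} s for 10 searches")
```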

Rennin answered 12/11, 2009 at 15:46 Comment(0)
F
1

\A[^\w\W]

Works regardless of regex flags.

According to regex101, an empty input string takes 0 steps and every other input string takes exactly 2 steps.

Kotlin playground: https://pl.kotl.in/hdbNH73It

Fides answered 12/11, 2009 at 15:46 Comment(0)
M
1

Maybe this?

/$.+^/
Midge answered 4/12, 2009 at 5:46 Comment(6)
In Python, this approach works only if you control the flags: re.compile('$.+^', re.MULTILINE|re.DOTALL).search('a\nb\nc\n') returns a match object corresponding to the b and c (and all adjacent and in-between newlines). The negative-lookahead approach I recommend works (i.e., fails to match anything) for any combination of flags it could be compiled with.Lost
My bad - mixed up the $ and ^.Roguery
This may be an attempt to look for the end of a string before the beginning, but I've found that the $ doesn't mean 'end of string' unless it's the last character of the regex, and I expect a similar behaviour applies to ^, so this might match a substring starting with a literal $, and ending with a literal ^Martian
@pavium, it certainly doesn't behave that way in Python or Javascript. Unless you escape them with \ or include them in a character set with [], special characters like $ and ^ should not be treated as literals. In what language did you observe this?Musing
In Perl, at least, that should be written /\z.+\A/ (see perldoc perlre) That prevents multi-line and single-line mode (use re '/ms') from affecting it.Utilitarian
@Martian is correct -- this approach is broken as a general regexp solution. In POSIX BRE regexps ^ and $ only anchor when they are the first or last character respectively in the pattern. pubs.opengroup.org/onlinepubs/7908799/xbd/…Thwart
C
0

^_^, which never matches and fails quickly.

Canova answered 12/11, 2009 at 15:46 Comment(0)
S
0

Instead of a regex, what about just using an always-false if statement? In JavaScript:

var willAlwaysFalse=false;
if(willAlwaysFalse)
{
}
else
{
}
Skirret answered 4/12, 2009 at 5:46 Comment(1)
I added a comment in reply to Charlie's question, explaining why this sort of approach isn't desirable. In short, I need a group inside a regex that will always be used, but in some cases the group must be built to ensure it can never match.Musing
O
-1
'[^0-9a-zA-Z...]*'

and replace ... with all printable symbols ;). That's for a text file.

Obrian answered 12/11, 2009 at 15:46 Comment(4)
I think there has to be a shorter way for that, but that was my first thought too^^Pontonier
This will match the empty string. To catch every possible character, use [^\x00-\xFF]+ (for byte-based implementations).Backfill
A better expression would be [^\s\S]. But as Ferdinand Beyer already said, it would match an empty string.Banner
Drakosha's regex can match an empty string because of the *; leave that off, or replace it with +, and it has to match at least one character. If the class excludes all possible characters, it can't match anything.Satiety
K
-4

A portable solution that does not depend on the regexp implementation is to just use a constant string that you are sure will never appear in the log messages. For instance, make a string based on the following:

cat /dev/urandom | hexdump | head -20
0000000 5d5d 3607 40d8 d7ab ce72 aae1 4eb3 ae47
0000010 c5e2 b9e8 910d a2d9 2eb3 fdff 6301 c85f
0000020 35d4 c282 e439 33d8 1c73 ca78 1e4d a569
0000030 8aca eb3c cbe4 aff7 d079 ca38 8831 15a5
0000040 818b 323f 0b02 caec f17f 387b 3995 88da
0000050 7b02 c80b 2d42 8087 9758 f56f b71f 0053
0000060 1501 35c9 0965 2c6e 03fe 7c6d f0ca e547
0000070 aba0 d5b6 c1d9 9bb2 fcd1 5ec7 ee9d 9963
0000080 6f0a 2c91 39c2 3587 c060 faa7 4ea4 1efd
0000090 6738 1a4c 3037 ed28 f62f 20fa 3d57 3cc0
00000a0 34f0 4bc2 3067 a1f7 9a87 086b 2876 1072
00000b0 d9e1 6b8f 5432 a60e f0f5 00b5 d9ef ed6f
00000c0 4a85 70ee 5ec4 a378 7786 927f f126 2ec2
00000d0 18c5 46fe b167 1ae6 c87c 1497 48c9 3c09
00000e0 8d09 e945 13ce 7da2 08af 1a96 c24c c022
00000f0 b051 98b3 2bf5 4d7d 5ec4 e016 a50d 355b
0000100 0e89 d9dd b153 9f0e 9a42 a51f 2d46 2435
0000110 ef35 17c2 d2aa 3cc7 e2c3 e711 d229 f108
0000120 324e 5d6a 650a d151 bc55 963f 41d3 66ee
0000130 1d8c 1fb1 1137 29b2 abf7 3af7 51fe 3cf4

Sure, this is not an intellectual challenge, but more like duct tape programming.
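
The same idea could be sketched in Python roughly as follows, assuming the standard `secrets` module as the source of randomness instead of /dev/urandom. The log lines are made up for illustration. Note that such a pattern is only overwhelmingly unlikely to match, not incapable of matching.

```python
import re
import secrets

# 32 random bytes rendered as 64 hex characters. Hex digits need no escaping,
# but re.escape keeps the sketch safe if the token format ever changes.
token = secrets.token_hex(32)
never_pattern = re.compile(re.escape(token))

# Hypothetical log lines for illustration.
log_lines = ["INFO service started", "ERROR disk full", ""]
print(any(never_pattern.search(line) for line in log_lines))  # False
```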

Kingfisher answered 12/11, 2009 at 15:46 Comment(0)
P
-8
new Regex(Guid.NewGuid().ToString())

Creates a pattern containing only alphanumerics and '-' (none of which are regex special characters) but it is statistically impossible for the same string to have appeared anywhere before (because that's the whole point of a GUID.)

Padauk answered 12/11, 2009 at 15:46 Comment(1)
"Statistically impossible"? Huh? Depending on how the GUID is computed, it is possible and often quite simple to predict the next GUIDs (as they depend on the machine computing them and the time). You mean "unlikely", "with a very small probability", but you cannot say "impossible" even for perfectly random strings. Your Regex will match an infinite number of strings -- this question is looking for one that won't match anything. Ever.Backfill
