How do I turn any regex into an complement of itself without complex hand editing?

Asked 20/10, 2010 at 11:52 Answered 20/10, 2010 at 16:13

The following are pseudo examples, not real regex, but still an example of what I mean:

.* (anything)

-.* (NOT anything)

[A-Z] (Any letter A to Z, caps only)

-[A-Z] (NOT any letter A to Z, caps only)

EDIT: Changed inverse into complement in the question. Here's where the change was made: "turn any regex into an complement of itself "

Hidden answered 20/10, 2010 at 11:52 Comment(0)

First of all, I believe you mean the complement of a regular expression, not it's inverse. The inverse of a regular expression doesn't make much sense; but if viewed as a function, I suppose you could say that the inverse of the matcher is the generator which generates all matching strings - or something. On the other hand, the complement of a language is all those strings not in the original language.

Then, there are two views to consider here:

Fundamentally

The complement of a regular language is regular. That means it's possible to generate an accepting DFA for the complement (and doing so is very simple, actually: just swap the non-accepting state set with the accepting state set). Any such DFA can be expressed as a regular expression - so in principle you can indeed make such a regex.

See the wikipedia article on Regular Languages as a starting point.

Practically

The typical perl-compatible regex syntax used in most modern languages nowadays does not have a complementation operator. For a complete regex, you can get something similar by using the negative lookahead operator: (?!X) will match a string precisely when X will not. However, this is a poor replacement for complement operator as you will not be able to use it as a part of a larger regex in the usual fashion; this regex doesn't "consume" input which means it behaves differently in conjunction with other operators.

For example, if you match numeric strings as [0-9]*, to match the entire string you'd prepend ^ and append $, but to use this technique to find the complement you'd need to write ^(?!^[0-9]*$).*$ - and the usual concatenation of such a negated regex is, as far as I can tell, undoable.

Somewhat ironically, the practical incarnation of regexes is theoretically more powerful due to backreferences, but practically less flexible since the language can't quite express the complement and intersection operations easily.

Masry answered 20/10, 2010 at 16:13 Comment(4)

@Eamon_Nerbonne: +1 ...and thank you for posting an answer, AND reviewing the other answers too! – Hidden 20/10, 2010 at 16:44

Don't you have to first convert the NFA to a DFA? Since an NFA accepts if any path through it ends at an accept state, you could have a string that terminates at both a non-accepting and accepting state. Reversing the two would result in the same situation, and the string would be accepted by both automata. – Travelled 24/3, 2012 at 23:39

indeed: fixed. That's actually a fairly major problem, because the conversion is exponential. I wonder if there's an alternative that avoids that issue... – Masry 3/4, 2012 at 10:38

Hmm springerlink.com/content/v7udml0tcgt0l7pj suggests that the worst case complement of an NFA is indeed exponential. – Masry 3/4, 2012 at 12:26

Just run the regex and logically invert the output. So change:

if(/foo/)

to:

if(!/foo/)

Character classes can be inverted with a leading carat:

[A-Z] -> [^A-Z]

A lot of the special characters have inverses, too, if you capitalize the specifier.

\s whitespace
\S non-whitespace
\w word character
\W non-word-character
\d digit
\D non-digit

Evaleen answered 20/10, 2010 at 11:57 Comment(3)

perfectly good answer, and clearly the simplest way of doing it, though it obviously only applies if you're talking about the whole regex being inverted; it won't work if you want to negate just a part of the regex. – Poon 20/10, 2010 at 12:2

@Eyal: Still thinking over your answer, which is different, but nice. Still pretty sure I need an inverse of the regex, not the output. Thanks for posting! – Hidden 20/10, 2010 at 12:2

"Need" for what? Homework? It's not clear what an inverse would be anyway. – Evaleen 20/10, 2010 at 12:27

Several variations to consider:

Match a string that consists of a certain set of characters: ^[a-z]*$

Match a string that consists of anything but a certain set of characters: ^[^a-z]*$

Note that there are some shortcuts:

\w: any alphanumeric character (including _),
\W: any non-alphanumeric character;
\s: any whitespace character,
\S: any non-whitespace character,
\d: any digit,
\D: any non-digit.

This can get quite complicated, for example if you want...

only non-letters: [\d_\W], or
only letters: [^\d_\W] (i. e. "not a digit, not a _, and not a non-alphanumeric character)

Match a string that contains a substring: ^.*substring.*$

Match a string that does not contain a substring: ^(?:(?!substring).)*$

Note how we have to check each position in the string for the "not-presence" of the substring. You can also substitute any regex for substring to match strings that contain or don't contain a certain sub-regex.

Match anything: .* (if you want to also match newlines, you'll have to set the corresponding option of your programming language, e. g. re.DOTALL in Python)

Match anything if you don't know how to set that option: [\s\S]*

Never match anything (for whatever reason):

$^ (i. e. match the end of the string before the start of the string),
\b\B (match a position where there is at the same time a word boundary and not a word boundary) or
(?!) (match a position where it's impossible to match the empty string).

Dominique answered 20/10, 2010 at 12:42 Comment(0)

By using negative look-ahead you will be able to handle most of the basic cases

/(?!(OriginalRegex)).*?/

Illyes answered 20/10, 2010 at 12:25 Comment(1)

@Colin_Hebert: How would I know when it would not handle "handle most of the basic cases." Thanks! – Hidden 20/10, 2010 at 13:16

You first example makes no sense, but for the second you could use class character negation:

[a-z] --> [^a-z]

Unharness answered 20/10, 2010 at 11:54 Comment(1)

@SilentGhost: Ha, I know it makes no sense, that's why I listed it as an example. Thank for pointing out class character negation, and giving an example! – Hidden 20/10, 2010 at 11:57

I am trying to understand the definition of inverse of a regex.

match(input, regular_expression) = {match1, match2, ..., matchN}

How would the inverse work? Should it be something like

match(input, inverse_regular_expression) = {imatch1, imatch2, ..., imatchN}

If so, what is the relationship between the first set of results and the second? If not, then what is it?

Limit answered 20/10, 2010 at 12:17 Comment(0)

Fundamentally

Practically

Recommended topics

Hot tags