Is there a way to negate a regular expression?
Asked Answered
W

2

16

Given a regular expression R that describes a regular language (no fancy backreferences). Is there an algorithmic way to construct a regular expression R* that describes the language of all words except those described by R? It should be possible as Wikipedia says:

The regular languages are closed under the various operations, that is, if the languages K and L are regular, so is the result of the following operations: […] the complement ¬L

For example, given the alphabet {a,b,c}, the inverse of the language (abc*)+ is (a|(ac|b|c).*)?


As DPenner has already pointed out in the comments, the inverse of a regular expresion can be exponentially larger than the original expression. This makes inversing regular expressions unsuitable to implement negative partial expression syntax for searching purposes. Is there an algorithm that preserves the O(n*m) runtime characteristic (where n is the size of the regex and m is the length of the input) of regular expression matching and allows for negated subexpressions?

Windfall answered 11/3, 2013 at 11:30 Comment(33)
Construct the NFA from regex, construct the DFA from NFA (there should be an algo for this), turn the non-terminal states into terminal states and vice-versa, then derive the regex from DFA.Eyelet
@Eyelet That sounds like an interesting idea; the problem is, that a NFA of length O(n) may correspond to a DFA of length O(2^n).Windfall
I didn't go through proper study on this topic, but I can't see how an NFA would generate a bigger DFA. Can you show me one such case?Eyelet
@Eyelet Each state of the DFA associated with an NFA is a reachable member of the powerset of the NFA. This makes an upper bound of 2^n states where n is the number of states of the NFA. It is possible to reach this by constructing a regular language where each subset of NFA states can be reached. There was a simple example somewhere but I just can't find it.Windfall
See the relevant question on theoretical CS stack overflow. NFA->DFA->inversion->regex is worst-case optimal.Submiss
Your Wikipedia quote doesn't match up with your description of the problem. The reverse of a language is the set of all strings from that language reversed. i.e. if L = {ab, abc, aba}, the reverse, L^R = {ba, cba, aba}. Regardless though, its still true that the negation (or more technically the complement) is still regular.Conductive
@DPenner I am sorry. I may have misread that article.Windfall
@FUZxxl That's alright, formal languages can be confusing! As you rightly point out though, the standard algorithm for negating a regular expression is O(2^n) in space, and would be impractical for languages with a large amount of states in its NFA.Conductive
This is certainly an interesting question, but is this really a programming question? Wouldn't this be a better fit for ComputerScience? (Especially since you seem uninterested in working, if not mathematically pure, solutions?)Guillen
@Cyborgx37 Let me quote the FAQ: We feel the best Stack Overflow questions have a bit of source code in them, but if your question generally covers … • a specific programming problem • a software algorithmWindfall
It all boils down to a programming question. You can do a lot of funny things with plain (aka really regular) regular expresion that you can't do with "enhanced" regex, such as letting an untrusted user use them to search stuff, since they have a guaranteed linear runtime. All toolkits for plain regex I know don't support inverse matching, even though the could be converted to plain regex. It would be nice to know whether there is an algorithm that allows to implement a "match inverse" operator without breaking the linear time complexity.Windfall
@Cyborgx37 “… programming related …”. ’nuff said.Homophonous
@FUZxxl: For your second part, in practice, it will usually involve checking whether there is a match, then logically negate the result, or use negative look-ahead construct in the (ir)regular expression.Eyelet
@Eyelet Look ahead is not part of plain regex. It might break linear complexity.Windfall
@FUZxxl: Consider an extension of plain regex, where you can have plain regex, OR negative look-ahead assertion which can only contain plain regex inside (not even another assertion inside). I think the complexity will be the same. The complexity is only broken when you have a wild mix of plain regex and look-ahead.Eyelet
@FUZxxl - Ok, I buy it. Ponder, though, that "you can't prove a negative". In most branches of science, it is most often the case that a question must be stated as a positive and then the answer given in terms of evidence to support it (or lack thereof). I believe you'll find the same is true with all but the most trivial regex. It is much easier to express what you are looking for, then negate the result.Guillen
@Cyborgx37 As nhahtdh wrote in the very first comment, it is indeed easily (in the sense that the algorithm is simple) possible to compute the inverse of a regex. It's just not easy (in the sense that it has a high computational complexity).Windfall
@Eyelet Wouldn't that requite backtracking? The key to a linear time implementation is to avoid backtracking as hell.Windfall
The problem with my extension is that, while it is useful to negate some plain regex, it breaks the property where 2 regex can be freely concatenated, OR'ed together - which I think is the motivation for finding the plain regex which is the negation of another plain regex.Eyelet
@FUZxxl: Assuming that you can match a string with linear complexity (or confirm that there is no match) with a plain regex, the negative look-ahead simply negate the result of matching. (If the assumption is wrong, then even plain regex will have trouble checking that there is no match).Eyelet
@Eyelet I can't really imagine how this should work. Could ou post a more detailed sketch of the algorithm in an answer?Windfall
@FUZxxl: It is not an algorithm or anything. Simply put, it is a form of checking whether there is a match, then logically negate the result (has match --> false, no match --> true).Eyelet
@Eyelet This seems to fail when the negated expression is part of a larger expression.Windfall
@FUZxxl: Of course it will fail (check my "formal" definition of the extension, and the caveat a few comments later)Eyelet
Yes Complement of a regular language is regular and there is an algorithmic way possible to write a complement regular expression for a regular expression. Additionally as DPenner says about inefficiency. We can always good RE according to minimized DFA. So for an RE there we can evaluate RE that is capable to accept complement language.Antepast
What do you mean: »We can always good RE according to minimized DFA«Windfall
@FUZxxl no! that why I use word good instead of minimized.Antepast
@GrijeshChauhan I still don't really get what you want to tell me. I am sorry. I have problems parsing your grammar.Windfall
@FUZxxl OK, (1) Complement of Regular Language(RE) is regular (2) to find RE of complement language. do like RE-->NFA-->DFA-->minimized DFA --> COMPLEMENT DFA --> RE. (a) RE-->NFA is algorithmic LEX parser (b) NFA --> DFA is algorithm (c) DFA --> MINIMIZED DFA again algorithmic (both b and c school time assignments) (d) DFA --> COMPLEMENT DFA is algorithmic (by reading my answer I lined you can realized) (e) complement DFA-->RE is algorithm using ARDEN'S THEOREM. So What I mean to say for a given RE you can find RE for complement.Antepast
@FUZxxl because for complement DFA we need to make complement DFA (not partial) That is reason we gets complicated RE for complement language. But We can do some reduction work by MINIMIZE complement DFAAntepast
@Grijesh Thanks for summing up what all the others said before. The problem is that the DFA's size might be exponential to the size of the NFA, causing the complement regex to have a size exponential to the original regex.Windfall
@FUZxxl Yes but you can always minimized NFA into DFA. Also What I means complete DFA always comes with more states.Antepast
@GrijeshChauhan Tell me please how this reduces the length of the generated regular expression or the intermediate DFA.Windfall
L
4

Unfortunately, the answer given by nhahdtdh in the comments is as good as we can do (so far). Whether a given regular expression generates all strings is PSPACE-complete. Since all problems in NP are in PSPACE-complete, an efficient solution to the universality problem would imply that P=NP.

If there were an efficient solution to your problem, would you be able to resolve the universality problem? Sure you would.

  1. Use your efficient algorithm to generate a regular expression for the negation;
  2. Determine whether the resulting regular expression generates the empty set.

Note that the problem "given a regular expression, does it generate the empty set" is fairly straightforward:

  1. The regular expression {} generates the empty set.
  2. (r + s) generates the empty set iff both r and s generate the empty set.
  3. (rs) generates the empty set iff either r or s generates the empty set.
  4. Nothing else generates the empty set.

Basically, it's pretty easy to tell whether a regular expression generates the empty set: just start evaluating the regular expression.

(Note that while the above procedure is efficient in terms of the output length, it might not be efficient in terms of the input length, if the output length is more than polynomially faster than the input length. However, if that were the case, we'd have the same result anyway, i.e., that your algorithm isn't really efficient, since it would take exponentially many steps to generate an exponentially longer output from a given input).

Leandroleaning answered 11/3, 2013 at 19:36 Comment(3)
Whether a given regular expression generates all strings - The problem description is quite vague here. I don't know what you are talking about.Eyelet
Thank you for the answer. Could you, if you like, also investigate into the second part of my question?Windfall
"Whether a given regular expression generates all strings is PSPACE-complete" I'm looking for a proof of that claim, do you have one?Numbing
P
1

Wikipedia says: ... if there exists at least one regex that matches a particular set then there exist an infinite number of such expressions. We can deduct from this statement that there is an infinite number of expressions that describe the language of all words except those described by R.

Again, (as also @nhahtdh tried to explain) the simplest algorithm to address this question is to extend the scope of evaluation outside the context of the regular expression language itself. That is: match the strings you want to exclude (which represent a finite subset to work with) by using the original regular expression and then treat any failure to match as an actual match (out of an infinite set of other possibilities). So, if the result of the match is negative, your candidate strings are a subset of the valid solutions.

Persuasion answered 11/3, 2013 at 20:42 Comment(3)
then compare it with the finite set of candidates you provide Not really. Just try to match the string with the (plain) regex. If fail to match = a match. This only works if you want to particularly negate a whole plain regex. It cannot concatenate.Eyelet
@nhahtdh, I tried to reformulate the answer. Anywho, I don't see a real application of the question.Persuasion
Check this commentEyelet

© 2022 - 2024 — McMap. All rights reserved.