How to determine if a regex is orthogonal to another regex?

Asked 28/1, 2009 at 20:2 Answered 9/10, 2012 at 23:30

I guess my question is best explained with an (simplified) example.

Regex 1:

^\d+_[a-z]+$

Regex 2:

^\d*$

Regex 1 will never match a string where regex 2 matches. So let's say that regex 1 is orthogonal to regex 2.

As many people asked what I meant by orthogonal I'll try to clarify it:

Let S1 be the (infinite) set of strings where regex 1 matches. S2 is the set of strings where regex 2 matches. Regex 2 is orthogonal to regex 1 iff the intersection of S1 and S2 is empty. The regex ^\d_a$ would be not orthogonal as the string '2_a' is in the set S1 and S2.

How can it be programmatically determined, if two regexes are orthogonal to each other?

Best case would be some library that implements a method like:

/**
 * @return True if the regex is orthogonal (i.e. "intersection is empty"), False otherwise or Null if it can't be determined
 */
public Boolean isRegexOrthogonal(Pattern regex1, Pattern regex2);

Calibre answered 28/1, 2009 at 20:2 Comment(0)

I finally found exactly the library that I was looking for:

dk.brics.automaton

Usage:

/**
 * @return true if the two regexes will never both match a given string
 */
public boolean isRegexOrthogonal( String regex1, String regex2 ) {
   Automaton automaton1 = new RegExp(regex1).toAutomaton();
   Automaton automaton2 = new RegExp(regex2).toAutomaton();
   return automaton1.intersection(automaton2).isEmpty();
}

It should be noted that the implementation doesn't and can't support complex RegEx features like back references. See the blog post "A Faster Java Regex Package" which introduces dk.brics.automaton.

Calibre answered 9/10, 2012 at 23:30 Comment(0)

By "Orthogonal" you mean "the intersection is the empty set" I take it?

I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...

Then again, I'm a theorist...

Couteau answered 28/1, 2009 at 20:38 Comment(4)

Are there algorithms to compute the regular expression for the intersection of two given regular expressions, and then to convert it to normal form, or would those be manual operations? – Cartier 28/1, 2009 at 20:55

there are algorithms. I don't have any of my theory texts here at work, but in the worst case it involves converting to DFAs, and then creating an intersection DFA that keeps track of which state each of the other DFAs would be in. then you convert back to regular grammar in normal form... – Couteau 28/1, 2009 at 21:40

Any Theory of Computation text should have it. I don't have any online references handy... – Couteau 30/1, 2009 at 19:12

I'm not a theorist, and my first thought was to convert the RE to a DFA or NFA (probably DFA) and go from there... – Sukkah 1/2, 2009 at 17:53

I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...

That seems like shooting sparrows with a cannon. Why not just construct the product automaton and check if an accept state is reachable from the initial state? That'll also give you a string in the intersection straight away without having to construct a regular expression first.

I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem.

I only know of a way to do it which involves creating a DFA from a regexp, which is exponential time (in the degenerate case). It's reducible to the halting problem, because everything is, but the halting problem is not reducible to it.

If the last, then you can use the fact that any RE can be translated into a finite state machine. Two finite state machines are equal if they have the same set of nodes, with the same arcs connecting those nodes.

So, given what I think you're using as a definition for orthogonal, if you translate your REs into FSMs and those FSMs are not equal, the REs are orthogonal.

That's not correct. You can have two DFAs (FSMs) that are non-isomorphic in the edge-labeled multigraph sense, but accept the same languages. Also, were that not the case, your test would check whether two regexps accepted non-identical, whereas OP wants non-overlapping languages (empty intersection).

Also, be aware that the \1, \2, ..., \9 construction is not regular: it can't be expressed in terms of concatenation, union and * (Kleene star). If you want to include back substitution, I don't know what the answer is. Also of interest is the fact that the corresponding problem for context-free languages is undecidable: there is no algorithm which takes two context-free grammars G1 and G2 and returns true iff L(G1) ∩ L(g2) ≠ Ø.

Monophonic answered 1/2, 2009 at 17:9 Comment(8)

Excellent point on the \1, \2 bit... that's context free, and so not solvable. Minor point: Not EVERYTHING is reducible to Halt... Program Equivalence for example.. – Couteau 3/2, 2009 at 14:49

@Brian Postow. Surely everything is reducible to the halting problem, including program equivalence (in the sense that I can guarantee that, if you give me a program H that solves the halting problem, I will be able to give you a program that solves the program equivalence problem)? – Septennial 21/6, 2010 at 21:17

@psmears. Nope. Even if you gave me a program that solves the halting problem I STILL wouldn't be able to solve program equivalence. there's an entire hierarchy of things that are harder than Halt... – Couteau 22/6, 2010 at 13:31

@Septennial Unless, you're talking in the False -> False is True sense... but that's not how reduction works... – Couteau 22/6, 2010 at 13:34

@Brian Postow: When you say "that's not how reduction works", you have to define what sort of reduction you mean :) There are a number of possible senses; one is "given a program P (satisfying certain properties) that solves problem A, transform it to a program P' solving problem B"; another involves transforming problem instances; another involves creating a program with access to an oracle. Under different definitions, "does X reduce to Y" will have different answers for the same X & Y. My point was that there does exist a meaningful sense in which the original statement is true :) – Septennial 22/6, 2010 at 15:0

@psmears: perhaps, but no one who studies reductions on unsolvable problems considers "there is no program for Halt, therefore any statement starting with 'if you are given a program for halt, then you can....' is true" to be a reasonable statement of reduction... – Couteau 22/6, 2010 at 18:30

@Brian Postow: Of course they would, if they have any sort of mathematical background! The reason the statement sounds strange is that the useful way of using this definition of reduction is the other way round (i.e. if I have a problem P, and you can show that the halting problem reduces to P, then we know straight away that P is undecidable). I think this is the point that was originally being made - i.e. that other people were talking about reduction the wrong way round ("This problem reduces to the halting problem" doesn't tell us what we actually want to know :). – Septennial 22/6, 2010 at 22:19

@psmears. It depends on what we want to know... It's still not true that "everything reduces to the halting problem" using normal definitions of "reduction". NORMALLY, we look at Halt reducing to other problems, you're right, but in other parts of complexity theory, we look at things reducing to Halt... and program equivalence does not reduce to Halt. – Couteau 23/6, 2010 at 15:6

It's been two years since this question was posted, but I'm happy to say this can be determined now simply by calling the "genex" program here: https://github.com/audreyt/regex-genex

$ ./binaries/osx/genex '^\d+_[a-z]+$' '^\d*$'
$

The empty output means there is no strings that matches both regex. If they have any overlap, it will output the entire list of overlaps:

$ runghc Main.hs '\d' '[123abc]' 
1.00000000      "2"
1.00000000      "3"
1.00000000      "1"

Hope this helps!

Lynsey answered 23/5, 2011 at 16:1 Comment(3)

very nice! thanks. I've tested a few examples but I noticed that some work not as I expected. Example: runghc Main.hs '710/([a-z]){4,8}/.*.jpg' '710.*' just returns nothing. – Calibre 27/5, 2011 at 0:23

Right, that's because the ".*" was translated as {0,3} by default. Try this: genex '710/([a-z]){4,8}/.*.jpg' '710.{4,10}' Hope it helps! – Lynsey 29/5, 2011 at 14:44

@Lynsey That seems like cheating. Does this mean your approach couldn't prove that these regex don't intersect: ^1(11)*$ and ^(11)*$? – John 11/10, 2011 at 21:12

The fsmtools can do all kinds of operations on finite state machines, your only problem would be to convert the string representation of the regular expression into the format the fsmtools can work with. This is definitely possible for simple cases, but will be tricky in the presence of advanced features like look{ahead,behind}.

You might also have a look at OpenFst, although I've never used it. It supports intersection, though.

Lizethlizette answered 1/2, 2009 at 16:36 Comment(0)

Excellent point on the \1, \2 bit... that's context free, and so not solvable. Minor point: Not EVERYTHING is reducible to Halt... Program Equivalence for example.. – Brian Postow

[I'm replying to a comment]

IIRC, a^n b^m a^n b^m is not context free, and so (a\*)(b\*)\1\2 isn't either since it's the same. ISTR { ww | w ∈ L } not being "nice" even if L is "nice", for nice being one of regular, context-free.

I modify my statement: everything in RE is reducible to the halting problem ;-)

Monophonic answered 4/2, 2009 at 8:46 Comment(2)

You are correct on all counts. \1, etc makes the language context sensitive. I don't THINK it makes it turing complete, but I'd have to think a little harder on that. – Couteau 4/2, 2009 at 16:59

Pattern matching with back references is NP-complete---so sayeth en.wikipedia.org/wiki/Regular_expression#cite_ref-Aho90_24-0, referencing Theorem 6.2 of {Aho, Alfred V. (1990). "Algorithms for finding patterns in strings". In van Leeuwen, Jan. Handbook of Theoretical Computer Science, volume A: Algorithms and Complexity. The MIT Press. pp. 255–300}. – Paleoasiatic 3/2, 2014 at 20:56

I finally found exactly the library that I was looking for:

dk.brics.automaton

Usage:

/**
 * @return true if the two regexes will never both match a given string
 */
public boolean isRegexOrthogonal( String regex1, String regex2 ) {
   Automaton automaton1 = new RegExp(regex1).toAutomaton();
   Automaton automaton2 = new RegExp(regex2).toAutomaton();
   return automaton1.intersection(automaton2).isEmpty();
}

It should be noted that the implementation doesn't and can't support complex RegEx features like back references. See the blog post "A Faster Java Regex Package" which introduces dk.brics.automaton.

Calibre answered 9/10, 2012 at 23:30 Comment(0)

You can maybe use something like Regexp::Genex to generate test strings to match a specified regex and then use the test string on the 2nd regex to determine whether the 2 regexes are orthogonal.

Ophidian answered 28/1, 2009 at 20:17 Comment(2)

If I get you right, than it's fuzzing what you propose. Actually that's not what I'm looking for. – Calibre 28/1, 2009 at 21:15

Yup, definitely not a proof, just two blackbox tests. – Ophidian 28/1, 2009 at 21:29

Proving that one regular expression is orthogonal to another can be trivial in some cases, such as mutually exclusive character groups in the same locations. For any but the simplest regular expressions this is a nontrivial problem. For serious expressions, with groups and backreferences, I would go so far as to say that this may be impossible.

Herzog answered 28/1, 2009 at 20:23 Comment(0)

I believe kdgregory is correct you're using Orthogonal to mean Complement.

Is this correct?

Ancy answered 28/1, 2009 at 20:44 Comment(3)

I don't think he means complement, but that the intersection is empty. – Writhe 28/1, 2009 at 21:3

Intersection tends to be commutative(?), in that it applies both ways. The relationship this author is looking for probably does not have a single common name, but he explicitly stated that it is a one way relationship. – Herzog 28/1, 2009 at 21:41

"The relationship this author is looking for probably does not have a single common name" -- How about `disjoint'? Also, orthogonal is most often symmetric (~= commutative), so it seems like a reasonable guess. – Paleoasiatic 5/2, 2009 at 10:37

Let me start by saying that I have no idea how to construct such an algorithm, nor am I aware of any library that implements it. However, I would not be at all surprised to learn that nonesuch exists for general regular expressions of arbitrary complexity.

Every regular expression defines a regular language of all the strings that can be generated by the expression, or if you prefer, of all the strings that are "matched by" the regular expression. Think of the language as a set of strings. In most cases, the set will be infinitely large. Your question asks whether the intersections of the two sets given by the regular expressions is empty or not.

At least to a first approximation, I can't imagine a way to answer that question without computing the sets, which for infinite sets will take longer than you have. I think there might be a way to compute a limited set and determine when a pattern is being elaborated beyond what is required by the other regex, but it would not be straightforward.

For example, just consider the simple expressions (ab)* and (aba)*b. What is the algorithm that will decide to generate abab from the first expression and then stop, without checking ababab, abababab, etc. because they will never work? You can't just generate strings and check until a match is found because that would never complete when the languages are disjoint. I can't imagine anything that would work in the general case, but then there are folks much better than me at this kind of thing.

All in all, this is a hard problem. I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem. Although, given that regular expressions are not Turing complete, it seems at least possible that a solution exists.

Cartier answered 28/1, 2009 at 20:46 Comment(1)

well that's why I posted this question here ;) – Calibre 28/1, 2009 at 21:19

I would do the following:

convert each regex to a FSA, using something like the following structure:

struct FSANode
{
    bool accept;
    Map<char, FSANode> links;
}
List<FSANode> nodes;
FSANode start;

Note that this isn't trivial, but for simple regex shouldn't be that difficult.

Make a new Combined Node like:

class CombinedNode
{
    CombinedNode(FSANode left, FSANode right)
    {
        this.left = left;
        this.right = right;
    }

    Map<char, CombinedNode> links;
    bool valid { get { return !left.accept || !right.accept; } }

    public FSANode left;
    public FSANode right;
}

Build up links based on following the same char on the left and right sides, and you get two FSANodes which make a new CombinedNode.

Then start at CombinedNode(leftStart, rightStart), and find the spanning set, and if there are any non-valid CombinedNodes, the set isn't "orthogonal."

Writhe answered 28/1, 2009 at 20:56 Comment(0)

Convert each regular expression into a DFA. From the accept state of one DFA create an epsilon transition to the start state of the second DFA. You will in effect have created an NFA by adding the epsilon transition. Then convert the NFA into a DFA. If the start state is not the accept state, and the accept state is reachable, then the two regular expressions are not "orthogonal." (Since their intersection is non-empty.)

There are know procedures for converting a regular expression to a DFA, and converting an NFA to a DFA. You could look at a book like "Introduction to the Theory of Computation" by Sipser for the procedures, or just search around the web. No doubt many undergrads and grads had to do this for one "theory" class or another.

Uzziel answered 1/2, 2009 at 17:21 Comment(1)

This assumes that the strings in the language are strictly concatinations. Consider the languages "ab+" and "ab*" over the alphabet {a,b}. All strings in "ab+" are in the intersection. Pretty much the only string matched by "ab*" not matched by "ab+" is the string "a". Your construction, assuming I am understanding it correctly would produce the language "ab+(ab*)*" would it not? – Edroi 19/4, 2018 at 2:9

I spoke too soon. What I said in my original post would not work out, but there is a procedure for what you are trying to do if you can convert your regular expressions into DFA form.

You can find the procedure in the book I mentioned in my first post: "Introduction to the Theory of Computation" 2nd edition by Sipser. It's on page 46, with details in the footnote.

The procedure would give you a new DFA that is the intersection of the two DFAs. If the new DFA had a reachable accept state then the intersection is non-empty.

Uzziel answered 1/2, 2009 at 17:51 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags