How to tell if one regular expression matches a subset of another regular expression?
Asked Answered
L

7

21

I'm just wondering if it's possible to use one regular expression to match another, that is some sort of:

['a-z'].match(['b-x'])
True

['m-n'].match(['0-9'])
False

Is this sort of thing possible with regex at all? I'm doing work in python, so any advice specific to the re module's implementation would help, but I'll take anything I can get concerning regex.

Edit: Ok, some clarification is obviously in order! I definitely know that normal matching syntax would look something like this:

expr = re.compile(r'[a-z]*')
string = "some words"
expr.match(string)
<sRE object blah blah>

but I'm wondering if regular expressions have the capability to match other, less specific expressions in the non-syntacticly correct version I tried to explain with above, any letter from b-x would always be a subset (match) of any letter from a-z. I know just from trying that this isn't something you can do by just calling the match of one compiled expression on another compiled expression, but the question remains: is this at all possible?

Let me know if this still isn't clear.

Lighthearted answered 15/6, 2011 at 19:47 Comment(5)
You mean will one regex provide same or subset of matches as another?Scaler
Matching with re is written in 2 different forms: re.match(regex, string) or re.compile(regex).match(string). Could you please correct the code you provide because what you want to achieve is unclear.Swiftlet
Each regular expression matches a set of strings (an infinite set for some regexps). Do you want to know whether the two sets overlap? Or whether the second set is a subset of the first? (I'm not sure how to do either way but I think it needs to be clarified.)Grisaille
Do you mean regular expressions equivalence?Vaporing
this library claims support for mathematical regexes, which would mean that you can do union on them: leafstorm/lexingtonBabylon
G
13

I think — in theory — to tell whether regexp A matches a subset of what regexp B matches, an algorithm could:

  1. Compute the minimal Deterministic Finite Automaton of B and also of the "union" A|B.
  2. Check if the two DFAs are identical. This is true if and only if A matches a subset of what B matches.

However, it would likely be a major project to do this in practice. There are explanations such as Constructing a minimum-state DFA from a Regular Expression but they only tend to consider mathematically pure regexps. You would also have to handle the extensions that Python adds for convenience. Moreover, if any of the extensions cause the language to be non-regular (I am not sure if this is the case) you might not be able to handle those ones.

But what are you trying to do? Perhaps there's an easier approach...?

Grisaille answered 15/6, 2011 at 21:3 Comment(3)
The extensions are not just for convenience, and they make the problem undecidable.Charland
that link doesn't work for me. "You don't have permission to access /~jdonalds/331/lecture05.html on this server." :(Honewort
Here's an archive of the link: web.archive.org/web/20120702185839/http://www.cs.oberlin.edu/…Grisaille
F
7

Verification of the post by "antinome" using two regex : 55* and 5* :

REGEX_A: 55* [This matches "5", "55", "555" etc. and does NOT match "4" , "54" etc]

REGEX_B: 5* [This matches "", "5" "55", "555" etc. and does NOT match "4" , "54" etc]

[Here we've assumed that 55* is not implicitly .55.* and 5* is not .5.* - This is why 5* does not match 4]

REGEX_A can have an NFA as below:

  {A}--5-->{B}--epsilon-->{C}--5-->{D}--epsilon-->{E}
           {B} -----------------epsilon --------> {E} 
                          {C} <--- epsilon ------ {E}

REGEX_B can have an NFA as below:

  {A}--epsilon-->{B}--5-->{C}--epsilon-->{D}
  {A} --------------epsilon -----------> {D} 
                 {B} <--- epsilon ------ {D}

Now we can derive NFA * DFA of (REGEX_A|REGEX_B) as below:

  NFA:
  {state A}  ---epsilon --> {state B} ---5--> {state C} ---5--> {state D}
                                              {state C} ---epsilon --> {state D} 
                                              {state C} <---epsilon -- {state D}
  {state A}  ---epsilon --> {state E} ---5--> {state F}
                            {state E} ---epsilon --> {state F} 
                            {state E} <---epsilon -- {state F}

  NFA -> DFA:

       |   5          |  epsilon*
   ----+--------------+--------
    A  |  B,C,E,F,G   |   A,C,E,F
    B  |  C,D,E,F     |   B,C,E,F
    c  |  C,D,E,F     |   C
    D  |  C,D,E,F,G   |   C,D,E,F
    E  |  C,D,E,F,G   |   C,E,F
    F  |  C,E,F,G     |   F
    G  |  C,D,E,G     |   C,E,F,G

                    5(epsilon*)
    -------------+---------------------
              A  |  B,C,E,F,G 
      B,C,E,F,G  |  C,D,E,F,G 
      C,D,E,F,G  |  C,D,E,F,G 

    Finally the DFA for (REGEX_A|REGEX_B) is:
         {A}--5--->{B,C,E,F,G}--5--->{C,D,E,F,G}
                                     {C,D,E,F,G}---5--> {C,D,E,F,G}

         Note: {A} is start state and {C,D,E,F,G} is accepting state. 

Similarly DFA for REGEX_A (55*) is:

       |   5    |  epsilon*
   ----+--------+--------
    A  | B,C,E  |   A
    B  | C,D,E  |   B,C,E
    C  | C,D,E  |   C
    D  | C,D,E  |   C,D,E
    E  | C,D,E  |   C,E


            5(epsilon*)
   -------+---------------------
       A  |  B,C,E  
   B,C,E  |  C,D,E
   C,D,E  |  C,D,E

    {A} ---- 5 -----> {B,C,E}--5--->{C,D,E}
                                    {C,D,E}--5--->{C,D,E}
Note: {A} is start state and {C,D,E} is accepting state

Similarly DFA for REGEX_B (5*) is:

       |   5    |  epsilon*
   ----+--------+--------
    A  | B,C,D  |   A,B,D
    B  | B,C,D  |   B
    C  | B,C,D  |   B,C,D
    D  | B,C,D  |   B,D


            5(epsilon*)
   -------+---------------------
       A  |  B,C,D  
   B,C,D  |  B,C,D

    {A} ---- 5 -----> {B,C,D}
                      {B,C,D} --- 5 ---> {B,C,D}
Note: {A} is start state and {B,C,D} is accepting state

Conclusions:

DFA of REGX_A|REGX_B identical to DFA of REGX_A 
      -- implies REGEX_A is subset of REGEX_B
DFA of REGX_A|REGX_B is NOT identical to DFA of REGX_B 
      -- cannot infer about either gerexes.
Fryer answered 3/8, 2012 at 12:18 Comment(0)
I
6

In addition to antinome's answer:

Many of the constructs that are not part of the basic regex definition are still regular, and can be converted after parsing the regex (with a real parser, because the language of regex is not regular itself): (x?) to (x|), (x+) to (xx*), character classes like [a-d] to their corresponding union (a|b|c|d) etc. So one can use these constructs and still test whether one regex matches a subset of the other regex using the DFA comparison mentioned by antinome.

Some constructs, like back references, are not regular, and cannot be represented by NFA or DFA.

Even the seemingly simple problem of testing whether a regex with back references matches a particular string is NP-complete (http://perl.plover.com/NPC/NPC-3COL.html).

Iceboat answered 15/6, 2011 at 21:44 Comment(1)
Allowing back references probably makes the stated problem undecidable, but I have no proof of this.Charland
B
2
pip3 install https://github.com/leafstorm/lexington/archive/master.zip
python3
>>> from lexington.regex import Regex as R
>>> from lexington.regex import Null
>>> from functools import reduce
>>> from string import ascii_lowercase, digits
>>> a_z = reduce(lambda a, b: a | R(b), ascii_lowercase, Null)
>>> b_x = reduce(lambda a, b: a | R(b), ascii_lowercase[1:-2], Null)
>>> a_z | b_x == a_z
True
>>> m_n = R("m") | R("n")
>>> zero_nine = reduce(lambda a, b: a | R(b), digits, Null)
>>> m_n | zero_nine == m_n
False

Also tested successfully with Python 2. See also how to do it with a different library.

Alternatively, pip3 install https://github.com/ferno/greenery/archive/master.zip and:

from greenery.lego import parse as p
a_z = p("[a-z]")
b_x = p("[b-x]")
assert a_z | b_x == a_z
m_n = p("m|n")
zero_nine = p("[0-9]")
assert not m_n | zero_nine == m_n
Babylon answered 20/8, 2015 at 11:10 Comment(0)
I
1

It's possible with the string representation of a regex, since any string can be matched with regexes, but not with the compiled version returned by re.compile. I don't see what use this would be, though. Also, it takes a different syntax.

Edit: you seem to be looking for the ability to detect whether the language defined by an RE is a subset of another RE's. Yes, I think that's possible, but no, Python's re module doesn't do it.

Ibson answered 15/6, 2011 at 19:51 Comment(1)
I think the idea the OP is going for here is "pattern equality", though from the example given, what the OP considers True is not truly equality, as the second pattern matches a subset of the first.Lodestone
M
1

You should do something along these lines:

re.match("\[[b-x]-[b-x]\]", "[a-z]")

The regular expression has to define what the string should look like. If you want to match an opening square bracket followed by a letter from b to x, then a dash, then another letter from b to x and finally a closing square bracket, the solution above should work.

If you intend to validate that a regular expression is correct you should consider testing if it compiles instead.

Mistassini answered 15/6, 2011 at 19:56 Comment(1)
Matching the actual expression strings is too hacky (every expression needs manual work), I'd rather use a library that implements real regular expressions, like in my answer.Babylon
A
0

Some clarification is required, I think:

.

rA = re.compile('(?<! )[a-z52]+')

'(?<! )[a-z52]+' is a pattern

rA is an instance of class RegexObject whose type is < *type '_sre.SRE_Pattern' >* .

Personally I use the term regex exclusively for this kind of objects, not for the pattern .

Note that rB = re.compile(rA) is also possible, it produces the same object (id(rA) == id(rB) equals to True)


ch = 'lshdgbfcs luyt52uir bqisuytfqr454'
x = rA.match(ch) 
# or
y = rA.search(ch)

x and y are instances of the class MatchObject whose type is *< type '_sre.SRE_Match' >*


.

That said, you want to know if there a way to determine if a regex rA can match all the strings matched by another regex rB while rB matches only a subsest of all the strings matched by rA.

I don't think such a way exists, whatever theoretically or practically.

Apophthegm answered 16/6, 2011 at 0:50 Comment(1)
If you restrict your self to actual mathematical regular expressions, it does work in theory and practice! See my answer!Babylon

© 2022 - 2024 — McMap. All rights reserved.