DFA minimization

I have a question about DFA minimization. I've used well-known techniques to convert a regular expression into an NFA and then to construct a DFA from it, using the goto/closure (subset construction) algorithm. Now the question is: how do I minimize it? I've watched lectures about it here: http://www.youtube.com/watch?v=T9Z66NF5YRk and I still don't get the point. What is DFA minimization? Is it just merging IDENTICAL states (states that go to the same states on the same characters), or something different?

So, I've started with the following grammar:

%digit = '0'..'9'
%letter = 'a'..'z' | 'A'..'Z'
%exponent = ("e" | "E") ("+" | "-")? digit+

T_INT = digit+
T_FLOAT = T_INT exponent
T_IDENTIFIER = (letter | "$" | "_") (letter | "$" | "_" | digit)*

and ended up with the following DFA (represented as JSON):

{
    "START": [{
        "type": "range",
        "from": 36,
        "to": 36,
        "shift": "1"
    }, {
        "type": "range",
        "from": 48,
        "to": 57,
        "shift": "2"
    }, {
        "type": "range",
        "from": 65,
        "to": 90,
        "shift": "1"
    }, {
        "type": "range",
        "from": 95,
        "to": 95,
        "shift": "1"
    }, {
        "type": "range",
        "from": 97,
        "to": 122,
        "shift": "1"
    }],
    "1": [{
        "type": "range",
        "from": 36,
        "to": 36,
        "shift": "1"
    }, {
        "type": "range",
        "from": 48,
        "to": 57,
        "shift": "1"
    }, {
        "type": "range",
        "from": 65,
        "to": 90,
        "shift": "1"
    }, {
        "type": "range",
        "from": 95,
        "to": 95,
        "shift": "1"
    }, {
        "type": "range",
        "from": 97,
        "to": 122,
        "shift": "1"
    }, {
        "shift": ["t_identifier"]
    }],
    "2": [{
        "type": "range",
        "from": 48,
        "to": 57,
        "shift": "2"
    }, {
        "type": "range",
        "from": 69,
        "to": 69,
        "shift": "3"
    }, {
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "3"
    }, {
        "shift": ["t_int"]
    }],
    "3": [{
        "type": "range",
        "from": 43,
        "to": 43,
        "shift": "5"
    }, {
        "type": "range",
        "from": 45,
        "to": 45,
        "shift": "5"
    }, {
        "type": "range",
        "from": 48,
        "to": 57,
        "shift": "4"
    }],
    "4": [{
        "type": "range",
        "from": 48,
        "to": 57,
        "shift": "4"
    }, {
        "shift": ["t_float"]
    }],
    "5": [{
        "type": "range",
        "from": 48,
        "to": 57,
        "shift": "4"
    }]
}

So how do I minimize it?

UPDATE:

Okay, so here is my algorithm. Given the following DFA:

{
    0: [{
        from: 97,
        to: 97,
        shift: 1
    }],
    1: [{
        from: 97,
        to: 97,
        shift: 3
    }, {
        from: 98,
        to: 98,
        shift: 2
    }],
    2: [{
        from: 98,
        to: 98,
        shift: 4
    }],
    3: [{
        from: 97,
        to: 97,
        shift: 3
    }, {
        from: 98,
        to: 98,
        shift: 4
    }],
    4: [{
        from: 98,
        to: 98,
        shift: 4
    }]
}

This is what I do to minimize it:

  1. For each state (numbered 0, 1, 2, 3, 4 in my example), obtain a unique hash that identifies that state (for example, for state 0 this will be: from=97,to=97,shift=1, for state 1 this will be: from=97,to=97,shift=3&from=98,to=98,shift=2 and so on…)

  2. Compare the obtained hashes and, if we find two identical ones, merge them. In my example, the state 2 hash will be: from=98&shift=4&to=98, and the state 4 hash will be: from=98&shift=4&to=98; they are the same, so we can merge them into a new state 5, after which the DFA will look like this:

    {
    0: [{
        from: 97,
        to: 97,
        shift: 1
    }],
    1: [{
        from: 97,
        to: 97,
        shift: 3
    }, {
        from: 98,
        to: 98,
        shift: 5
    }],
    3: [{
        from: 97,
        to: 97,
        shift: 3
    }, {
        from: 98,
        to: 98,
        shift: 5
    }],
    5: [{
        from: 98,
        to: 98,
        shift: 5
    }]
    }

  3. Continue this until we get no changes, so the next step (merging states 1 and 3) will transform the DFA into the following form:

    {
    0: [{
        from: 97,
        to: 97,
        shift: 6
    }],
    6: [{
        from: 97,
        to: 97,
        shift: 6
    }, {
        from: 98,
        to: 98,
        shift: 5
    }],
    5: [{
        from: 98,
        to: 98,
        shift: 5
    }]
    }

  4. There are no more identical states, which means we're done (a code sketch of this merging loop follows this list).
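
Roughly, the merging loop I'm describing is the following (just a sketch in TypeScript over an object-as-map DFA like my JSON dumps; it reuses the name of the first state of a pair instead of introducing a fresh one such as state 5, which amounts to the same thing):

    type Transition = { from: number; to: number; shift: string };
    type Dfa = Record<string, Transition[]>;

    // Hash of a state: its outgoing transitions, normalized and concatenated.
    // (Accepting markers, if present, would have to be part of the hash too.)
    function stateHash(transitions: Transition[]): string {
        return transitions
            .map(t => `from=${t.from},to=${t.to},shift=${t.shift}`)
            .sort()
            .join("&");
    }

    // Merge states with identical hashes until nothing changes any more.
    function mergeIdenticalStates(dfa: Dfa): Dfa {
        let changed = true;
        while (changed) {
            changed = false;
            const byHash = new Map<string, string>(); // hash -> first state seen with it
            for (const [state, transitions] of Object.entries(dfa)) {
                const hash = stateHash(transitions);
                const kept = byHash.get(hash);
                if (kept === undefined) {
                    byHash.set(hash, state);
                    continue;
                }
                // Redirect everything that points at `state` to `kept`, then drop `state`.
                for (const ts of Object.values(dfa)) {
                    for (const t of ts) {
                        if (t.shift === state) t.shift = kept;
                    }
                }
                delete dfa[state];
                changed = true;
            }
        }
        return dfa;
    }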

SECOND UPDATE:

Okay, so given the following regular expression: 'a' ('ce')* ('d' | 'xa' | 'AFe')+ | 'b' ('ce')* ('d' | 'xa' | 'AFe')+ 'ce'+ I've got the following DFA (START -> the start state, ["accept"] -> a transition to the accepting state, so to speak):

{
    "START": [{
        "type": "range",
        "from": 98,
        "to": 98,
        "shift": "1.2"
    }, {
        "type": "range",
        "from": 97,
        "to": 97,
        "shift": "17.18"
    }],
    "1.2": [{
        "type": "range",
        "from": 120,
        "to": 120,
        "shift": "10"
    }, {
        "type": "range",
        "from": 100,
        "to": 100,
        "shift": "6.7"
    }, {
        "type": "range",
        "from": 65,
        "to": 65,
        "shift": "8"
    }, {
        "type": "range",
        "from": 99,
        "to": 99,
        "shift": "4"
    }],
    "10": [{
        "type": "range",
        "from": 97,
        "to": 97,
        "shift": "6.7"
    }],
    "6.7": [{
        "type": "range",
        "from": 99,
        "to": 99,
        "shift": "15"
    }, {
        "type": "range",
        "from": 120,
        "to": 120,
        "shift": "13"
    }, {
        "type": "range",
        "from": 100,
        "to": 100,
        "shift": "6.7"
    }, {
        "type": "range",
        "from": 65,
        "to": 65,
        "shift": "11"
    }],
    "15": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "14.accept"
    }],
    "14.accept": [{
        "type": "range",
        "from": 99,
        "to": 99,
        "shift": "16"
    }, {
        "shift": ["accept"]
    }],
    "16": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "14.accept"
    }],
    "13": [{
        "type": "range",
        "from": 97,
        "to": 97,
        "shift": "6.7"
    }],
    "11": [{
        "type": "range",
        "from": 70,
        "to": 70,
        "shift": "12"
    }],
    "12": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "6.7"
    }],
    "8": [{
        "type": "range",
        "from": 70,
        "to": 70,
        "shift": "9"
    }],
    "9": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "6.7"
    }],
    "4": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "2.3"
    }],
    "2.3": [{
        "type": "range",
        "from": 120,
        "to": 120,
        "shift": "10"
    }, {
        "type": "range",
        "from": 100,
        "to": 100,
        "shift": "6.7"
    }, {
        "type": "range",
        "from": 65,
        "to": 65,
        "shift": "8"
    }, {
        "type": "range",
        "from": 99,
        "to": 99,
        "shift": "5"
    }],
    "5": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "2.3"
    }],
    "17.18": [{
        "type": "range",
        "from": 120,
        "to": 120,
        "shift": "25"
    }, {
        "type": "range",
        "from": 100,
        "to": 100,
        "shift": "22.accept"
    }, {
        "type": "range",
        "from": 65,
        "to": 65,
        "shift": "23"
    }, {
        "type": "range",
        "from": 99,
        "to": 99,
        "shift": "20"
    }],
    "25": [{
        "type": "range",
        "from": 97,
        "to": 97,
        "shift": "22.accept"
    }],
    "22.accept": [{
        "type": "range",
        "from": 120,
        "to": 120,
        "shift": "28"
    }, {
        "type": "range",
        "from": 100,
        "to": 100,
        "shift": "22.accept"
    }, {
        "type": "range",
        "from": 65,
        "to": 65,
        "shift": "26"
    }, {
        "shift": ["accept"]
    }],
    "28": [{
        "type": "range",
        "from": 97,
        "to": 97,
        "shift": "22.accept"
    }],
    "26": [{
        "type": "range",
        "from": 70,
        "to": 70,
        "shift": "27"
    }],
    "27": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "22.accept"
    }],
    "23": [{
        "type": "range",
        "from": 70,
        "to": 70,
        "shift": "24"
    }],
    "24": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "22.accept"
    }],
    "20": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "18.19"
    }],
    "18.19": [{
        "type": "range",
        "from": 120,
        "to": 120,
        "shift": "25"
    }, {
        "type": "range",
        "from": 100,
        "to": 100,
        "shift": "22.accept"
    }, {
        "type": "range",
        "from": 65,
        "to": 65,
        "shift": "23"
    }, {
        "type": "range",
        "from": 99,
        "to": 99,
        "shift": "21"
    }],
    "21": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "18.19"
    }]
}

The story is the same: how do I minimize it? If I follow the classical Hopcroft's algorithm, with all the table construction, determining indistinguishable states, merging them together and so on, then I get a DFA that contains 15 states (use this tool: http://regexvisualizer.apphb.com/ with the regular expression a(ce)*(d|xa|AFe)+|b(ce)*(d|xa|AFe)+ce+ to check that). Here is how the DFA looks after minimization with Hopcroft's algorithm:

Hopcroft's minimized DFA

The algorithm that I came up with, after "rethinking" Hopcroft's algorithm, builds a DFA that is smaller than the one you can see above (sorry for the image quality, I had to redraw it step by step to understand why it's smaller):

my algorithm

And here is how it works: the decision about "state equivalence" is based on the result of a hash function that, given a state (for example "START"), builds the short strings that can be constructed from the DFA if we start from that state. Given the DFA above and the START state, we can construct the following strings: 98->120, 98->100, 98->65, 98->99, 97->120, 97->100, 97->65, 97->99, so let that be the result of the hash function for the START state. If we run this function for each state in the DFA, we will see that for some states the function gives identical results ("1.2", "6.7", "2.3" AND "10", "13" AND "15", "16" AND "11", "8", "26", "23" AND "12", "9", "4", "5", "20", "21" AND "17.18", "18.19" AND "25", "28" AND "27", "24"), so all we need to do is merge those states together.
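
In code, this hash function is roughly the following (only a sketch over my JSON format; the type names are just for illustration):

    type Edge = { from?: number; to?: number; shift: string | string[] };
    type Dfa = Record<string, Edge[]>;

    // Hash of a state: the set of two-character strings (represented by the
    // "from" codes of two consecutive transitions) that start at this state.
    function twoMoveHash(dfa: Dfa, state: string): string {
        const pairs = new Set<string>();
        for (const first of dfa[state] ?? []) {
            if (typeof first.shift !== "string") continue; // skip accepting markers
            for (const second of dfa[first.shift] ?? []) {
                if (second.from === undefined) continue;   // skip accepting markers
                pairs.add(`${first.from}->${second.from}`);
            }
        }
        return [...pairs].sort().join(",");
    }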

I see that I'm wrong somewhere, but I don't understand: what is wrong with the minimized DFA produced by my algorithm?

Coyne answered 21/6, 2012 at 5:56 Comment(2)
There is also a discussion going on here: binarysculpting.com/2012/03/21/dfa-state-minimization so I hope at the end to get an answer from somewhere...Coyne
Thanks everyone, my attempts to find a more elegant solution for minimizing the DFA didn't succeed, so I've finally implemented the classical Hopcroft's algorithm, and I'm satisfied with it :)Coyne

Your proposed algorithm does not do a full minimization, because it does not detect complex structures that behave identically. To understand this look at this DFA (drawn by JFLAP):

(image: the example DFA, with states q0 through q3, accepting ac*d|bc*d, drawn by JFLAP)

Minimization would combine q1 and q2, but the outlined algorithm does not manage to.

In contrast to this, Hopcroft's algorithm would initially partition like this:

   {q0, q1, q2}, {q3}

then split the first set, because q0 does not have a transition to q3:

   {q0}, {q1, q2}, {q3}

and not split further, because q1 and q2 behave identically.
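
For what it's worth, here is a rough sketch of the partition-refinement idea behind this (it is the simpler Moore-style refinement loop rather than Hopcroft's worklist optimization; the DFA representation is invented for the example and assumes a complete transition function over reachable states only):

    type State = string;

    interface Dfa {
        states: State[];
        alphabet: string[];
        delta: (s: State, c: string) => State; // total transition function
        accepting: Set<State>;
    }

    // Start from the partition {accepting, non-accepting} and keep splitting
    // blocks whose members reach different blocks on some symbol. The blocks
    // of the final partition are exactly the sets of indistinguishable states.
    function indistinguishabilityClasses(dfa: Dfa): State[][] {
        let blockOf = new Map<State, number>();
        for (const s of dfa.states) blockOf.set(s, dfa.accepting.has(s) ? 1 : 0);
        while (true) {
            // A state's key: its current block plus the blocks it reaches on every symbol.
            const keyOf = (s: State) =>
                [blockOf.get(s), ...dfa.alphabet.map(c => blockOf.get(dfa.delta(s, c)))].join(",");
            const ids = new Map<string, number>();
            const next = new Map<State, number>();
            for (const s of dfa.states) {
                const key = keyOf(s);
                if (!ids.has(key)) ids.set(key, ids.size);
                next.set(s, ids.get(key)!);
            }
            const stable = ids.size === new Set(blockOf.values()).size;
            blockOf = next;
            if (stable) break;
        }
        const groups = new Map<number, State[]>();
        for (const [s, b] of blockOf) {
            if (!groups.has(b)) groups.set(b, []);
            groups.get(b)!.push(s);
        }
        return [...groups.values()];
    }

Each resulting block becomes one state of the minimized DFA, with its transitions taken from any representative of the block.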

Neodymium answered 21/6, 2012 at 11:32 Comment(8)
Hi Gunther, thanks for the tip, you're right, my original algorithm can't handle the merge between q1 and q2, but it can handle it if I make a minor modification: when obtaining the hash of a state, I need to check whether the character goes back to the same state where we found it. So the hash for q1 would be: from=99,to=99,shift=itself and the hash for q2 would be: from=99,to=99,shift=itself. Then it works with this DFA as well :) Do you agree? Can you come up with some DFA that won't be minimized using this modified algorithm?Coyne
No, sorry, it's not as simple as that. Instead of the "c" arc, imagine any arbitrarily complex, yet duplicated subautomaton.Neodymium
I tried that already, and it gives me a minimal DFA even on complex structures such as: (('a'|'b')+('a'|'b') 'c''3'+ '0'..'9' ('a'..'z' | 'A'..'Z')+ '3'+ 'd' 'f')+ (('a'|'b') 'c'*'3'+ '0'..'9' ('a'..'z' | 'A'..'Z')+)? 'f' (testing against the Hopcroft minimization visualizer located here: regexvisualizer.apphb.com). My algorithm gives the same number of states as the minimized DFA produced by that tool. The question remains: how can I prove that my algorithm is identical to Hopcroft's algorithm, i.e. produces a minimal DFA?Coyne
The DFA that I showed above was for ac*d|bc*d, but what about a(ce)*d|b(ce)*d? In that case there will be no "shift=itself" transition.Neodymium
Yeah, you're right :) That means it's not as simple as comparing hashes. I think we can close this one. Thanks everyone! I'll continue my research.Coyne
Hi again, I've decided to reopen this question. The story is that I've implemented Hopcroft's algorithm, and now I can say that I completely understand it. This morning I tried to put it a little bit differently and I got even more minimization than the original. Here is the idea: what if we merge two states based on the equivalence of the two-character strings that can be derived from both of these states? Please see my SECOND UPDATE to this question, where I describe my new algorithm.Coyne
Sorry to see that you are not convinced of Hopcroft's algorithm. There is nothing wrong with striving for new algorithms, but as I understand from your second update, you have proved yourself that your approach does not yield correct results. The first approach seemed correct, but failed to minimize. The second approach sounds as if you are giving up correctness for ease of implementation - certainly not a good idea. What makes you think that states are equivalent by having the same set of valid transitions over two moves? It only takes a third move to show that this does not hold.Neodymium
Yeah, I know that I'm wrong, and I understood my mistake. Thank you Gunther for your help, I'll stick with Hopcroft's implementation for now :)Coyne

Let your original DFA be called M1. To put it in simple terms, constructing a minimized DFA (call it M2) means converting M1 into a DFA that contains the minimum number of states. So, the number of states in M2 will be no more than the number of states in M1. An important point to note here is that M1 and M2 have to be equivalent, which means that they must define the SAME regular language. Constructing a minimized DFA not only involves looking for identical states, but also the following:

  1. Removal of "unreachable" states:
    This involves removing the states that are not reachable from the initial state of the DFA, for any input string.

  2. Removal of "dead" or "trap" states:
    This involves removal of non-accepting states that terminates on itself. They are also called as TRAP states.

  3. Removal of "Nondistinguishable" states:
    This involves removing the states those that cannot be distinguished from one another for any input string.
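
For step 1, a plain graph search from the initial state is enough; a sketch (assuming an adjacency-list style representation, which need not be how your DFA is actually stored):

    // Collect every state reachable from the start state; all remaining states
    // can be removed without changing the accepted language.
    function reachableStates(
        start: string,
        successors: Map<string, string[]> // state -> states reachable in one transition
    ): Set<string> {
        const seen = new Set<string>([start]);
        const stack = [start];
        while (stack.length > 0) {
            const state = stack.pop()!;
            for (const next of successors.get(state) ?? []) {
                if (!seen.has(next)) {
                    seen.add(next);
                    stack.push(next);
                }
            }
        }
        return seen;
    }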

There are also some popular algorithms used for DFA minimization, such as Hopcroft's algorithm, Moore's algorithm, and Brzozowski's algorithm.

It may be worthwhile to go through these algorithms!

Delegacy answered 21/6, 2012 at 9:14 Comment(1)
Thanks, the story is that I came up with something that minimizes my DFA to some extent, and as far as I can see it generates a minimal DFA for the inputs I have. The question is: does it always produce a minimal DFA? Or is it just a bunch of code that does some minimization, where the result is not the minimal possible DFA... Please see my update to the original question, which describes my algorithm.Coyne

Given that you have code to determinize an NFA into a DFA, the simplest way to minimize it is Brzozowski's algorithm. You will need to implement a function to reverse the NFA, but this is fairly simple (reverse the transitions and swap the start and accept states).

Once you have a determinize and reverse function, Brzozowski minimization is implemented as:

minimize(nfa) = determinize(reverse(determinize(reverse(nfa)))) 
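
For illustration, assuming the NFA is stored as a set of start states, a set of accept states and a list of labelled edges (an assumed representation, not necessarily yours), the reverse step and the whole pipeline look like this:

    interface Nfa {
        starts: Set<string>;
        accepts: Set<string>;
        edges: { from: string; symbol: string; to: string }[];
    }

    // Reverse every edge and swap the roles of the start and accept states.
    function reverse(nfa: Nfa): Nfa {
        return {
            starts: new Set(nfa.accepts),
            accepts: new Set(nfa.starts),
            edges: nfa.edges.map(e => ({ from: e.to, symbol: e.symbol, to: e.from })),
        };
    }

    // Brzozowski minimization; `determinize` is the subset construction you already
    // have (its result is a DFA, expressed here in the same Nfa shape).
    function minimize(nfa: Nfa, determinize: (n: Nfa) => Nfa): Nfa {
        return determinize(reverse(determinize(reverse(nfa))));
    }

The intermediate reversed automaton is in general nondeterministic (it may have several start states), which is exactly why the subset construction is applied twice.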

IMHO, this is a very elegant solution.

Preparation answered 26/6, 2013 at 19:40 Comment(0)
