I have a question about DFA minimization. I've used the well-known techniques to convert a regular expression into an NFA and then construct a DFA from it, using the goto/closure algorithm. Now the question is: how do I minimize it? I've watched lectures about it here: http://www.youtube.com/watch?v=T9Z66NF5YRk and I still don't get the point. What is DFA minimization? Is it just merging IDENTICAL states (states that go to the same states on the same characters), or is it something different?
So, I've started with the following grammar:
%digit = '0'..'9'
%letter = 'a'..'z' | 'A'..'Z'
%exponent = ("e" | "E") ("+" | "-")? digit+
T_INT = digit+
T_FLOAT = T_INT exponent
T_IDENTIFIER = (letter | "$" | "_") (letter | "$" | "_" | digit)*
and ended up with the following DFA (represented as JSON):
{
"START": [{
"type": "range",
"from": 36,
"to": 36,
"shift": "1"
}, {
"type": "range",
"from": 48,
"to": 57,
"shift": "2"
}, {
"type": "range",
"from": 65,
"to": 90,
"shift": "1"
}, {
"type": "range",
"from": 95,
"to": 95,
"shift": "1"
}, {
"type": "range",
"from": 97,
"to": 122,
"shift": "1"
}],
"1": [{
"type": "range",
"from": 36,
"to": 36,
"shift": "1"
}, {
"type": "range",
"from": 48,
"to": 57,
"shift": "1"
}, {
"type": "range",
"from": 65,
"to": 90,
"shift": "1"
}, {
"type": "range",
"from": 95,
"to": 95,
"shift": "1"
}, {
"type": "range",
"from": 97,
"to": 122,
"shift": "1"
}, {
"shift": ["t_identifier"]
}],
"2": [{
"type": "range",
"from": 48,
"to": 57,
"shift": "2"
}, {
"type": "range",
"from": 69,
"to": 69,
"shift": "3"
}, {
"type": "range",
"from": 101,
"to": 101,
"shift": "3"
}, {
"shift": ["t_int"]
}],
"3": [{
"type": "range",
"from": 43,
"to": 43,
"shift": "5"
}, {
"type": "range",
"from": 45,
"to": 45,
"shift": "5"
}, {
"type": "range",
"from": 48,
"to": 57,
"shift": "4"
}],
"4": [{
"type": "range",
"from": 48,
"to": 57,
"shift": "4"
}, {
"shift": ["t_float"]
}],
"5": [{
"type": "range",
"from": 48,
"to": 57,
"shift": "4"
}]
}
So how do I minimize it?
UPDATE:
Okay, so here is my algorithm. Given the following DFA:
{
0: [{
from: 97,
to: 97,
shift: 1
}],
1: [{
from: 97,
to: 97,
shift: 3
}, {
from: 98,
to: 98,
shift: 2
}],
2: [{
from: 98,
to: 98,
shift: 4
}],
3: [{
from: 97,
to: 97,
shift: 3
}, {
from: 98,
to: 98,
shift: 4
}],
4: [{
from: 98,
to: 98,
shift: 4
}]
}
This is what I do to minimize it:
For each state (numbered 0, 1, 2, 3, 4 in my example), obtain a unique hash that identifies that state (for example, for state0 this will be: from=97,to=97,shift=1; for state1 this will be: from=97,to=97,shift=3&from=98,to=98,shift=2; and so on…)
Compare the obtained hashes and, if we find two identical ones, merge them. In my example, the state2 hash will be: from=98&shift=4&to=98, and the state4 hash will be: from=98&shift=4&to=98; they are the same, so we can merge them into a new state5, after which the DFA will look like this:
{ 0: [{ from: 97, to: 97, shift: 1 }], 1: [{ from: 97, to: 97, shift: 3 }, { from: 98, to: 98, shift: 5 }], 3: [{ from: 97, to: 97, shift: 3 }, { from: 98, to: 98, shift: 5 }], 5: [{ from: 98, to: 98, shift: 5 }]
}
Continue this until we get no more changes, so the next step (merging states 1 and 3) will transform the DFA into the following form:
{ 0: [{ from: 97, to: 97, shift: 6 }], 6: [{ from: 97, to: 97, shift: 6 }, { from: 98, to: 98, shift: 5 }], 5: [{ from: 98, to: 98, shift: 5 }]
}
There are no more identical states, which means we're done.
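To make the procedure above concrete, here is a rough TypeScript sketch of that hash-and-merge loop (assumptions of mine: state ids and shift targets are treated as strings, there are no accepting entries like ["t_int"] in the transition lists, and merged states keep the first state's id as the representative instead of getting a brand-new number):

type Transition = { from: number; to: number; shift: string };
type Dfa = Record<string, Transition[]>;

function stateHash(transitions: Transition[]): string {
  // Canonical string built only from the state's outgoing transitions.
  return transitions
    .map(t => `from=${t.from},to=${t.to},shift=${t.shift}`)
    .sort()
    .join("&");
}

function mergeIdenticalStates(dfa: Dfa): Dfa {
  let changed = true;
  while (changed) {
    changed = false;
    const byHash = new Map<string, string>(); // hash -> representative state
    const alias = new Map<string, string>();  // merged state -> representative
    for (const [state, transitions] of Object.entries(dfa)) {
      const h = stateHash(transitions);
      const rep = byHash.get(h);
      if (rep !== undefined) {
        alias.set(state, rep); // identical to an earlier state: merge it
        changed = true;
      } else {
        byHash.set(h, state);
      }
    }
    // Drop the merged states and redirect every transition to its representative.
    for (const merged of alias.keys()) delete dfa[merged];
    for (const transitions of Object.values(dfa)) {
      for (const t of transitions) {
        const rep = alias.get(t.shift);
        if (rep !== undefined) t.shift = rep;
      }
    }
  }
  return dfa;
}

Running this on the small DFA above first collapses states 2 and 4, then states 1 and 3, which matches the result I got by hand (up to the names of the merged states).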
SECOND UPDATE:
Okay, so given the following regular expression: 'a' ('ce')* ('d' | 'xa' | 'AFe')+ | 'b' ('ce')* ('d' | 'xa' | 'AFe')+ 'ce'+ I've got the following DFA (START -> the start state, ["accept"] -> a transition to the accepting state, so to speak):
{
"START": [{
"type": "range",
"from": 98,
"to": 98,
"shift": "1.2"
}, {
"type": "range",
"from": 97,
"to": 97,
"shift": "17.18"
}],
"1.2": [{
"type": "range",
"from": 120,
"to": 120,
"shift": "10"
}, {
"type": "range",
"from": 100,
"to": 100,
"shift": "6.7"
}, {
"type": "range",
"from": 65,
"to": 65,
"shift": "8"
}, {
"type": "range",
"from": 99,
"to": 99,
"shift": "4"
}],
"10": [{
"type": "range",
"from": 97,
"to": 97,
"shift": "6.7"
}],
"6.7": [{
"type": "range",
"from": 99,
"to": 99,
"shift": "15"
}, {
"type": "range",
"from": 120,
"to": 120,
"shift": "13"
}, {
"type": "range",
"from": 100,
"to": 100,
"shift": "6.7"
}, {
"type": "range",
"from": 65,
"to": 65,
"shift": "11"
}],
"15": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "14.accept"
}],
"14.accept": [{
"type": "range",
"from": 99,
"to": 99,
"shift": "16"
}, {
"shift": ["accept"]
}],
"16": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "14.accept"
}],
"13": [{
"type": "range",
"from": 97,
"to": 97,
"shift": "6.7"
}],
"11": [{
"type": "range",
"from": 70,
"to": 70,
"shift": "12"
}],
"12": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "6.7"
}],
"8": [{
"type": "range",
"from": 70,
"to": 70,
"shift": "9"
}],
"9": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "6.7"
}],
"4": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "2.3"
}],
"2.3": [{
"type": "range",
"from": 120,
"to": 120,
"shift": "10"
}, {
"type": "range",
"from": 100,
"to": 100,
"shift": "6.7"
}, {
"type": "range",
"from": 65,
"to": 65,
"shift": "8"
}, {
"type": "range",
"from": 99,
"to": 99,
"shift": "5"
}],
"5": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "2.3"
}],
"17.18": [{
"type": "range",
"from": 120,
"to": 120,
"shift": "25"
}, {
"type": "range",
"from": 100,
"to": 100,
"shift": "22.accept"
}, {
"type": "range",
"from": 65,
"to": 65,
"shift": "23"
}, {
"type": "range",
"from": 99,
"to": 99,
"shift": "20"
}],
"25": [{
"type": "range",
"from": 97,
"to": 97,
"shift": "22.accept"
}],
"22.accept": [{
"type": "range",
"from": 120,
"to": 120,
"shift": "28"
}, {
"type": "range",
"from": 100,
"to": 100,
"shift": "22.accept"
}, {
"type": "range",
"from": 65,
"to": 65,
"shift": "26"
}, {
"shift": ["accept"]
}],
"28": [{
"type": "range",
"from": 97,
"to": 97,
"shift": "22.accept"
}],
"26": [{
"type": "range",
"from": 70,
"to": 70,
"shift": "27"
}],
"27": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "22.accept"
}],
"23": [{
"type": "range",
"from": 70,
"to": 70,
"shift": "24"
}],
"24": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "22.accept"
}],
"20": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "18.19"
}],
"18.19": [{
"type": "range",
"from": 120,
"to": 120,
"shift": "25"
}, {
"type": "range",
"from": 100,
"to": 100,
"shift": "22.accept"
}, {
"type": "range",
"from": 65,
"to": 65,
"shift": "23"
}, {
"type": "range",
"from": 99,
"to": 99,
"shift": "21"
}],
"21": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "18.19"
}]
}
The story is the same: how do I minimize it? If I follow the classical Hopcroft algorithm, with all the table construction, determining indistinguishable states, merging them together and so on, then I'm going to get a DFA that contains 15 states (use this tool: http://regexvisualizer.apphb.com/ with the regular expression a(ce)*(d|xa|AFe)+|b(ce)*(d|xa|AFe)+ce+ to check that; a sketch of that refinement step is at the end of this post). Here is how the DFA looks after minimization with Hopcroft's algorithm:
The algorithm that I came up with after "rethinking" Hopcroft's algorithm builds a DFA that is smaller than the one you can see above (sorry for the image quality, I had to redraw it step by step to understand why it's smaller):
And here is how it works: the decision about "state equivalence" is based on the result of a hash function that, given a state (for example "START"), builds the short strings that can be constructed from the DFA if we start from that state. Given the DFA above and the START state, we can construct the following strings: 98->120, 98->100, 98->65, 98->99, 97->120, 97->100, 97->65, 97->99, so let that be the result of the hash function for the START state. If we run this function for each state in the DFA, we will see that for some states this function gives identical results ("1.2", "6.7", "2.3" AND "10", "13" AND "15", "16" AND "11", "8", "26", "23" AND "12", "9", "4", "5", "20", "21" AND "17.18", "18.19" AND "25", "28" AND "27", "24"), so all we need to do is merge those states together.
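To make that concrete, here is a rough TypeScript sketch of such a signature function (the function name and the Edge/Dfa shapes are mine, mirroring the JSON above; accepting entries like { "shift": ["accept"] } are simply skipped):

type Edge = { type?: string; from?: number; to?: number; shift: string | string[] };
type Dfa = Record<string, Edge[]>;

function twoStepSignature(dfa: Dfa, state: string): string {
  const paths = new Set<string>();
  for (const first of dfa[state] ?? []) {
    if (typeof first.shift !== "string") continue;    // accepting marker
    for (const second of dfa[first.shift] ?? []) {
      if (typeof second.shift !== "string") continue; // accepting marker
      // Ranges in this DFA are single characters, so `from` names the symbol.
      paths.add(`${first.from}->${second.from}`);
    }
  }
  return [...paths].sort().join(",");
}

For the START state of the DFA above this produces exactly the eight strings listed.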
I can see that I'm wrong somewhere, but I don't understand what is wrong with the minimized DFA produced by my algorithm.
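For comparison, here is a minimal TypeScript sketch of the partition refinement that the classical algorithm performs (this is Moore's variant; Hopcroft's algorithm is an optimized version of the same refinement). It assumes, for simplicity, a total transition function and a single set of accepting states, neither of which matches my JSON representation exactly:

// States that end up with the same block id are the ones the classical
// algorithm considers indistinguishable and merges.
function minimize(
  states: string[],
  symbols: number[],
  delta: (state: string, symbol: number) => string,
  accepting: Set<string>
): Map<string, number> {
  // Block id per state; start by separating accepting from non-accepting states.
  let block = new Map<string, number>();
  for (const s of states) block.set(s, accepting.has(s) ? 1 : 0);
  let blockCount = 2;
  for (;;) {
    // Signature = own block plus the block reached on every input symbol.
    const signature = (s: string) =>
      [block.get(s), ...symbols.map(c => block.get(delta(s, c)))].join(",");
    // Re-number the blocks: two states stay together only if their signatures match.
    const ids = new Map<string, number>();
    const next = new Map<string, number>();
    for (const s of states) {
      const key = signature(s);
      if (!ids.has(key)) ids.set(key, ids.size);
      next.set(s, ids.get(key)!);
    }
    if (ids.size === blockCount) return next; // partition is stable: done
    blockCount = ids.size;
    block = next;
  }
}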