The simplest approach would be:
Use segments as labels for transitions in both NFA and DFA. For example, range [a-z] would be reperesented as segment [97, 122]
; single character 'a' would become [97,97]
; and any character '.' would become [minCode, maxCode]
.
Each negated range [^a-z] would result in two transitions from starting state to next state. In this example two transitions [minCode, 96]
and [123, maxCode]
should be created.
When range is represented by enumerating all possible characters [abcz], either transition per character should be created, or the code migh first group characters into ranges to optimize the number of transitions. So the [abcz] would become [a-c]|z
. Thus two transitions instead of four.
This should be enough for NFA. However the classical power set construction to transform NFA to DFA will not work when there are transitions with intersecting character ranges.
To solve this issue only one additional generalization step is required. Once a set of all input symbols created, in our case it will be a set of segments, it should be transformed into a set of non-intersecting segments. This can be done in time O(n*Log(n)), where n is a number of segments in a set using priority equeue (PQ) in which segments are ordered by the left component. Example:
Procedure DISJOIN:
Input <- [97, 99] [97, 100] [98, 108]
Output -> [97, 97] [98, 99], [100, 100], [101, 108]
Step 2. To create new transitions from a "set state" the algorithm should be modified as following:
for each symbol in DISJOIN(input symbols)
S <- empty set of symbols
T <- empty "set state"
for each state in "set state"
for each transition in state.transitions
I <- intersection(symbol, transition.label)
if (I is not empty)
{
Add I to the set S
Add transition.To to the T
}
for each segement from DISJOIN(S)
Create transition from "set state" to T
To speed up matching when searching for a transition and input symbol C, transitions per state might be sorted by segments and binary search applied.
<a,k>
meaning using this label if input isj
but not if input isz
. This is not that hard but having several overlapping such labels (<a,k>
,h
,<,>
) can cause a mess. And I am not a fan of reinventing the wheel, thus I am asking. – Caw