How to implement regular expression NFA with character ranges?
Asked Answered
C

1

7

When you read such posts as Regex: NFA and Thompson's algorithm everything looks rather straightforward until you realize in real life you need not only direct characters like "7" or "b", but also:

[A-Z]
[^_]
.

namely character classes (or ranges). And thus my question -- how to build NFA using character ranges? Using meta-characters like "not A", "anything else" and then computing overlapping ranges? This would lead to using tree-like structure when using final automaton, instead of just a table.

Update: please assume non-trivial in size (>>256) alphabet.

I am asking about NFA, but later I would like to convert NFA to DFA as well.

Caw answered 24/12, 2013 at 21:43 Comment(6)
Would you clarify your mean of "build NFA using character ranges"Standing
@revo, label the edge by <a,k> meaning using this label if input is j but not if input is z. This is not that hard but having several overlapping such labels (<a,k>, h, <,>) can cause a mess. And I am not a fan of reinventing the wheel, thus I am asking.Caw
It's all in how you represent the edges. For an 8-bit character set, consider a bitmap of 256 bits. If bit n is set then character code n is in the allowed set, for example.Seashore
@tripleee, thank you. I think 8-bit in practice won't fly anymore, Unicode is more likely. But this would mean 8 KB per label (!). Is this approach used in any publicly known program, or is it just theoretical idea?Caw
If you really need to support Unicode, there is a range of other considerations to take into account. See UTS-18 on regex support. In the meantime, maybe only consider supporting UTF-8 in a sensible way for a start.Seashore
The way I did this for my compilers classwhen I was in school was the following:Create States 1 and 2. For each character that allows for transition from State 1 to state 2, create a transition object that holds (State1, State2, char). This definitely doesn't help with variably large ranges though :(Octavla
H
9

The simplest approach would be:

  1. Use segments as labels for transitions in both NFA and DFA. For example, range [a-z] would be reperesented as segment [97, 122]; single character 'a' would become [97,97]; and any character '.' would become [minCode, maxCode].

  2. Each negated range [^a-z] would result in two transitions from starting state to next state. In this example two transitions [minCode, 96] and [123, maxCode] should be created.

  3. When range is represented by enumerating all possible characters [abcz], either transition per character should be created, or the code migh first group characters into ranges to optimize the number of transitions. So the [abcz] would become [a-c]|z. Thus two transitions instead of four.

This should be enough for NFA. However the classical power set construction to transform NFA to DFA will not work when there are transitions with intersecting character ranges. To solve this issue only one additional generalization step is required. Once a set of all input symbols created, in our case it will be a set of segments, it should be transformed into a set of non-intersecting segments. This can be done in time O(n*Log(n)), where n is a number of segments in a set using priority equeue (PQ) in which segments are ordered by the left component. Example:

  Procedure DISJOIN:
   Input  <- [97, 99] [97, 100] [98, 108]
   Output -> [97, 97] [98, 99], [100, 100], [101, 108]

Step 2. To create new transitions from a "set state" the algorithm should be modified as following:

   for each symbol in DISJOIN(input symbols)
     S <- empty set of symbols
     T <- empty "set state"
     for each state in "set state"
      for each transition in state.transitions
        I <- intersection(symbol, transition.label) 
        if (I is not empty)
        {
           Add I to the set S
           Add transition.To to the T
        }       

     for each segement from DISJOIN(S)
        Create transition from "set state" to T

To speed up matching when searching for a transition and input symbol C, transitions per state might be sorted by segments and binary search applied.

Hyphen answered 24/12, 2013 at 21:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.