Efficient algorithm for converting a character set into an NFA/DFA
I'm currently working on a scanner generator. The generator already works fine, but when using character classes the algorithm gets very slow.

The scanner generator produces a scanner for UTF-8 encoded files. The full range of characters (0x000000 to 0x10ffff) should be supported.

If I use large character sets, like the any operator '.' or the Unicode property {L}, the NFA (and also the DFA) contains a lot of states (> 10,000), so the conversion from NFA to DFA and the construction of the minimal DFA take a long time (even though the resulting minimal DFA contains only a few states).

Here's my current implementation for creating the character set part of the NFA.

void CreateNfaPart(int startStateIndex, int endStateIndex, Set<int> characters)
{
    transitions[startStateIndex] = CreateEmptyTransitionsArray();
    foreach (int character in characters)
    {
        // Get the UTF-8 encoded bytes for the character.
        byte[] encoded = EncodingHelper.EncodeCharacter(character);
        int tStartStateIndex = startStateIndex;
        // Walk (or create) the chain of states for all bytes but the last.
        for (int i = 0; i < encoded.Length - 1; i++)
        {
            int tEndStateIndex = transitions[tStartStateIndex][encoded[i]];
            if (tEndStateIndex == -1)
            {
                tEndStateIndex = CreateState();
                transitions[tEndStateIndex] = CreateEmptyTransitionsArray();
            }
            transitions[tStartStateIndex][encoded[i]] = tEndStateIndex;
            tStartStateIndex = tEndStateIndex;
        }
        // The last byte of each sequence leads to the shared end state.
        transitions[tStartStateIndex][encoded[encoded.Length - 1]] = endStateIndex;
    }
}

Does anyone know how to implement this function much more efficiently, so that only the necessary states are created?

EDIT:

To be more specific, I need a function like:

List<Set<byte>[]> Convert(Set<int> characters)
{
     ???????
}

A helper function that converts a character (int) to its UTF-8 encoding byte[] is defined as:

byte[] EncodeCharacter(int character)
{ ... }
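
For reference, a standard single-codepoint UTF-8 encoder might look like the sketch below (an assumption about what EncodingHelper does; the actual implementation isn't shown here):

static byte[] EncodeCharacter(int c)
{
    // 1-byte sequence: U+0000..U+007F
    if (c < 0x80) return new[] { (byte)c };
    // 2-byte sequence: U+0080..U+07FF
    if (c < 0x800) return new[] {
        (byte)(0xC0 | (c >> 6)),
        (byte)(0x80 | (c & 0x3F)) };
    // 3-byte sequence: U+0800..U+FFFF
    if (c < 0x10000) return new[] {
        (byte)(0xE0 | (c >> 12)),
        (byte)(0x80 | ((c >> 6) & 0x3F)),
        (byte)(0x80 | (c & 0x3F)) };
    // 4-byte sequence: U+10000..U+10FFFF
    return new[] {
        (byte)(0xF0 | (c >> 18)),
        (byte)(0x80 | ((c >> 12) & 0x3F)),
        (byte)(0x80 | ((c >> 6) & 0x3F)),
        (byte)(0x80 | (c & 0x3F)) };
}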
Slough answered 21/8, 2010 at 19:13 Comment(6)
You are building an xFA for the byte input? Wouldn't it be a lot easier (and more reliable) to operate on (UTF-16) chars?Anterior
I don't think so, the size of the lookup table(s) would increase when using 16-bit characters. Also the typical input file would be bigger when using UTF-16 (in comparison with UTF-8).Slough
I'm sorry, I misunderstood! Accepting any encoding would be a nice option for a future version. But to keep it simple, I think it's easier to implement only one encoding, and UTF-8 looks like the right choice for me.Slough
But then I get lookup tables with entries for all 0x10ffff characters. How would I implement the transition table then?Slough
You need a 'smart' way to handle char sets. One simple idea is to have one 0..0x10ffff lookup table (big but possible) to find the set number for each char.Anterior
You need to encode the transitions of your automaton symbolically, e.g., as Ian mentioned, by using boundaries. Perhaps the brics automaton library can give you a hint!?Tedric
There are a number of ways to handle it. They all boil down to treating sets of characters at a time in the data structures, instead of ever enumerating the entire alphabet. It's also how you make scanners for Unicode in a reasonable amount of memory.

You have many choices about how to represent and process sets of characters. I'm presently working with a solution that keeps an ordered list of boundary conditions and corresponding target states. You can process operations on these lists much faster than you could if you had to scan the entire alphabet at each juncture. In fact, it's fast enough that it runs in Python with acceptable speed.
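
To illustrate the idea (a minimal C# sketch to match the question's code, not the Python implementation described above): the transitions out of one state become a sorted array of range start points with a target state per range, and a lookup is a binary search instead of an indexed read from a 0x110000-entry table.

using System;

// boundaries[i] is the first code point of range i; targets[i] is the state
// reached on any code point c with boundaries[i] <= c < boundaries[i+1].
// Example: no transition below 'A', state 1 for 'A'..'Z', nothing above.
int[] boundaries = { 0x000000, 'A', 'Z' + 1 };
int[] targets    = { -1,       1,   -1      };

int NextState(int codepoint)
{
    int i = Array.BinarySearch(boundaries, codepoint);
    if (i < 0) i = ~i - 1;   // index of the range containing codepoint
    return targets[i];
}

Set operations such as union and intersection can then be done by merging two boundary lists in a single linear pass, which is where the speedup over scanning the whole alphabet comes from.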

Hierogram answered 24/8, 2010 at 16:4 Comment(0)
I'll clarify what I think you're asking for: to take the union of a set of Unicode codepoints and produce a state-minimal DFA whose transitions represent the UTF-8 encoded sequences of those codepoints.

When you say "more efficiently", that could apply to runtime, memory usage, or the compactness of the end result. For finite automata, "minimal" usually means using the fewest states to describe any given language, which is what you're getting at by "create only the necessary states".

Every finite automaton has exactly one equivalent state-minimal DFA (see the Myhill-Nerode theorem [1], or Hopcroft & Ullman [2]). For your purposes, we can construct this minimal DFA directly using the Aho-Corasick algorithm [3].

To do this, we need a mapping from Unicode codepoints to their corresponding UTF8 encodings. There's no need to store all of these UTF8 byte sequences in advance; they can be encoded on the fly. The UTF8 encoding algorithm is well documented and I won't repeat it here.

Aho-Corasick works by first constructing a trie. In your case this would be a trie of each UTF-8 sequence added in turn. Then the trie is annotated with failure transitions, turning it into a DAG, per the rest of the algorithm. There's a nice overview of the algorithm here, but I suggest reading the paper itself.

Pseudocode for this approach:

trie = empty
foreach codepoint in input_set:
   bytes = utf8_encode(codepoint)
   trie = trie_add_key(trie, bytes)
dfa = add_failure_edges(trie)  # per the rest of AC
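
As a concrete illustration, the trie-building step could look like this in C# (a sketch only; TrieNode is a hypothetical type, and EncodeCharacter is the question's helper):

using System.Collections.Generic;

class TrieNode
{
    public TrieNode[] Children = new TrieNode[256]; // one slot per byte value
    public bool IsAccepting;                        // end of one encoded sequence
}

static TrieNode BuildTrie(IEnumerable<int> codepoints)
{
    var root = new TrieNode();
    foreach (int codepoint in codepoints)
    {
        byte[] bytes = EncodeCharacter(codepoint);  // UTF-8 encode on the fly
        var node = root;
        foreach (byte b in bytes)
            node = node.Children[b] ??= new TrieNode();
        node.IsAccepting = true;
    }
    return root;
}

Since a UTF-8 sequence is at most four bytes, the trie's depth is bounded by four.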

This approach (forming a trie of UTF-8 encoded sequences, then running Aho-Corasick, then rendering out the DFA) is the approach taken in the implementation of my regexp and finite state machine libraries, where I do exactly this for constructing Unicode character classes.

Other approaches (as mentioned in other answers to this question) include working on codepoints and expressing ranges of codepoints, rather than spelling out every byte sequence.

[1] Nerode, Anil (1958). "Linear Automaton Transformations". Proceedings of the AMS 9. JSTOR 2033204.
[2] Hopcroft, John E.; Ullman, Jeffrey D. (1979). Introduction to Automata Theory, Languages, and Computation. Section 3.4, Theorem 3.10, p. 67.
[3] Aho, Alfred V.; Corasick, Margaret J. (June 1975). "Efficient string matching: an aid to bibliographic search". Communications of the ACM 18 (6): 333-340.

Bourgeon answered 11/12, 2019 at 20:23 Comment(1)
Sorry if this is a noob question. Do we really need to add the failure edges in the above pseudo-code? I mean, we don't need to search for a substring starting somewhere in the input character bytes: once the bytes fail to follow the trie, we know that the input bytes do not belong to the character class. Did I miss something?Rubious
Look at what regular expression libraries like Google RE2 and TRE are doing.

Vandusen answered 22/8, 2010 at 20:3 Comment(1)
I think Google RE2 does the kind of thing I need, but it's very complex... I found some interesting code at code.google.com/p/re2/source/browse/re2/compile.cc (starting at line 559)Slough
I had the same problem with my scanner generator, so I came up with the idea of replacing intervals by ids, which are determined using an interval tree. For instance, the a..z range in a DFA can be represented as 97, 98, 99, ..., 122; instead, I represent ranges as [97, 122], then build an interval tree structure out of them, so at the end they are represented as ids referring into the interval tree. Given the following RE: a..z+, we end up with such a DFA:

0 -> a -> 1
0 -> b -> 1
0 -> c -> 1
0 -> ... -> 1
0 -> z -> 1

1 -> a -> 1
1 -> b -> 1
1 -> c -> 1
1 -> ... -> 1
1 -> z -> 1
1 -> E -> ACCEPT

Now compress intervals:

0 -> a..z -> 1

1 -> a..z -> 1
1 -> E -> ACCEPT

Extract all intervals from your DFA and build an interval tree out of them:

{
    "left": null,
    "middle": {
        "id": 0,
        "interval": ["a", "z"]
    },
    "right": null
}

Replace the actual intervals with their ids:

0 -> 0 -> 1
1 -> 0 -> 1
1 -> E -> ACCEPT
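
A minimal C# sketch of the lookup this scheme relies on (a sorted interval array with binary search stands in for the interval tree; the names are illustrative):

// Sorted, non-overlapping intervals; the id is what the DFA transitions
// refer to instead of individual characters.
(int Lo, int Hi, int Id)[] intervals = { ('a', 'z', 0) };

int IntervalId(int c)
{
    int lo = 0, hi = intervals.Length - 1;
    while (lo <= hi)
    {
        int mid = (lo + hi) / 2;
        if (c < intervals[mid].Lo)      hi = mid - 1;
        else if (c > intervals[mid].Hi) lo = mid + 1;
        else return intervals[mid].Id;  // c falls inside this interval
    }
    return -1;                          // no interval contains c
}

The transition tables then shrink from one entry per character to one entry per interval id.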
Borlase answered 16/5, 2013 at 8:4 Comment(0)
In this library (http://mtimmerm.github.io/dfalex/) I do it by putting a range of consecutive characters on each transition, instead of single characters. This is carried through all the steps of NFA construction, NFA->DFA conversion, DFA minimization, and optimization.

It's quite compact, but it adds code complexity to every step.
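
As a generic illustration of the idea in C# (this is not the dfalex API; the types here are hypothetical):

// A transition labelled with an inclusive character range instead of a
// single character, e.g. state 0 --['a'..'z']--> state 1.
record RangeTransition(int LowChar, int HighChar, int Target);

// A character c takes the transition whose range contains it, if any.
static int Next(RangeTransition[] outgoing, int c)
{
    foreach (var t in outgoing)
        if (t.LowChar <= c && c <= t.HighChar)
            return t.Target;
    return -1; // no range matched: dead state
}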

Wept answered 23/7, 2016 at 21:31 Comment(0)
My https://metacpan.org/pod/Unicode::SetAutomaton module implements this. Note that regular expressions or automata will usually repeat large sets multiple times; say, \w+\W+\w+[\w\d]* has three instances of \w (four if you count \W as its complement), and if the DFA for \w is large, you probably do not want to make multiple copies of it. Accordingly, my module partitions the set of Unicode scalar values so that each scalar value belongs to exactly one partition, and then computes a DFA where each accepting state corresponds to such a partition. You can use that DFA to turn the input bytes into partitions, and then define higher-level transitions over these partitions, saving a lot of space in the higher-level automaton.
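
To make the two-level idea concrete, here is a heavily simplified C# sketch (hypothetical throughout: \w is cut down to ASCII letters, and a plain function stands in for the byte-level classifier DFA the module computes):

using System.Collections.Generic;

// Two partitions: 0 = "word" characters, 1 = everything else.
static int PartitionOf(int codepoint) =>
    (codepoint >= 'a' && codepoint <= 'z') ||
    (codepoint >= 'A' && codepoint <= 'Z') ? 0 : 1;

// High-level automaton for \w+\W+ over partition ids instead of code points:
// row = state, column = partition id, -1 = reject.
static readonly int[][] HighLevel =
{
    new[] {  1, -1 }, // state 0: expect \w
    new[] {  1,  2 }, // state 1: more \w, or switch to \W
    new[] { -1,  2 }, // state 2: more \W (accepting)
};

static bool Matches(IEnumerable<int> codepoints)
{
    int state = 0;
    foreach (int c in codepoints)
    {
        state = HighLevel[state][PartitionOf(c)];
        if (state < 0) return false;
    }
    return state == 2;
}

The large set is encoded once in the classifier; the high-level automaton only ever sees small partition ids.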

Loggia answered 7/4, 2023 at 18:39 Comment(0)
