can the compiler feasibly calculate a DFA from a regular expression?
Asked Answered
C

1

7

In modding a closed-source game I'm modifying the machine code at runtime to jmp into my own code. To do this in a generic manner I'm using pattern matching to find the code locations I want to modify. (The patterns consist only of characters/bytes and wildcards where bytes can vary.) By building a deterministic finite automaton from all my patterns I can search in linear time.

However I've found that building the DFA takes more time than actually running it, especially in debug builds (which I certainly want during development), and that's only going to get worse as I add more patterns. But that could easily be done offline. I'm currently thinking about the how; can the compiler do it?

As far as I know it's impossible with constexpr functions since I can't allocate static memory with them. But I have a vague feeling that it should be doable in a type-safe manner with template meta-programming. Or am I likely to run into recursion limits when creating automatons with hundreds or thousands of states?

And regardless of technical possibility, is it reasonable? Or should I rather, say, calculate a source file in a separate build step?

Cyanamide answered 22/12, 2014 at 22:27 Comment(6)
Another thing to consider: you might want to build the DFA for the regexp using custom tool you write and serialize it to some binary file. Then, at runtime you will just need to import the DFA instead of building it from scratch. It should be faster than just building the DFA in runtime.Hedger
I guess I'm just ignorant but why would a DFA constructed at compile-time and knowing its size would need any memory allocation? Sure, each construction would yield a different type but that shouldn't be an issue. I'd suspect that recursion depth are the more likely constraint, though (I haven't tried creating a DFA during compile-time).Include
I'm not sure whether it actually generates a DFA or an NFA, but Boost Xpressive (for only one example) at least converts a regular expression to some sort of FSM at compile time (and a number of others do similar things as well).Galvez
DFA runs in linear time after it is built, but building the DFA takes exponential time in the worst case (since it goes from RE -> NFA ->DFA). cs.stackexchange.com/questions/130/…Paulina
If building the DFA is prohibitively expensive at run-time I doubt that the compiler will be able to construct it successfully at compile-time. Even if it could, I don't see how it would buy you anything. You will still need to rebuild it each time you compile to debug your latest changes -- however. I think you will find using TMP to be slower and more difficult to troubleshoot than doing it at run-time.Rosemare
Expanding on bialpo's idea, rather than serializing and re-importing the DFA at run-time, build the DFA as a stand-alone service. Then you only need to construct it once when the service initializes, and you are free to make as many service requests (via some API you define) as you want after that. You can re-compile and re-run the other parts of your program cheaply, while the DFA service runs in the background.Rosemare
E
2

Yes, this is possible. The construction can be done with one of the standard algorithms such as Thompson's construction algorithm to get an NFA and then building an DFA from that. The problem is that when converting a NFA to a DFA an exponential blowup in the number of states is possible.

How to deal with the required recursion depth is discussed in the answers to this question.

It is possible to implement the algorithm using template metaprogramming. A list of basic building blocks can be found here, which allows you to store values, implement branches and functions.

Here is an example for a function from the linked page:

template<int X, int Y>
struct Adder
{
   enum { result = X + Y };
};

This is a function that adds its two parameters and stores the result in the result enum member. You can call this at compile time with something like Adder<1, 2>::result, which will be expanded at compile time and act exactly like a literal 3 in your program.

Since Thompson's algorithm relies on recursion, here an example for evaluating a recursion:

template <unsigned n>
struct factorial
{
  enum { value = n * factorial<n-1>::value };
};

This implements a factorial at compile time. This could then be used during runtime like this factorial<5>::value.

Exceedingly answered 27/2, 2015 at 16:6 Comment(1)
I just don't understand how I'd represent and calculate an arbitrarily large set of states pointing to each other in TMP. But I'm probably not going to do this with TMP, instead caching the result of an online computation in a file, that's much simpler.Cyanamide

© 2022 - 2024 — McMap. All rights reserved.