DFAs vs Regexes when implementing a lexical analyzer?
(I'm just learning how to write a compiler, so please correct me if I make any incorrect claims)

Why would anyone still implement DFAs in code (goto statements, table-driven implementations) when they can simply use regular expressions? As far as I understand, lexical analyzers take in a string of characters and churn out a list of tokens which, in the language's grammar definition, are terminals, making it possible for them to be described by a regular expression. Wouldn't it be easier to just loop over a bunch of regexes, breaking out of the loop on the first match?
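For concreteness, here's roughly what I have in mind, as a minimal Python sketch (the token names and patterns are made up; I realize this takes the first match rather than the longest one, but ordering the patterns seems to handle that for simple cases):

```python
import re

# Hypothetical token patterns, tried in order; putting keywords before
# the identifier pattern resolves that ambiguity.
TOKEN_PATTERNS = [
    ("IF",     re.compile(r"if\b")),
    ("IDENT",  re.compile(r"[A-Za-z_]\w*")),
    ("NUMBER", re.compile(r"\d+")),
    ("PLUS",   re.compile(r"\+")),
    ("SKIP",   re.compile(r"\s+")),
]

def tokenize(source):
    pos = 0
    tokens = []
    while pos < len(source):
        for name, pattern in TOKEN_PATTERNS:
            match = pattern.match(source, pos)
            if match:
                if name != "SKIP":  # discard whitespace
                    tokens.append((name, match.group()))
                pos = match.end()
                break               # restart from the first pattern
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
    return tokens

print(tokenize("if x1 + 42"))
# [('IF', 'if'), ('IDENT', 'x1'), ('PLUS', '+'), ('NUMBER', '42')]
```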

Eschar asked 19/1, 2013 at 22:34 (1 comment)
The main reason is that table-driven DFAs can be easily generated by programs (e.g., lex). – Swoon

You're absolutely right that, in many cases, it's easier to write regular expressions than DFAs. However, a good question to think about is:

How do these regex matchers work?

Most very fast implementations of regex matchers work by compiling down to some type of automaton (either an NFA or a minimum-state DFA) internally. If you wanted to build a scanner that worked by using regexes to describe which tokens to match and then looping through all of them, you could absolutely do so, but internally they'd probably compile to DFAs.

It's extremely rare to see anyone hand-code a DFA for scanning or parsing because it's just so complicated. This is why there are tools like lex or flex, which let you specify the regexes to match and then automatically compile them down to DFAs behind the scenes. That way, you get the best of both worlds: you describe what to match using the more convenient regex notation, but you get the speed and efficiency of DFAs.
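To make the "behind the scenes" part concrete, here's a toy table-driven DFA in Python that recognizes just identifiers and integers. It's a sketch of the structure that lex/flex-generated scanners share (a transition table plus accepting states), not their actual output; the state numbers and helper names here are my own invention:

```python
# Toy table-driven DFA recognizing identifiers ([A-Za-z_]\w*) and
# integers (\d+). States: 0 = start, 1 = in identifier, 2 = in integer.
# Real lex/flex output uses the same idea with much larger, machine-built tables.

def char_class(c):
    if c.isalpha() or c == "_":
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"

TRANSITIONS = {
    (0, "letter"): 1,
    (0, "digit"):  2,
    (1, "letter"): 1,
    (1, "digit"):  1,   # digits are allowed after the first character
    (2, "digit"):  2,
}

ACCEPTING = {1: "IDENT", 2: "NUMBER"}

def longest_match(source, pos):
    """Run the DFA from `pos`, remembering the last accepting state
    reached (maximal munch). Returns (token_name, end_index) or None."""
    state = 0
    last_accept = None
    i = pos
    while i < len(source):
        nxt = TRANSITIONS.get((state, char_class(source[i])))
        if nxt is None:
            break
        state = nxt
        i += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], i)
    return last_accept

print(longest_match("foo42+", 0))  # ('IDENT', 5)
print(longest_match("123abc", 0))  # ('NUMBER', 3)
```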

One more important detail: it is possible to build a single DFA that matches several different regular expressions in parallel. This increases efficiency, since the matching DFA can be run over the input in one pass that concurrently searches for all possible token matches.
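As a sketch of that combined-pattern idea, you can glue all the token regexes into one alternation with named groups and scan in a single pass. (The token names here are made up, and CPython's re module is a backtracking engine rather than a true DFA, but a lex-built DFA, or an engine like RE2, exploits exactly this structure.)

```python
import re

# One master pattern: every token regex becomes a named alternative, so a
# single pass over the input effectively tries all of them "in parallel".
# Note: error handling for characters that match no token is omitted here.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("PLUS",   r"\+"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})"
                             for name, pattern in TOKEN_SPEC))

def scan(source):
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":   # drop whitespace tokens
            yield match.lastgroup, match.group()

print(list(scan("x1 + 42")))
# [('IDENT', 'x1'), ('PLUS', '+'), ('NUMBER', '42')]
```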

Hope this helps!

Bewhiskered answered 19/1, 2013 at 23:14 (3 comments)
Also, regex patterns tend to be slower than a good lexer, and only powerful regex systems can handle things like matching multiple nested pairs of delimiters such as parens. – Siddra
@GuyCoder In a compiler, the parser handles the parentheses, not the lexer. – Yoon
@EJP You're right. I have my head in parser combinators right now and am not thinking lexer/parser. – Siddra
