DFAs vs Regexes when implementing a lexical analyzer?
(I'm just learning how to write a compiler, so please correct me if I make any incorrect claims)

Why would anyone still implement DFAs in code (goto statements, table-driven implementations) when they can simply use regular expressions? As far as I understand, lexical analyzers take in a string of characters and churn out a list of tokens which, in the language's grammar definition, are terminals, making it possible for them to be described by a regular expression. Wouldn't it be easier to just loop over a bunch of regexes, breaking out of the loop on the first match?
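For concreteness, here's roughly what I have in mind, as a minimal Python sketch (the token names and patterns are made up; I realize this takes the first match rather than the longest one, but ordering the patterns seems to handle that for simple cases):

```python
import re

# Hypothetical token patterns, tried in order; putting keywords before
# the identifier pattern resolves that ambiguity.
TOKEN_PATTERNS = [
    ("IF",     re.compile(r"if\b")),
    ("IDENT",  re.compile(r"[A-Za-z_]\w*")),
    ("NUMBER", re.compile(r"\d+")),
    ("PLUS",   re.compile(r"\+")),
    ("SKIP",   re.compile(r"\s+")),
]

def tokenize(source):
    pos = 0
    tokens = []
    while pos < len(source):
        for name, pattern in TOKEN_PATTERNS:
            match = pattern.match(source, pos)
            if match:
                if name != "SKIP":  # discard whitespace
                    tokens.append((name, match.group()))
                pos = match.end()
                break               # restart from the first pattern
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
    return tokens

print(tokenize("if x1 + 42"))
# [('IF', 'if'), ('IDENT', 'x1'), ('PLUS', '+'), ('NUMBER', '42')]
```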

Eschar asked 19/1, 2013 at 22:34 (1 comment)
The main reason is that table-driven DFAs can be easily generated by programs (e.g., lex). – Swoon

You're absolutely right that, in many cases, it's easier to write regular expressions than DFAs. However, a good question to think about is:

How do these regex matchers work?

Most very fast implementations of regex matchers work by compiling down to some type of automaton (either an NFA or a minimum-state DFA) internally. If you wanted to build a scanner that worked by using regexes to describe which tokens to match and then looping through all of them, you could absolutely do so, but internally they'd probably compile to DFAs.

It's extremely rare to see anyone hand-code a DFA for scanning or parsing because it's just so complicated. This is why there are tools like lex or flex, which let you specify the regexes to match and then automatically compile them down to DFAs behind the scenes. That way, you get the best of both worlds: you describe what to match using the more convenient regex notation, but you get the speed and efficiency of DFAs.
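To make the "behind the scenes" part concrete, here's a toy table-driven DFA in Python that recognizes just identifiers and integers. It's a sketch of the structure that lex/flex-generated scanners share (a transition table plus accepting states), not their actual output; the state numbers and helper names here are my own invention:

```python
# Toy table-driven DFA recognizing identifiers ([A-Za-z_]\w*) and
# integers (\d+). States: 0 = start, 1 = in identifier, 2 = in integer.
# Real lex/flex output uses the same idea with much larger, machine-built tables.

def char_class(c):
    if c.isalpha() or c == "_":
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"

TRANSITIONS = {
    (0, "letter"): 1,
    (0, "digit"):  2,
    (1, "letter"): 1,
    (1, "digit"):  1,   # digits are allowed after the first character
    (2, "digit"):  2,
}

ACCEPTING = {1: "IDENT", 2: "NUMBER"}

def longest_match(source, pos):
    """Run the DFA from `pos`, remembering the last accepting state
    reached (maximal munch). Returns (token_name, end_index) or None."""
    state = 0
    last_accept = None
    i = pos
    while i < len(source):
        nxt = TRANSITIONS.get((state, char_class(source[i])))
        if nxt is None:
            break
        state = nxt
        i += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], i)
    return last_accept

print(longest_match("foo42+", 0))  # ('IDENT', 5)
print(longest_match("123abc", 0))  # ('NUMBER', 3)
```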

One more important detail: it is possible to build a single DFA that matches several different regular expressions in parallel. This increases efficiency, since the matching DFA can be run over the input in one pass that concurrently searches for all possible token matches.
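As a sketch of that combined-pattern idea, you can glue all the token regexes into one alternation with named groups and scan in a single pass. (The token names here are made up, and CPython's re module is a backtracking engine rather than a true DFA, but a lex-built DFA, or an engine like RE2, exploits exactly this structure.)

```python
import re

# One master pattern: every token regex becomes a named alternative, so a
# single pass over the input effectively tries all of them "in parallel".
# Note: error handling for characters that match no token is omitted here.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("PLUS",   r"\+"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})"
                             for name, pattern in TOKEN_SPEC))

def scan(source):
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":   # drop whitespace tokens
            yield match.lastgroup, match.group()

print(list(scan("x1 + 42")))
# [('IDENT', 'x1'), ('PLUS', '+'), ('NUMBER', '42')]
```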

Hope this helps!

Bewhiskered answered 19/1, 2013 at 23:14 (3 comments)
Also, regex patterns tend to be slower than a good lexer, and only powerful regex systems can handle things like matching multiple nested pairs of delimiters such as parens. – Siddra
@GuyCoder In a compiler, the parser handles the parentheses, not the lexer. – Yoon
@EJP You're right. I have my head in parser combinators right now and am not thinking lexer/parser. – Siddra
