How to efficiently implement longest match in a lexer generator?

I'm interested in learning how to write a lexer generator like flex. I've been reading "Compilers: Principles, Techniques, and Tools" (the "dragon book"), and I have a basic idea of how flex works.

My initial approach is this: the user will supply a hash map mapping each regex to a token enum. The program will just loop through the regexes one by one in the order given and check whether each one matches the start of the remaining input (I could add a ^ to the beginning of each regex to enforce this). If one matches, I add its token to the list of tokens for the program.
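Here is roughly what I mean, as a minimal sketch using Rust's regex crate (the Token enum, the rule list, and lex_first_match are placeholder names I made up, not a fixed API):

```rust
use regex::Regex;

// Placeholder token kinds for the sketch.
#[derive(Debug, Clone, Copy)]
enum Token {
    If,
    Ident,
    Number,
}

// First-match-wins: try the rules in the order given and take the first hit.
fn lex_first_match(input: &str, rules: &[(Regex, Token)]) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        let mut matched = false;
        for (re, tok) in rules {
            // Each pattern is anchored with ^, so it can only match at the
            // start of the remaining input.
            if let Some(m) = re.find(rest) {
                tokens.push(*tok);
                rest = &rest[m.end()..];
                matched = true;
                break;
            }
        }
        if !matched {
            // Nothing matched here; skip one character (a real lexer would
            // report an error or have a catch-all rule).
            let mut cs = rest.chars();
            cs.next();
            rest = cs.as_str();
        }
    }
    tokens
}

fn main() {
    let rules = vec![
        (Regex::new(r"^if").unwrap(), Token::If),
        (Regex::new(r"^[A-Za-z_][A-Za-z0-9_]*").unwrap(), Token::Ident),
        (Regex::new(r"^[0-9]+").unwrap(), Token::Number),
    ];
    // First-match-wins splits "ifa" into the keyword plus the identifier "a",
    // which is exactly the problem described in my second question below.
    println!("{:?}", lex_first_match("ifa 42", &rules)); // [If, Ident, Number]
}
```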

My first question is, is this the most efficient way to do it? Currently I have to loop through each regex, but in theory I could construct a DFA from all of the regexes combined and step through that more efficiently. However, there will be some overhead from creating this DFA.

My second question is, how would I implement the longest-match tie-breaker, like flex does? I.e., I want to match ifa as a single identifier, and not as the keyword if followed by the letter a. I don't see any efficient way to do this with plain regexes. I think I'd have to loop through all of the regexes, try to match each one, and if more than one matches, take the longest result. However, if I converted the regexes to a DFA (that is, my own DFA data structure), I could keep stepping through the characters until there are no more possible transitions out of the current state. At that point, I can take the last time I passed through an acceptance state as the actual match of a token, since that should be the longest match.
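For the "loop through all of the regexes and take the longest result" idea, the inner step would look something like this (again only a sketch, reusing the anchored rules and Token type from the snippet above):

```rust
// Try every anchored rule at the current position and keep the one that
// consumed the most characters; on a tie, the earlier rule wins, as in flex.
fn longest_match(rest: &str, rules: &[(Regex, Token)]) -> Option<(Token, usize)> {
    let mut best: Option<(Token, usize)> = None;
    for (re, tok) in rules {
        if let Some(m) = re.find(rest) {
            if best.map_or(true, |(_, len)| m.end() > len) {
                best = Some((*tok, m.end()));
            }
        }
    }
    best
}
```

This gives the right answer for ifa, but every rule is run at every token boundary, which is what makes me wonder about the DFA approach.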

Both of my questions point toward writing my own translator from regexes to a DFA. Is that required, or can I do this efficiently with plain regexes (as implemented by a standard library) and still get the longest match?

EDIT: I kept the regex engine I'm using out of this because I wanted a general answer, but I'm using Rust's regex library: http://static.rust-lang.org/doc/master/regex/index.html

Bos answered 1/6, 2014 at 5:32 Comment(1)
Which regex engine are you using? POSIX engines observe the leftmost-longest rule, Perl-like engines generally use the leftmost-first match (so you can make sure it's the longest one by putting the longer submatches first in your regex).Disincline

Timewise, it's much more efficient to compile all the regexes down into a single automaton that matches all patterns in parallel. It might blow up the space usage significantly, though (DFAs can have exponentially many states relative to the pattern sizes), so it's worth investigating whether this will hurt.
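As a point of comparison, the Rust regex crate you mention has RegexSet, which does compile several patterns into one matcher and reports in a single pass which patterns matched, but not where each match ends, so by itself it won't pick the longest lexeme for you. A small illustration (the patterns are just examples):

```rust
use regex::RegexSet;

fn main() {
    // All three anchored patterns are compiled into one multi-pattern matcher.
    let set = RegexSet::new([r"^if", r"^[A-Za-z_][A-Za-z0-9_]*", r"^[0-9]+"]).unwrap();

    // One scan reports every rule that fires at the start of this input...
    let hits: Vec<usize> = set.matches("ifa").iter().collect();
    println!("{:?}", hits); // [0, 1]: both the keyword rule and the identifier rule

    // ...but not the match lengths, so choosing the longest match still needs
    // another step (or a DFA you drive yourself, as sketched below).
}
```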

Typically, the way you'd implement maximal munch (matching the longest string you can) is to run the matching automaton as normal, keeping track of the last position at which the automaton passed through an accepting state. When the automaton enters a dead state and stops, you output the substring from the beginning of the token up through that last accepting position, then jump back in the input to the point right after that match and continue scanning from there. This can be done quite efficiently and with very little slowdown in practice.
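Here's a rough sketch of that loop in Rust, over a toy hand-written DFA that combines just two patterns, the keyword if and lowercase identifiers; the state numbering, the step/accepts helpers, and the Token type are only illustrative choices, not a prescribed construction:

```rust
#[derive(Debug, Clone, Copy)]
enum Token {
    If,
    Ident,
}

// Transition function of the combined DFA; None plays the role of the dead state.
fn step(state: u8, c: char) -> Option<u8> {
    match (state, c) {
        (0, 'i') => Some(1),       // could still be the keyword, already an identifier
        (0, 'a'..='z') => Some(3), // some other identifier
        (1, 'f') => Some(2),       // saw "if"
        (1, 'a'..='z') => Some(3),
        (2, 'a'..='z') => Some(3), // "if" followed by more letters is an identifier
        (3, 'a'..='z') => Some(3),
        _ => None,
    }
}

// Token produced by each accepting state; the start state 0 is not accepting.
fn accepts(state: u8) -> Option<Token> {
    match state {
        2 => Some(Token::If),
        1 | 3 => Some(Token::Ident),
        _ => None,
    }
}

fn lex(input: &str) -> Vec<(Token, String)> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < chars.len() {
        let mut state = 0u8;
        let mut last_accept: Option<(usize, Token)> = None;
        let mut i = pos;
        // Run the DFA until it dies, remembering the last accepting position.
        while i < chars.len() {
            match step(state, chars[i]) {
                Some(next) => {
                    state = next;
                    i += 1;
                    if let Some(tok) = accepts(state) {
                        last_accept = Some((i, tok));
                    }
                }
                None => break,
            }
        }
        match last_accept {
            // Emit the longest match found and resume right after it.
            Some((end, tok)) => {
                tokens.push((tok, chars[pos..end].iter().collect()));
                pos = end;
            }
            // No token starts here; skip the character (real code would report an error).
            None => pos += 1,
        }
    }
    tokens
}

fn main() {
    // Prints [(Ident, "ifa"), (If, "if")]: maximal munch keeps "ifa" whole.
    println!("{:?}", lex("ifa if"));
}
```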

In case it helps, here are some lecture slides on scanning techniques from a compilers course I taught.

Hope this helps!

Ramires answered 1/6, 2014 at 5:44 Comment(3)
But if you use the regex "1|(1*2)" to match the string "1111111.....1", the DFA will run to the end of input without entering a dead state. So it reads to the end of input and jumps back every time. The running time is O(n^2). Can string matching be done in O(n) time using DFA?Yellowweed
You're exactly right, and the last time I taught a compilers course I actually gave that as an exercise! I'm not sure off the top of my head how to make this worst-case efficient, though in practice this is a very fast approach and rarely degenerates.Ramires
I get "Access Forbidden" when trying to retrieve the lecture slides linked in this answer.Colima
