direct-coded vs table-driven lexer?

I'm new to the world of compiler construction. I want to know: what are the differences between a direct-coded and a table-driven lexical analyzer?

Please use a simple source code example if possible.

Thanks.

Edit:

In the book Engineering a Compiler, the authors divide lexers into three types: table-driven, direct-coded, and hand-coded.

Arredondo answered 4/1, 2015 at 7:49 Comment(7)
Table-driven is the approach based on a lookup table, as described in the answer below. Direct-coded is the approach that codes the DFA directly; it's the common one. And hand-coded lexers are those with fixed transitions, which is viable for a small language. – Abbott
Fortunately I happen to have the Torczon book. Let me take a look. – Uncloak
Alright, I've edited my post with an example of a direct-coded (as in the Torczon book) lexer as well. – Uncloak
@Noobs - Most welcome, buddy! And cheers. – Abbott
@shekharsuman lex/flex use a table-driven approach, while quex uses direct coding: en.wikipedia.org/wiki/Lexical_analysis#Lexer_generator – Arredondo
@shekharsuman I don't know, I just read that on Wikipedia; see this link, which says "...The lex/flex family of generators uses a table-driven approach which is much less efficient than the directly coded approach...". Anyway, it doesn't matter, I will not use them :-) – Arredondo
@Noobs - Yeah, they seem to use a table-driven approach. THANKS. – Abbott

I shall assume that by "direct-coded" you mean a hand-written lexer rather than one produced as the output of a lexer generator. In that case... (See the edit below.)

... a table-driven lexer is a (usually automatically generated) program that uses some kind of lookup table to determine which action to take. Consider the finite automaton that corresponds to the regular expression ab*a (intentionally not minimized):

[Figure: DFA for ab*a. State 0 is the start state and state 2 is accepting; 0 goes to 1 on 'a', 1 goes to 2 on 'a' and to 3 on 'b', and 3 goes to 2 on 'a' and back to itself on 'b'.]

If we limited ourselves to only considering characters 'a' and 'b', we could encode it in a lookup table like so:

#define REJECT -1

/* This table encodes the transitions for a given state and character. */
const int transitions[][2] = {
    /* In state 0, if we see an a then go to state 1 (the 1).
     * Otherwise, reject input.
     */
    { /*a*/  1,  /*b*/  REJECT },
    { /*a*/  2,  /*b*/  3      },
    { /*a*/ REJECT, /*b*/ REJECT }, /* State 2 accepts; never consulted. */
    { /*a*/  2,  /*b*/  3      }
};

/* This table determines, for each state, whether it is an accepting state. */
const int accept[] = { 0, 0, 1, 0 };

Now we just need a driver that actually uses the table:

#include <stdio.h> /* For getchar() and fprintf(). */

int scan(void) {
    int ch; /* int rather than char, since getchar() returns an int. */
    int state = 0;

    while (!accept[state]) {
        ch = getchar() - 'a'; /* Adjust so that a => 0, b => 1. */
        if (transitions[state][ch] == REJECT) {
            fprintf(stderr, "invalid token!\n");
            return 0; /* Fail. */
        } else {
            state = transitions[state][ch];
        }
    }
    return 1; /* Success! */
}

Of course, in a real lexer every token would have a corresponding accepting state, and you would have to somehow modify the accept table to contain the token identifier (one way is sketched after the list below). I want to stress two things, though:

  1. A table-driven lexer doesn't necessarily operate on the level of DFA states.
  2. I don't recommend writing table-driven lexers by hand as it is a tedious and error-prone process.
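
To make the accept-table remark concrete, here is a minimal sketch that maps each state of the ab*a automaton to a token identifier instead of a boolean. The TOK_* names are invented for illustration; they come from neither the book nor the code above:

#define TOK_NONE 0 /* Marks a non-accepting state. */
#define TOK_ABA  1 /* Hypothetical token id for an ab*a match. */

/* accept_token[s] is the token recognized in state s,
 * or TOK_NONE if s is not an accepting state. */
const int accept_token[] = { TOK_NONE, TOK_NONE, TOK_ABA, TOK_NONE };

On acceptance, the driver would then return accept_token[state] instead of a bare 1.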

A hand-written (for lack of a better name) lexer doesn't usually use a lookup table. Suppose we want a lexer for a simple Lisp-like language that has parentheses, identifiers and decimal integers:

#include <ctype.h> /* For isalpha(), isdigit() and isspace(). */
#include <stdio.h> /* For getchar(), ungetc() and printf(). */

enum token {
    ERROR,
    LPAREN,
    RPAREN,
    IDENT,
    NUMBER
};

int first_nonblank(void);
enum token ident(void);
enum token number(void);

enum token scan(void) {
    /* Consume all leading whitespace. */
    int ch = first_nonblank();
    if (ch == '(') return LPAREN;
    else if (ch == ')') return RPAREN;
    else if (isalpha(ch)) return ident();
    else if (isdigit(ch)) return number();
    else {
        printf("invalid token!\n");
        return ERROR;
    }
}

int first_nonblank(void) {
    int ch;
    do {
        ch = getchar();
    } while (isspace(ch));
    return ch;
}

enum token ident(void) {
    int ch;
    do {
        ch = getchar();
    } while (isalpha(ch));
    ungetc(ch, stdin); /* Put back the first non-alphabetic character. */
    return IDENT;
}

enum token number(void) {
    int ch;
    do {
        ch = getchar();
    } while (isdigit(ch));
    ungetc(ch, stdin); /* Put back the first non-digit. */
    return NUMBER;
}

Like the table-driven lexer example, this one is not complete. For one thing, it needs some kind of buffering to store the characters that make up tokens like IDENT and NUMBER (a sketch follows below). For another, it doesn't handle EOF particularly gracefully. But hopefully you get the gist of it.
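
To illustrate the buffering point, here is a minimal sketch of an ident() that records the lexeme as it scans. It assumes the enum token and IDENT from the code above; the lexeme buffer and the extra parameter are inventions for this example:

#define MAX_LEXEME 256

char lexeme[MAX_LEXEME]; /* Spelling of the most recent IDENT/NUMBER. */
int lexeme_len;

enum token ident(int first) {
    int ch = first; /* scan() passes along the character it consumed. */
    lexeme_len = 0;
    while (isalpha(ch) && lexeme_len < MAX_LEXEME - 1) {
        lexeme[lexeme_len++] = (char) ch;
        ch = getchar();
    }
    lexeme[lexeme_len] = '\0'; /* NUL-terminate the stored lexeme. */
    ungetc(ch, stdin); /* Put back the first non-alphabetic character. */
    return IDENT;
}

scan() would call ident(ch) instead of ident(), and the parser can then read the identifier's spelling out of lexeme.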


Edit: Based on the definition in Engineering a Compiler, a direct-coded lexer is basically a hybrid of the two. Rather than using a table, we use code labels to represent states. Let's see how that would look using the same DFA as before.

int scan(void) {
    int ch; /* int, so that EOF from getchar() is representable. */
    /* error() is assumed to be an error-reporting helper defined elsewhere. */

state0:
    ch = getchar();
    if (ch == 'a') goto state1;
    else { error(); return 0; }

state1:
    ch = getchar();
    if (ch == 'a') goto state2;
    else if (ch == 'b') goto state3;
    else { error(); return 0; }

state2:
    return 1; /* Accept! */

state3:
    ch = getchar();
    if (ch == 'a') goto state2;
    else if (ch == 'b') goto state3; /* Loop. */
    else { error(); return 0; }
}

In my personal experience the "best" approach to writing lexers is the hand-written approach I outlined above. It works on every platform, in every language, you need not learn ancient tools like lex, and, perhaps most importantly, you have the flexibility to extend the lexer with features that are difficult or impossible to implement with a tool. For example, perhaps you want your lexer to understand Python-esque block indentation (a sketch follows), or maybe you need to dynamically extend certain token classes.
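
As a taste of that flexibility, here is a minimal sketch of Python-esque indentation tracking. The INDENT/DEDENT names and the indentation stack are invented for illustration, and a real lexer would emit one DEDENT per closed level rather than one per call:

#include <stdio.h>

enum { NEWLINE, INDENT, DEDENT /* ...plus the ordinary tokens... */ };

static int indent_stack[64]; /* indent_stack[0] is implicitly 0. */
static int indent_top = 0;

/* Called at the start of each line: measure the leading spaces and
 * compare them against the current indentation level. */
int scan_indent(void) {
    int width = 0;
    int ch;
    while ((ch = getchar()) == ' ')
        width++;
    ungetc(ch, stdin);

    if (width > indent_stack[indent_top]) {
        indent_stack[++indent_top] = width; /* Deeper block: push. */
        return INDENT;
    }
    if (width < indent_stack[indent_top]) {
        indent_top--; /* Shallower block: pop one level. */
        return DEDENT;
    }
    return NEWLINE; /* Same level. */
}

This kind of stateful behavior is awkward to express as regular expressions in a lex specification, which is precisely the point.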

Vaca answered 4/1, 2015 at 13:38 Comment(5)
Pretty happy that people are willing to spend this much time writing such worthwhile content. Keep going. +1 – Abbott
Sorry to disturb, but I guess lex etc. use the direct-coded approach! They aren't table-driven! How did you derive that? I didn't find any source! They are fast since they are direct-coded and based on states/transitions. Hand-coded lexers are the ones we write for a very small language, as hand-coding a compiler for a modern language like Java would be a burden. Please convince me, I ain't convinced! – Abbott
I haven't made any claims about how lex or any other lexer generator is actually implemented. Note that these days, lex really refers to a family of somewhat compatible tools, the most prominent of which is probably Flex (flex.sourceforge.net). As far as hand-written lexers are concerned, their use extends far beyond simple languages. For instance, the V8 JavaScript engine uses a hand-written lexer augmented with a table for single-character tokens; see github.com/v8/v8-git-mirror/blob/master/src/scanner.cc. – Uncloak
As a concrete example of a lexer generator that outputs table-driven lexers, I present ML-Lex, a lexer generator for the SML/NJ and MLton implementations of Standard ML. :-) – Uncloak
Note that both direct-style lexers compile to the same code if your C compiler supports tail call elimination. – Hokkaido

Look at my lexer generator; it is very simple and easy to understand. It generates direct-coded DFA automata as nested switch statements. I used this approach for my own projects: the lexer was first hand-written and later generated with this tool. The approach is based on my experience studying this topic in several books and in the implementations of more advanced parser generators. The project is on GitHub: rmjocz/langgen
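
The "nested switch" shape that the answer describes looks roughly like this when applied to the ab*a DFA from the accepted answer (a sketch of the general technique, not the actual output of rmjocz/langgen):

#include <stdio.h>

int scan(void) {
    int state = 0;
    for (;;) {
        switch (state) { /* Outer switch: the current state. */
        case 0:
            switch (getchar()) { /* Inner switch: the input character. */
            case 'a': state = 1; break;
            default: return 0; /* Reject. */
            }
            break;
        case 1:
            switch (getchar()) {
            case 'a': state = 2; break;
            case 'b': state = 3; break;
            default: return 0;
            }
            break;
        case 2:
            return 1; /* Accept. */
        case 3:
            switch (getchar()) {
            case 'a': state = 2; break;
            case 'b': state = 3; break;
            default: return 0;
            }
            break;
        }
    }
}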

Andesine answered 27/10, 2021 at 9:8 Comment(0)
