How to implement a language interpreter without regular expressions?

I am attempting to write an interpreted programming language which will read in files and output a bytecode-like format that can then be executed by a virtual machine.

My original plan was:

  • Begin by loading the contents of the file into the memory of my interpreter,
  • read each word (where a word is defined as a series of letters followed by a space or escape character),
  • using regular expressions, determine the parameters of the word,
  • for example, IF myvalue = 5 THEN will become IF myvalue 5,
  • convert each of these words to the byte form (e.g. IF would become 0x09),
  • execute these bytes one by one (as the executor will understand that IF, or 0x09, is followed by two bytes).
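
For illustration, here is a rough sketch of the kind of executor I have in mind. Only the IF = 0x09 opcode and its two operand bytes come from the plan above; the HALT opcode and the meaning of the operands are placeholders.

#include <stdio.h>

enum { OP_HALT = 0x00, OP_IF = 0x09 }; /* OP_HALT is a made-up placeholder */

static void execute(const unsigned char* code, size_t len){
    size_t pc = 0;
    while (pc < len){
        switch (code[pc]){
        case OP_IF:
            /* the executor knows that IF (0x09) is followed by two bytes */
            printf("IF  operand=%u  compare-to=%u\n", code[pc + 1], code[pc + 2]);
            pc += 3;
            break;
        case OP_HALT:
            return;
        default:
            printf("unknown opcode 0x%02x\n", code[pc]);
            return;
        }
    }
}

int main(void){
    /* "IF myvalue 5" -> 0x09, <index of myvalue>, 5 (indices are placeholders) */
    const unsigned char program[] = { 0x09, 0x01, 0x05, 0x00 };
    execute(program, sizeof program);
    return 0;
}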

I have been told that regular expressions are a terrible way to go about doing this, but I'm unsure whether this is a good or bad way of implementing an interpreted language.

This is mainly just for experience, so I don't mind if it isn't exactly performance-friendly, though better performance would be a bonus.

What is the best way of implementing my interpreter, and are there any examples (written in plain old C)?

Roughspoken answered 1/4, 2015 at 16:23 Comment(4)
Have you looked at lex and yacc (or the FSF analogues flex and bison)? Regular expressions are used in lex, a grammar is used in yacc.Rettke
"I have been told that regular expressions is a terrible way to go about doing this" This is wrong. Regular expressions are a great way to parse tokens, even if they can't parse the language entirely.Knut
Whether regular expressions are a good way or a poor way to parse parts of your language depends on the details of your language. They are frequently used for such purposes, though, and have been for a long time. The lex lexical analyzer and its derivatives depend heavily on regexes, for example.Noggin
It is not about making an interpreter, it is just about parsing which is the first step of an interpreter.Homicide

The reason people tell you that regular expressions aren't the best idea for a case like this is that regular expressions take more time to evaluate than hand-written matching code, and the regular expression language has many limitations and quirks that make it unsuitable for many applications.

You should be aware that this is just a knee-jerk reaction that many programmers have (including myself), regardless of whether or not regular expressions would actually be suitable for the application. This stems from people trying to do too much with regular expressions, such as trying to parse HTML.

Many compilers use a basic single-pass tokenization algorithm. The tokenizer has a very basic idea of what can be used as a delimiter, how constants and identifiers should be treated, etc. The tokenizer will then just iterate through the input quickly and emit a string of tokens that can then be easily parsed.

For a basic application, such as parsing tokens, the relatively minor penalty from using regular expressions isn't something to worry about. However, as I said, there are sometimes peculiarities in how regular expressions work that might limit what your tokenizer can do, though that work can usually be offloaded to a later point in the compiler pipeline. Everything a tokenizer is expected to do should be representable by standard regular expressions, though.

It should be noted that when you involve regular expressions directly in your code, things can get a little hairy. You have to determine how the regular expressions should be applied to the input, where input will be delineated, and so on. You'll also incur the penalty of compiling and evaluating the regular expressions.

There are some projects, such as lex, which make use of regular expressions because they provide a simple, concise way to describe a grammar, which the tool can then use internally in whatever representation it chooses. They also handle all of the glue logic for you; you merely have to describe the grammar it should use via regular expressions.

When such a generator is used, it can turn each regular expression into code that represents what the expression actually means. If it sees the expression [0-9], it can just replace that with a call to isdigit, an equivalent switch statement, or some other representation. This makes the generated code much more efficient than any inline use of regular expressions could achieve.
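
For instance, here is roughly the kind of code a generator might emit for the pattern [0-9]+; the function name is made up for illustration, but the point is that no regex engine is involved at run-time:

#include <ctype.h>
#include <stddef.h>

/* Hypothetical hand-expanded equivalent of the pattern [0-9]+ */
static size_t match_integer(const char* input, size_t size){
    size_t len = 0;
    while (len < size && isdigit((unsigned char)input[len])){
        len++;
    }
    return len; /* number of characters matched, 0 if no match */
}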

So my thought is this: If you want to use regular expressions to parse your language, go all the way and create a scanner description for flex/lex to generate the tokenizer for you. However, if you're actually writing it entirely yourself, you'd be better off with a logically simpler approach like the one I described.

I thought it would be fun to write up a sample tokenizer that doesn't use regular expressions, so here it is. I wrote it in C-like C++. The only C++ features I used were the standard vector and string, but I did it in such a way that you could easily drop in C variants.

#include <vector>
#include <string>
#include <ctype.h>
#include <string.h>  // memcpy, strlen, strncmp
#include <stdlib.h>  // atoll, atof
#include <stdio.h>   // printf

typedef std::vector<std::string> string_list;
typedef std::vector<long long > int_list;
typedef std::vector<long double> float_list;

std::string substr(const char* value, size_t length){
    std::string v;
    v.resize(length);
    memcpy(&v[0], value, length * sizeof(char));
    return v;
}

long long string_to_int(const char* value, size_t length){
    return atoll(substr(value, length).c_str());
}
long double string_to_float(const char* value, size_t length){
    return atof(substr(value, length).c_str());
}


void int_list_add(int_list& list, long long value){
    list.push_back(value);
}
void string_list_add(string_list& list, const char* value, size_t length){
    list.push_back(substr(value, length));
}
void float_list_add(float_list& list, long double value){
    list.push_back(value);
}
size_t int_list_last(int_list& list){
    return list.size();
}
size_t string_list_last(string_list& list){
    return list.size();
}
size_t float_list_last(float_list& list){
    return list.size();
}



typedef struct{
    string_list identifiers;
    string_list constants_string;
    int_list constants_int;
    float_list constants_float;
    size_t id;
} *state, state_value;

state tok_state_create(){
    state ret = new state_value;
    ret->id = 0;
    return ret;
}
void tok_state_destroy(state t_state){
    delete t_state;
}
const char* tok_state_read_identifier(state t_state, size_t id){
    return t_state->identifiers[id - 1].c_str();
}
const char* tok_state_read_string(state t_state, size_t id){
    return t_state->constants_string[id - 1].c_str();
}
long long tok_state_read_int(state t_state, size_t id){
    return t_state->constants_int[id - 1];
}
long double tok_state_read_float(state t_state, size_t id){
    return t_state->constants_float[id - 1];
}



const char* punct_tokens[] = { "Not A Token (Dummy)",
".", ",", "<", "<<", ">", ">>",
";", "+", "-", "/", "*", "!", "%", "^",
"&", "(", ")", "=", "==", "[", "]", "{",
"}", "?", ":", "|", "||", "&&", "~", 0
};

const char* key_tokens[] = { "Not A Token (Dummy)",
"if", "while", "do", "then", "end", 0
};

typedef enum{
    TOK_TYPE_INTEGER = 500,
    TOK_TYPE_FLOAT,
    TOK_TYPE_STRING,
    TOK_TYPE_IDENTIFIER,
    TOK_TYPE_NONE
} tok_type;

const char* get_token_from_id(size_t id){
    if (id < 100){
        return punct_tokens[id];
    }
    if (id < 200){
        return key_tokens[id - 100];
    }
    if (id >= 500){
        switch (id){
        case TOK_TYPE_INTEGER:      return "Integer Constant";
        case TOK_TYPE_FLOAT:        return "Float Constant  ";
        case TOK_TYPE_STRING:       return "String Constant ";
        case TOK_TYPE_IDENTIFIER:   return "Identifier      ";
        case TOK_TYPE_NONE:         return "Unknown         ";
        default:
            break;
        }
    }
    return "Not A Token (Dummy)";
}

int is_identifier_char(char c){
    if (isalpha(c) || c == '_'){
        return 1;
    }
    return 0;
}

// Longest-match lookup in the punctuation token table; returns the token id,
// or 0 if none matches (with a special case so '.' followed by a digit is
// left for the float scanner).
size_t read_punct_token(const char* input, size_t size){
    size_t max_len = 0;
    size_t token_id = 0;
    for (size_t i = 1; punct_tokens[i] != 0; ++i){
        size_t len = strlen(punct_tokens[i]);
        if (len > max_len && len <= size && strncmp(punct_tokens[i], input, len) == 0){
            max_len = len;
            if (i == 1 && size > 1 && isdigit(input[1])){
                return 0; //Special case for floats
            }
            token_id = i;
        }
    }
    return token_id;
}

size_t read_key_token(const char* input, size_t size){
    size_t max_len = 0;
    size_t token_id = 0;
    for (size_t i = 1; key_tokens[i] != 0; ++i){
        size_t len = strlen(key_tokens[i]);
        if (len > max_len && len <= size && strncmp(key_tokens[i], input, len) == 0){
            max_len = len;
            token_id = i + 100;
        }
    }
    return token_id;
}


size_t is_punct_token_char(char c){
    for (size_t i = 1; punct_tokens[i] != 0; ++i){
        if (punct_tokens[i][0] == c){
            return 1;
        }
    }
    return 0;
}


void add_token(state t_state, tok_type type, const char* string, size_t length){
    switch (type){
    case TOK_TYPE_INTEGER:
        int_list_add(t_state->constants_int, string_to_int(string, length));
        t_state->id = int_list_last(t_state->constants_int);
        break;
    case TOK_TYPE_FLOAT:
        float_list_add(t_state->constants_float, string_to_float(string, length));
        t_state->id = float_list_last(t_state->constants_float);
        break;
    case TOK_TYPE_STRING:
        string_list_add(t_state->constants_string, string, length);
        t_state->id = string_list_last(t_state->constants_string);
        break;
    case TOK_TYPE_IDENTIFIER:
        string_list_add(t_state->identifiers, string, length);
        t_state->id = string_list_last(t_state->identifiers);
        break;
    default:
        //Do some error here
        break;
    }
}

// Pull the next token from *input, advancing *input and *size.
// Integer/float/string/identifier tokens are reported in two steps: the type
// token is returned first, and the 1-based index into the matching state
// table is stashed in t_state->id and returned by the next call.
size_t get_token(state t_state, char** input, size_t *size){
    if (t_state->id != 0){
        size_t id = t_state->id;
        t_state->id = 0;
        return id;
    }
    char* base = *input;
    size_t padding = 0;
    size_t length = 0;
    tok_type type = TOK_TYPE_NONE;
    while (*size > 0){
        if (isspace(*base)){
            base++;
            (*size)--;
        }
        else{
            break;
        }
    }

    size_t tok = read_punct_token(base, *size);
    if (tok){
        size_t len = strlen(get_token_from_id(tok));
        *input = base + len;
        *size -= len;
        return tok;
    }
    tok = read_key_token(base, *size);
    if (tok){
        size_t len = strlen(get_token_from_id(tok));
        *input = base + len;
        *size -= len;
        return tok;
    }

    while (*size - length > 0){
        if (length == 0 && type == TOK_TYPE_NONE){
            if (is_identifier_char(*base)){
                type = TOK_TYPE_IDENTIFIER;
                length++;
            }
            else if (*base == '"'){
                type = TOK_TYPE_STRING;
                padding = 1;
                base++;
                (*size)--;
            }
            else if (*base == '.' && *size > 1 && isdigit(base[1])){
                type = TOK_TYPE_FLOAT;
            }
            else if (isdigit(*base)){
                type = TOK_TYPE_INTEGER;
            }
            else if (is_punct_token_char(*base)){
                tok = read_punct_token(base, *size);
                if (tok){
                    size_t len = strlen(punct_tokens[tok]);
                    *input += len;
                    *size -= len;
                    return tok;
                }
                else{
                    //do error
                }
            }
        }
        else{
            if (!isspace(base[length]) || type == TOK_TYPE_STRING){
                switch (type){
                case TOK_TYPE_INTEGER:
                    if (isdigit(base[length])){
                        length++;
                        continue;
                    }
                    else if (base[length] == '.' || tolower(base[length]) == 'e'){
                        type = TOK_TYPE_FLOAT;
                        length++;
                        continue;
                    }
                    break;
                case TOK_TYPE_FLOAT:
                    if (isdigit(base[length]) || base[length] == '.' || base[length] == 'e'){
                        length++;
                        continue;
                    }
                    break;
                case TOK_TYPE_STRING:
                    if (base[length] != '"'){
                        length++;
                        continue;
                    }
                    break;
                case TOK_TYPE_IDENTIFIER:
                    if (is_identifier_char(base[length])){
                        length++;
                        continue;
                    }
                    break;
                default:
                    break;
                }
            }
            //We only get here if this is a space or any of the switch cases didn't continue.
            add_token(t_state, type, base, length);
            *input = base + length + padding;
            *size -= length + padding;
            return type;
        }
    }
    if (type != TOK_TYPE_NONE && length > 0){
        //Input ended while we were scanning a token; emit it rather than dropping it.
        add_token(t_state, type, base, length);
        *input = base + length + padding;
        *size -= length + padding;
        return type;
    }
    *input = base + length + padding;
    *size -= length + padding;
    return 0;
}

int main(){
    const char* input = "if(1+1==4)then print\"hi!\";end";
    state s = tok_state_create();
    size_t size = strlen(input);
    size_t token;
    size_t token_prev = 0;
    printf("Token\tMeaning\n\n");

    while ((token = get_token(s, (char**)&input, &size)) != 0){
        if (token_prev < 500){
            if (token < 500){
                printf("%d\t%s\n", token, get_token_from_id(token));
            }
            else{
                printf("%d\t%s #", token, get_token_from_id(token));
            }
        }
        else{
            printf("%d\t", token);
            switch (token_prev){
            case TOK_TYPE_IDENTIFIER: printf("%s\n", tok_state_read_identifier(s, token)); break;
            case TOK_TYPE_STRING: printf("%s\n", tok_state_read_string(s, token)); break;
            case TOK_TYPE_INTEGER: printf("%lld\n", tok_state_read_int(s, token)); break;
            case TOK_TYPE_FLOAT: printf("%Lf\n", tok_state_read_float(s, token)); break;

            }
        }
        token_prev = token;
    }

    tok_state_destroy(s);
}

This will print:

Token   Meaning

101     if
16      (
500     Integer Constant #1     1
8       +
500     Integer Constant #2     1
19      ==
500     Integer Constant #3     4
17      )
104     then
503     Identifier       #1     print
502     String Constant  #1     hi!
7       ;
105     end
Oceanic answered 1/4, 2015 at 19:49 Comment(4)
Scanner generators like flex (1) use regular expressions and (2) produce very fast scanners. I dare say that a naively written flex scanner would be (3) much easier to read and (4) quite possibly faster and certainly not significantly slower than your example scanner.Samella
@Samella I edited my answer to make it more correct. I will agree that flex scanner definitions are definitely easier to read and write, after all that's the entire reason why the project came about in the first place: to make writing scanners easier. I made it clear that regular expressions in and of themselves aren't the problem; it's the idea of invoking a regular expression parser from your code that's bad. Flex knows what the regex are at compile time and can thus break them down into simpler functions that don't incur performance penalties from parsing regex at run-time.Oceanic
Actually, flex (and lex) do not use isdigit either. They compile the regular expression into a DFA and use a table-based approach to walk the DFA at runtime. Tables are faster than open code because they don't require as much test&branch. Nothing stops a regex library from using the same strategy -- re2 does, for example -- but it only works for pure regular expressions: no back-references, no captures, no fancy operators. However, it is almost always good enough for tokenizing.Samella
I was speaking in generalities even though I mentioned a specific project. I've fixed the language. Also, even if a regular expression library is highly optimized, it's still working at run-time. You'd also have to write all the glue code anyways, which is almost as much of a headache as writing the parser without regex in the first place.Oceanic

As others have mentioned, there's nothing wrong with regex. Here's a typical path:

  1. Parse your source file into tokens
  2. Generate an abstract syntax tree from your stream of parsed tokens
  3. Walk your AST to generate your bytecode

Flex and bison are great tools for steps 1 and 2.

You will probably also need some sort of symbol table for variable definitions.
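
A symbol table can start out very simple. Here is a rough sketch in C; the names and fixed sizes are placeholders, and error handling (e.g. overflowing MAX_SYMBOLS) is omitted:

#include <string.h>

#define MAX_SYMBOLS 256

typedef struct{
    char   name[64];   /* variable name as written in the source  */
    size_t slot;       /* e.g. index of the variable in VM storage */
} symbol;

typedef struct{
    symbol entries[MAX_SYMBOLS];
    size_t count;
} symbol_table;

/* Return the slot for a name, adding the name on first use. */
static size_t symbol_lookup_or_add(symbol_table* tab, const char* name){
    for (size_t i = 0; i < tab->count; ++i){
        if (strcmp(tab->entries[i].name, name) == 0){
            return tab->entries[i].slot;
        }
    }
    symbol* s = &tab->entries[tab->count];
    strncpy(s->name, name, sizeof s->name - 1);
    s->name[sizeof s->name - 1] = '\0';
    s->slot = tab->count;
    tab->count++;
    return s->slot;
}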

Many compiler-design courses end up implementing a compiler for some subset of the C language; you could look at OpenCourseWare-type sites for an actual class on this.

Clingfish answered 1/4, 2015 at 17:3 Comment(0)

As the other answers have said, there's nothing wrong with regular expressions per se. But more than this, regular expressions are the natural representation of character-level syntax.

All the major languages (and many, perhaps most, minor ones) use regular expressions to define the appearance of the different input tokens, at least as a formalism in designing the language.

But there may indeed be a performance penalty for certain tricks provided by regex libraries. Features like backreferences go beyond what a finite automaton can recognize, so the library has to fall back on a slower, more powerful matching mechanism (typically backtracking) than it would need for a pure regular expression.

Regular expressions (without fancy stuff like backreferences) can be transformed into a finite automaton or state machine, and that can be implemented with a much simpler (i.e. faster) controlling function.
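
As a sketch of what that controlling function can look like, here is a hand-built state machine for the pattern [0-9]+(\.[0-9]+)?; the state names are invented, but this is the sort of code such an expression compiles down to:

#include <ctype.h>
#include <stddef.h>

typedef enum { S_START, S_INT, S_DOT, S_FRAC, S_REJECT } dfa_state;

/* Returns the length of the longest prefix of input accepted by
   [0-9]+(\.[0-9]+)?, or 0 if there is no match. */
static size_t match_number(const char* input, size_t size){
    dfa_state state = S_START;
    size_t last_accept = 0;
    for (size_t i = 0; i < size && state != S_REJECT; ++i){
        char c = input[i];
        switch (state){
        case S_START: state = isdigit((unsigned char)c) ? S_INT  : S_REJECT; break;
        case S_INT:   state = isdigit((unsigned char)c) ? S_INT
                            : c == '.'                  ? S_DOT  : S_REJECT; break;
        case S_DOT:   state = isdigit((unsigned char)c) ? S_FRAC : S_REJECT; break;
        case S_FRAC:  state = isdigit((unsigned char)c) ? S_FRAC : S_REJECT; break;
        default: break;
        }
        if (state == S_INT || state == S_FRAC){
            last_accept = i + 1;
        }
    }
    return last_accept;
}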

At the very least, for efficiency, you should pre-compile your language's defining expressions and reuse the compiled matcher objects instead of constructing a new one dynamically each time. If your regex library doesn't offer this, it may be time to do some library shopping, or to read up on automata theory.
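
With the POSIX <regex.h> API in C, for example, that simply means calling regcomp once up front and reusing the compiled object for every match; a rough sketch with minimal error handling:

#include <regex.h>
#include <stdio.h>

int main(void){
    regex_t number;
    /* Compile the defining expression once, up front. */
    if (regcomp(&number, "^[0-9]+(\\.[0-9]+)?", REG_EXTENDED) != 0){
        fprintf(stderr, "regcomp failed\n");
        return 1;
    }

    /* Reuse the compiled matcher object for every input. */
    const char* inputs[] = { "123", "3.14", "abc" };
    for (size_t i = 0; i < 3; ++i){
        regmatch_t m;
        if (regexec(&number, inputs[i], 1, &m, 0) == 0){
            printf("'%s' matches %d characters\n", inputs[i], (int)(m.rm_eo - m.rm_so));
        }
        else{
            printf("'%s' does not match\n", inputs[i]);
        }
    }

    regfree(&number);
    return 0;
}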

Moselle answered 23/4, 2015 at 6:58 Comment(0)

Any interpreter or compiler for a sophisticated language (one with parenthesized expressions or compound nested statements such as do-while and if-then-else) requires that you build a parser to extract the (usually recursive) structure of the code.

You've gotten lots of answers here saying "regular expressions are good for tokens". Yes, in many classically built compilers and interpreters, people write regular expressions to describe the shape of individual tokens (identifiers, keywords, numeric and string constants, comments) and hand them to a lexer generator (like Flex or many others) that combines these into an efficient finite-state machine. This works because the "syntax" of tokens is almost always pretty simple. That very simplicity means you can hand-code a token lexer yourself and produce practical results for small to medium size languages. (At some point (e.g., COBOL) the sheer size of the language starts to overwhelm you, and it is hard to avoid using a lexer generator if you want to stay sane.)

What hasn't been discussed is real parsing: discovering structure, assuming we already have tokens built somehow. Using regexes for tokens IS NOT PARSING, and regexes cannot be used for parsing; they cannot recognize the nested structures needed for those "sophisticated" constructs. People unfamiliar with parsing make this mistake repeatedly.

If you want to succeed, you need to learn how to build a parser.

For complex languages, there are parser generators (YACC, JavaCC, many others), analogous to lexer generators, that will take a BNF grammar and generate a parser for you. If you want to write a compiler with such tools, you typically attach parser actions to grammar-rule recognition points, often to build a tree for later processing.

You can also hand-code a parser for modest-size languages. This is a set of mutually recursive procedures, one for each grammar rule, that analyze the code (using recursion to handle nested constructs). You can also attach parse actions to the procedures. Since these procedures recognize grammar constructs, this is essentially the same trick you would apply if you were using a parser generator. This scheme can be trivially extended to handle the "parsing" (lexing) of tokens. If you go down this path completely, there are no regular expressions anywhere.
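
As a sketch of what those mutually recursive procedures look like, here is a hand-coded recursive descent parser/evaluator for a tiny invented expression grammar (it reads characters directly rather than a token stream, to keep the sketch short):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy grammar, one procedure per rule:
     expr   : term   (('+'|'-') term)*
     term   : factor (('*'|'/') factor)*
     factor : NUMBER | '(' expr ')'                       */
static const char* src;

static long parse_expr(void);

static void skip_ws(void){
    while (isspace((unsigned char)*src)) src++;
}

static long parse_factor(void){
    skip_ws();
    if (*src == '('){
        src++;                    /* consume '('              */
        long v = parse_expr();    /* recursion handles nesting */
        skip_ws();
        if (*src == ')') src++;   /* consume ')'               */
        return v;
    }
    char* end;
    long v = strtol(src, &end, 10);
    src = end;
    return v;
}

static long parse_term(void){
    long v = parse_factor();
    for (;;){
        skip_ws();
        if (*src == '*'){ src++; v *= parse_factor(); }
        else if (*src == '/'){ src++; v /= parse_factor(); }
        else return v;
    }
}

static long parse_expr(void){
    long v = parse_term();
    for (;;){
        skip_ws();
        if (*src == '+'){ src++; v += parse_term(); }
        else if (*src == '-'){ src++; v -= parse_term(); }
        else return v;
    }
}

int main(void){
    src = "1 + 2 * (3 + 4)";
    printf("%ld\n", parse_expr());   /* prints 15 */
    return 0;
}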

It is possible to build an interpreter by executing interpreter-actions in the grammar-rule recognition places/in the hand-coded recursive parsing procedures. It won't run very fast because it will continually reparse the source code.

It is better to build an (abstract) syntax tree (AST) representing the program, and then build an interpreter that crawls the AST to execute it. The AST is constructed by attaching tree-building actions as parser actions.

If you want to produce byte-code, then you have a classic code generation problem. These are generally best solved by building an AST, walking the AST almost as an interpreter would, and spitting out byte codes that are intended to achieve the purpose of the interpreter at each point. You can build an on-the-fly code generator by producing byte codes in the parser actions. It is harder to generate "good" code this way, because the code generator cannot see enough context to handle special cases well.
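
A sketch of such an AST walk for simple arithmetic nodes follows; the node kinds and the byte codes it emits are invented for illustration (a real stack-based VM would define its own):

#include <stdio.h>

typedef enum { NODE_CONST, NODE_ADD, NODE_MUL } node_kind;

typedef struct node{
    node_kind kind;
    int value;                    /* used by NODE_CONST          */
    struct node* left;            /* used by NODE_ADD / NODE_MUL */
    struct node* right;
} node;

/* Walk the tree bottom-up, emitting code for a hypothetical stack
   machine so that each subtree leaves its value on the stack.     */
static void emit(const node* n){
    switch (n->kind){
    case NODE_CONST:
        printf("PUSH %d\n", n->value);   /* e.g. 0x01 <value> */
        break;
    case NODE_ADD:
        emit(n->left);
        emit(n->right);
        printf("ADD\n");                 /* e.g. 0x02 */
        break;
    case NODE_MUL:
        emit(n->left);
        emit(n->right);
        printf("MUL\n");                 /* e.g. 0x03 */
        break;
    }
}

int main(void){
    /* AST for (1 + 2) * 4, built by hand for the sketch. */
    node one  = { NODE_CONST, 1, 0, 0 };
    node two  = { NODE_CONST, 2, 0, 0 };
    node four = { NODE_CONST, 4, 0, 0 };
    node sum  = { NODE_ADD,  0, &one, &two };
    node prod = { NODE_MUL,  0, &sum, &four };
    emit(&prod);   /* PUSH 1, PUSH 2, ADD, PUSH 4, MUL */
    return 0;
}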

You can fumble your way through all of this. Your best approach is to get a compiler book, or take a compiler class. Then all of this will be much clearer. Certainly if you fumble your way through this, you will have a much better appreciation of the kinds of machinery the compiler people use to do this. (Google my essay on "Life After Parsing").

Repatriate answered 23/4, 2015 at 8:49 Comment(0)
