Generate an AST in C++

Asked 15/2, 2015 at 19:39 Answered 7/11, 2021 at 19:10

Solved c++algorithm parsing abstract-syntax-tree lalr

I'm making an interpreter in C++, so far I've got my lexer to generate tokens. The problem is I'm not sure how to generate an "walk" a parse tree.

I was thinking of making my parse tree using an array of arrays, but I'm not sure how to actually insert the tokens into the parse tree in the correct order.

I'm not sure whether or not to go top-down, left-right or bottom-up, right-left.

Can anyone provide me with a simple LALR(1) algorithm?

Quarterstaff answered 15/2, 2015 at 19:39 Comment(1)

First, you have to collect a parse tree. A token stream isn't enough. You have not said how you intend to parse the stream of tokens. – Carboxylate 15/2, 2015 at 20:53

I'm going to go against conventional wisdom here and say that you should build your AST manually with natural language-specific data-structures.

A generic "AST data structure" will be too general to work with comfortably -- the code that consumes the AST to do anything useful with it will be obscured by the workarounds to access the data it wants. If you go this route (or use a parser generator), you'll likely end up building a translation layer to go from the generic structure to an AST that actually makes sense for your language. Why not avoid the middleman and build the final data structure directly?

I suggest writing a first syntactic pass that consumes the tokens it needs for each possible construct (probably peeking ahead one token as needed). This syntactic pass would construct the AST out of linked instances of data structures that represent each possible construct in your language. For example, if your program can consist of a series of statements, where each statement can be either a function declaration or a function call, you could create the following data structures:

AST (struct)
    -> std::vector<Statement*> statements

StatementKind (enum class)
    -> Function
    -> Call

Statement (struct)
    -> StatementKind kind

Function : Statement (struct)
    -> std::string name
    -> std::vector<Statement*> body

Call : Statement (struct)
    -> std::string name
    -> Function* called

Then the syntactic pass to build the initial AST might look like:

auto ast = parseAST();

where parseAST calls parseStatement repeatedly, which consumes and/or peeks at tokens to determine whether the statement is a function definition or a function call, then calls parseFunction or parseCall appropriately. This is called a hand-written recursive descent parser, and is in my experience by far the simplest and best type of parser you can write. Yes there is boilerplate to write, but it's also very clear and flexible code.

The syntactic phase generates syntax error messages as it goes, attempting to recover as best as it can when an error is found (this is one argument for semicolon-delimited languages -- the compilers and tools can often recover from errors much more easily, since it often suffices to skip to the next semicolon when an error is encountered before trying to parse the next construct).

If a function is allowed to be called before it is defined, this is simple to handle too -- just parse the call part without checking if a function with that name exists at the time you first parse it, then correlate it later.

A second, semantic, pass through the AST would annotate it with any missing cross-referenced data (e.g. resolving function calls' function names with those functions' definitions, or generating an error if the name is not found). Once that's done, you have a full AST that's guaranteed to represent a valid program (since you did all the semantic error checking in this pass), and can have optimizations applied to it, followed by the code generation phase (and more optimizations along the way).

Note that I completely left out any mention of LL or LR parsers, etc. The theoretical notation and categorization of your language is important (as it can prove it's non-ambiguous, for example), but from an implementation point of view, it's far more important to have clean, easily understood and above all easily modifiable code than to adhere to a specific parsing mechanism. Examples of other compilers that use hand-written parsers include GCC, Clang, and Google's V8, among others.

In terms of efficiency, the AST is generally iterated over several times after it's constructed, and needs to stick around in memory until late in the process (possibly right until the end, depending on how you do your code generation), and so it's best to use an object pool to allocate the records for your AST than to dynamically allocate everything separately on the heap. Finally, in place of strings, it's generally better to use an offset-and-length into the original source code.

Ary answered 15/2, 2015 at 20:28 Comment(2)

One question, is there a way to add a FunctionDefinition struct and VariableDefinition struct to the vector<Statement*> and then depending on the type of the statement access different values (giving the fact that FunctionDefinition and VariableDefinition inherit from Statement struct) – Joniejonina 7/11, 2021 at 14:39

Yes, same as you would with any other discriminated union. Statement in this example has the kind field for this very purpose. – Ary 8/11, 2021 at 19:12

You could use some parser generator like bison or ANTLR. Both have good documentations with tutorial part. The action part of your grammar rules (fed to bisonor antlr, which generates C++ code for parsing) would build the abstract syntax tree.

We can't help more without knowing more the syntax (and the semantics) of the formal language you want to parse and interpret.

If your language is an infix calculator, bison has such an example.

You probably should think of a class hierarchy to represent the nodes of your AST. You'll have a class for leafs (e.g. numbers), a class for addition node (with two sons as smart pointers to other nodes), etc...

e.g.

class ASTNode {
/// from https://mcmap.net/q/796085/-generate-an-ast-in-c
///... add some things here, e.g. a virtual print method
};

class NumberNode : public ASTNode {
   long number;
   /// etc...
};

class BinaryOpNode : public ASTNode {
   std::unique_ptr<ASTNode> left;
   std::unique_ptr<ASTNode> right;
 /// etc....
};

class AdditionNode : public BinaryOpNode {
/// etc....
};

class CallNode : public ASTNode {
   std::shared_ptr<Function> func;
   std::vector<std::unique_ptr<ASTNode>> args;
};

for nodes of variadic arity (i.e. any number of sons) you'll need a vector of smart pointers, like args above.

To traverse the tree, you'll do a recursive traversal so you better use smart pointers. Read also about the visitor pattern. And with C++11 std::function-s and anonymous closures -i.e lambdas- , you could have a visit virtual function (to which you give a closure visiting each node). The Unix nftw(3) function which visits file trees could be inspirational.

Schexnayder answered 15/2, 2015 at 20:6 Comment(4)

Would a 2d vector work to represent the AST? I thought of it because you could traverse the tree by going down 1 and left 1 to move to the next node in the tree. – Quarterstaff 15/2, 2015 at 20:20

You'll need a vector if you have variadic nodes like my CallNode – Schexnayder 15/2, 2015 at 20:23

You don't want to use arrays (otherwise as arrays of pointers in fixed arity nodes) for ASTs. They are a canonical example of pointer usage, and smart pointers are handy for them. – Schexnayder 15/2, 2015 at 20:28

You could use an adjency list, which is a vector of nodes where each node contains a vector of pointers (links to other nodes). If you want polymorphism, each node gets a unique_ptr<ASTBase> that actually contains data for that node. What's more, in addition to DAG trees, this can represent almost any graph, especially by using a multi/map instead of a vector. It also makes traversal and memory management much easier. – Picard 25/1, 2019 at 6:42

People build ASTs as heap-allocated trees. (Yes, you can do it an array but it just isn't as convenient). I suggest you READ the bison documentation; it will explain to you how to build trees and you can follow that style.

Guessing at your experience level based on your question, if I were you I'd build a flex/bison based parser/AST builder at least once to get a good experience, before you came back to building everything yourself.

Carboxylate answered 15/2, 2015 at 19:58 Comment(0)

I used a set of BNF production methods to generate nodes (struct inheritance) of a specific type for generating the AST. With std::move(), you can transfer the pointer ownership to avoid dangling pointers. Then there is a public recursive visit method (switch-case) that traverses the AST following a certain traversal pattern (post/pre order), checks for the AST node type and executes an accept() for each visit. The accepts are bind to a dispatcher interface which must be implemented by the user (print the tree, execute the tree, etc.). For each visit the corresponding method on user side is called.

Craniology answered 9/6, 2020 at 12:16 Comment(0)

Do yourself a favor and use boost spirit. Here is a quick link directly to an example for ASTs. https://www.boost.org/doc/libs/1_77_0/libs/spirit/doc/html/spirit/qi/tutorials/mini_xml___asts_.html

Not only is the library well maintained and thought out, it is also really easy to use and the documentation is excellent. Basically it allows you to write your grammar in an pretty readable C++ way. I highly recommend it.

Develop answered 7/11, 2021 at 19:10 Comment(0)

Recommended topics

Hot tags