How to make semantic check using flex/bison?
Asked Answered
J

1

5

I have created a context free grammar in bison and a scanner in flex. Now i also want to make a semantic check, for example, suppose the input is something like this:

int m=5;
c=c+5;

This input is syntactically correct but there is an undeclared variable being used, namely "c". How can i do such a semantic check? Where should i start? Should i write my code in flex or in bison? I appreciate if anyone can help. Thanks.

Jacobson answered 27/3, 2013 at 17:47 Comment(6)
The code for the semantic check goes in the actions after the rule in the grammar, not in the lexical analyzer. Or it is diagnosed at the end of the parse, if you build up some sort of AST (abstract syntax tree) during the parse. Don't forget that you probably want the compiler to continue parsing after the semantic error. The overall compilation will fail, but you should aim to diagnose other errors if you can (but avoid diagnosing more misuses of c if at all possible).Cosher
While parsing the tree (bison) you should be storing declared variables that are available in the current scope. When you come across a variable that you can not find in the place where you are storing your variables then you have a compilation error.Kaylee
Thanks for the answers. When you say storing, for example, you mean using an array, then checking whether an identifier in that array? If that is so, in which part of the bison file can i do this? Can i write a C code into the bison file?Jacobson
Also, i saw something that is called "symbol table". Can that be related to my question?Jacobson
If you build up a parse tree, the nodes in the tree are normally dynamically allocated rather than being from an array (so you don't run into arbitrary limits because your array wasn't big enough). This is not support that is provided by Bison; rather, it appears in the C code you write as the actions that are executed when the parser built by Bison recognizes a rule. You'll have some internal representation of the code that was parsed, which you will use to generate the output from your 'compiler'.Cosher
One of the tools used when parsing is often a symbol table. It is used to keep a record of the names that are known to the compiler, and the type information associated with those names. So, the symbol table entry for m might note that it is an int; the symbol table entry for c might note that it is an undeclared variable (and therefore not object to it being used after the first error report because you can't tell whether it was being misused or not if the code was syntactically correct).Cosher
B
8

The first thing to consider is: at what point do we have enough information so that we can do the semantic check?

For a static language like C, we can do this semantic right at parse time, with a syntax-directed rule, such as those trigged in Yacc.

Your parser needs to maintain symbol tables. That is to say, whenever you open a new scope such as new function body or statement block, you have to create a new symbol table object for that scope (and keep a pointer to that in some global parser variable as the "current scope"). The scope also has a pointer to the previous scope. When the scope closes, you restore the original scope as "current scope". This scope opening and closing is tied to the parser rules which handle the block constructs like function or statement bodies, or structure bodies.

The scope contains associations between variable names and semantic information, like what kind of a symbol it is, and other attributes like type.

When your parser processes a declaration of some kind, then the declared name is introduced into the current symbol table, and thereafter it is known.

So, fast forward to our problem: how to check that a name is not defined. This is not difficult. Somewhere, your parser has rules like

primary_expression : '(' expression ')'
                   /* ...*/
                   | CONSTANT
                   | IDENT
                   ;

A primary expression can be an identifier such as a variable, constant or the name of a function. If the rules are strict, that these have to be defined if they can be used, we can put the check right here.

For the action rule of IDENT, we look up the identifier in the current symbol table. If the search comes up with nothing, we raise an error that there is an undefined identifier.

Pseudo-code:

primary_expression : '(' expression ')'
                   /* ...*/
                   | CONSTANT
                   | IDENT {
                       struct symbol *sym = symbol_lookup(current_scope, $1);
                       if (sym == NULL) {
                         static_error("undeclared identifier %s", $1);
                         $$ = error_node();
                       } else {
                         /* ... */
                       }
                     }

The symbol_lookup function does not only look in the current scope! If the identifier is not found in the current scope, it recurses into the parent scope and so on. The toplevel scope in the chain of scopes is the file scope. If the identifier is found there, then it is a global identifier of some kind. If it is not found there either, it's undefined. I also made up static_error; it has printf-like arguments, and adds file/line number information, and increments the error count (so that when the parser is done it can indicate failure based on the error count being nonzero). I made up error_node also; it's a function or macro that produces some kind of node which indicates an error (perhaps just a null pointer). Your parser rule has to produce something and store it into $$. For an identifier that does not exist, we can put some marker into the tree instead.

If you're writing a compiler in C using Yacc, you have a lot of work to do to invent all these data structures like symbol tables and write the supporting libraries.

Barcus answered 8/6, 2013 at 1:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.