how typedef-name - identifier issue is resolved in C?
Asked Answered
T

1

6

I've been recently writing parser for language based on C. I'm using CUP (Yacc for Java).

I want to implement "The lexer hack" (http://eli.thegreenplace.net/2011/05/02/the-context-sensitivity-of-c%E2%80%99s-grammar-revisited/ or https://en.wikipedia.org/wiki/The_lexer_hack), to distinguish typedef names and variable/function names etc. To enable declaring variables of the same name as type declared earlier (example from first link):

typedef int AA;

void foo() {
    AA aa;       /* OK - define variable aa of type AA */
    float AA;    /* OK - define variable AA of type float */
}

we have to introduce some new productions, where variable/function name could be either IDENTIFIER or TYPENAME. And this is the moment where difficulties occur - conflicts in grammar.

I was trying not to use this messy Yacc grammar for gcc 3.4 (http://yaxx.googlecode.com/svn-history/r2/trunk/gcc-3.4.0/gcc/c-parse.y), but this time I have no idea how to resolve conflicts on my own. I took a look at Yacc grammar:

declarator:
    after_type_declarator
    | notype_declarator
    ;

after_type_declarator:
    ...
    | TYPENAME
    ;

notype_declarator:
    ...
    | IDENTIFIER
    ;

fndef:
    declspecs_ts setspecs declarator
    // some action code
    // the rest of production
...

setspecs: /* empty */
    // some action code

declspecs_ts means declaration_specifiers where "Whether a type specifier has been seen; after a type specifier, a typedef name is an identifier to redeclare (_ts or _nots)."

From declspecs_ts we can reach

typespec_nonreserved_nonattr:
    TYPENAME
    ...
    ;

At the first glance I can't believe how shift/reduce conflicts does not appear! setspecs is empty, so we have declspecs_ts followed by declarator, so that we can expect that parser should be confused whether TYPENAME is from declspecs_ts or from declarator.

Can anyone explain this briefly (or even precisely). Thanks in advance!

EDIT: Useful link: http://www.gnu.org/software/bison/manual/bison.html#Semantic-Tokens

Tachometer answered 19/6, 2013 at 22:23 Comment(0)
E
3

I can't speak for the specific code.

But the basic trick is that the C lexer inspects every IDENTIFIER, and decides if might be the name of a typedef. If so, then it changes the lexeme type to TYPEDEF and hands it to the parser.

How is the lexer to know what identifiers are typedefs? The parser must in effect tell it, by capturing typedef information as it runs. Somewhere in the grammar related to declarations, there must be an action to provide this information. I would have expected it to be attached to the grammar rules for, well, typedef declarations.

You didn't show what "setspec" did; maybe that's the place. A common trick used with LR parser generators is to introduce a grammar rule E with an empty right hand (your example "setspec"?), to be invoked in the middle of some other grammar rule (your example "fndef") just to enable access to a semantic action in the middle of processing that rule.

This whole trick is to get around parsing ambiguity if you can't tell typedefs from other identifiers. If your parser tolerates ambiguity, you don't need this hack at all; just parse, and built ASTs with both (sub)parses. After you acquire the AST, a tree walk can find type information and eliminate inconsistent subparses. We do this with GLR for both C and C++, and it beautifully separates parsing from name resolution.

Eggnog answered 20/6, 2013 at 10:0 Comment(5)
You're right, but in this case something different happens. You can read setspec definition in link above code snippet. I looked a bit deeper in this code. declspecs_ts contains EXACTLY ONE TYPENAME and often some other specifiers (qualifiers like INLINE etc.). There are also other variants like declspecs_nots and we have to combine different declspecs with after_type_declarator and notype_declaraor to support all possible combinations. This is even more complicated (because of attribues), nevertheless it is organized/split smart enough to prevent conflicts.Tachometer
Welcome to production grammars, which have not only the "standard" langauge constructs, but all the other junk that a specific compiler/dialect (e.g., GCC, MS, GreenHills) throws into the pile. When you add all the extensions that people do, the grammars tend to stop being simple. If you are lucky, the grammar is still "organized/split smart enough" to be manageable. GLR parsers actually make doing this for big complicated grammars quite a bit easier, because don't have to entangle stuff like TYPEDEF checking the grammar itself.Eggnog
This is how I see it: declspecs_ts after_type_declaration is for case when typename is redeclared as variable (possibly of some other typename2), for example: typedef int myType1; typedef float myType2; { myType1 myType2; /* variable myType2 (name) of type myType1 */ } declspecs_nots notype_declaration is for simple int foo;Tachometer
Ok, can you recommend any GLR parser for Java?Tachometer
Consider strategoxt.org/Stratego/JSGLR I have no specific experience with this, but the Stratego guys are a pretty good act. Wouldn't surprise me if you found a C grammar nearby.Eggnog

© 2022 - 2024 — McMap. All rights reserved.