What do flex and bison need from each other?

The most important thing in the bison-generated header are the enum values used to identify the token types (which are the values returned to the parser by lexical actions).

The header also declares the YYSTYPE semantic type, and the variable yylval (which has that type), used to communicate the semantic value of each token to the parser. (At least, for tokens which have semantic values.) Similarly, if the parser uses location information, the header defines the YYLTYPE location type, and the variable yylloc of that type.

Since header dependencies cannot be circular, the parser does not have any header dependency on the scanner. It is for this reason that your bison input file must include a declaration of yylex.

That's fine for the classical interface between parser and scanner, which use global variables to communicate. And it also more or less works with a re-entrant ("pure") parser, in which the semantic value (and location, if used) are communicated through arguments to yylex, although the fact that the declaration of yylex is not automatic is more annoying.

Where it starts to break down is when the scanner is also re-entrant. In that case, the parser must call the scanner with an opaque scanner context object of type yyscan_t. yyscan_t "belongs to" the scanner, so it can only be defined in a header for the scanner. But as noted above, that would lead to a circular dependency chain. This reveals the weakness in the traditional model: the parser is a client of the scanner, so the fact that the scanner depends on the parser to define essential datastructures is a dependency inversion.

This is a very real problem because the public interface for the re-entrant scanner includes functions whose prototypes require parser-specific datatypes (YYSTYPE and YYLTYPE), while the parser prototype will almost certainly need to accept a scanner context object as an argument, so it cannot be declared without the scanner-specific datatype yyscan_t.

The usual solution to this problem is to break encapsulation by noting that yyscan_t is simply a void*, so it can be declared as such in the parser, thereby avoiding the need for the parser to #include the scanner header (as long as it does not require access to any other public methods declared in that header).

A less ugly solution, in my opinion, is to avoid this particular configuration altogether. Bison lets you request a reentrant "push parser", and that can be used along with a reentrant scanner without any of the above complications.

In the push parser model, the scanner is the top level function called to parse the file, and the scanner calls the parser each time it identifies a token. The header dependencies are no longer circular because the parser has no need to know anything about the scanner's context object, or for that matter about the prototype of yylex(). The scanner is now a client of the parser, and thus has a natural header dependency on the parser so the fact that the parser defines the token enums and semantic and location datatypes is no longer exceptional.

As well as simplifying the header dependencies between the two components, the push parser often simplifies control flow inside the scanner itself. In many use cases, a single scanner pattern will result in the identification of multiple tokens. In the traditional model, the scanner must keep a queue of tokens, releasing them one at a time when called by the parser. But in the push model, a scanner action can simply call the parser several times, once for each identified token. This model was popularised by the Lemon parser generator (part of sqlite3) and subsequently implemented by other parser generators including bison.

Recommended topics

Hot tags