How can I manage mutual recursion, retaining associativity rules?

Asked 5/10, 2018 at 14:4 Answered 6/10, 2018 at 18:59

The overall question is:

How does my grammar have to look like to allow for arbitrarily nested expr := '(' expr ')' => expr | expr_without_short_closure and expr_without_short_closure := [expr_without_short_closure => expr] | yield expr_without_short_closure => expr | expr_without_short_closure 'or' expr | '(' expr ')', while still allowing for low-precedence left-associative operators like expr_without_short_closure 'or' expr?

Currently the LALR(1) bison grammar is structured as follows (presenting a self-contained part of the actual grammar file, simplified a little):

%left ','
%left T_LOGICAL_OR /* or */
%right T_YIELD
%right T_DOUBLE_ARROW /* => */
%left '+'

expr: /* entry point as well */
                expr_without_short_closure %prec ',' { $$ = $1; }
        |       expr_with_short_closure { $$ = $1; }
;

expr_with_short_closure:
                short_closure
        |       T_YIELD expr_without_short_closure T_DOUBLE_ARROW expr_with_short_closure { $$ = zend_ast_create(ZEND_AST_YIELD, $4, $2); }
;

short_closure:
                T_IDENTIFIER T_DOUBLE_ARROW expr { /* ... */ }
        |       '(' expr ')' T_DOUBLE_ARROW expr { /* ... */ }
;

expr_without_short_closure:
                T_IDENTIFIER %prec T_DOUBLE_ARROW { $$ = $1; }
        |       '(' expr ')' %prec T_DOUBLE_ARROW { $$ = $2; }
        |       T_YIELD expr_without_short_closure { $$ = zend_ast_create(ZEND_AST_YIELD, $2, NULL); }
        |       '[' array_pair_list ']' { $$ = $2; }
        |       T_YIELD expr_without_short_closure T_DOUBLE_ARROW expr_without_short_closure { $$ = zend_ast_create(ZEND_AST_YIELD, $4, $2); }
        |       expr_without_short_closure T_LOGICAL_OR expr_without_short_closure       { $$ = zend_ast_create_binary_op(ZEND_AST_OR, $1, $3); }
        |       expr_without_short_closure '+' expr_without_short_closure       { $$ = zend_ast_create_binary_op(ZEND_ADD, $1, $3); }
/*      | and about thirty similar alternate rules like the previous one */
;

non_empty_array_pair_list:
                non_empty_array_pair_list ',' array_pair { $$ = zend_ast_list_add($1, $3); }
        |       array_pair { $$ = zend_ast_create_list(1, ZEND_AST_ARRAY, $1); }
;

array_pair:
                expr_without_short_closure T_DOUBLE_ARROW expr
                        { $$ = zend_ast_create(ZEND_AST_ARRAY_ELEM, $3, $1); }
        |       expr_without_short_closure
                        { $$ = zend_ast_create(ZEND_AST_ARRAY_ELEM, $1, NULL); }
;

Essentially I'm trying to introduce "arrow functions", containing a parameter list on the left and an arbitrary expression on the right with a T_DOUBLE_ARROW in the middle.

Now the challenge is that the T_DOUBLE_ARROW token is already used in two places, namely the expr_without_short_closure T_DOUBLE_ARROW expr alternation in the array_pair rule and the T_YIELD expr_without_short_closure T_DOUBLE_ARROW expr_without_short_closure in the expr_without_short_closure.

This current grammar sort of works, but it (obviously) fails to parse for example:

[T_YIELD T_IDENTIFIER => T_IDENTIFIER => T_IDENTIFIER + T_IDENTIFIER => T_YIELD T_IDENTIFIER]
// must be grouped as:
[(T_YIELD T_IDENTIFIER => T_IDENTIFIER) => (T_IDENTIFIER + (T_IDENTIFIER => (T_YIELD T_IDENTIFIER)))]

In this case the expr_without_short_closure '+' expr_without_short_closure alternative fails as this only accepts expr_without_short_closure on the right-hand side, obviously disallowing short_closure there.

However, I cannot simply replace expr_without_short_closure by expr there as this conflicts with the expr_without_short_closure T_DOUBLE_ARROW expr expression (array_pair rule) or the T_YIELD expr_without_short_closure T_DOUBLE_ARROW expr_without_short_closure expression (expr_without_short_closure rule).

Now, I could try to put expr only the right-hand side of the expressions. This is fine, except for left-associative operations. Now, suddenly T_IDENTIFIER + T_IDENTIFIER T_LOGICAL_OR T_IDENTIFIER gets grouped as T_IDENTIFIER + (T_IDENTIFIER T_LOGICAL_OR T_IDENTIFIER) instead of the desired (T_IDENTIFIER + T_IDENTIFIER) T_LOGICAL_OR T_IDENTIFIER. (Why?)

Also I'd love to avoid the %prec in expr_without_short_closure %prec ',' (expr rule). For some reason it is needed (removing it causes a shift/reduce conflict on every single rule in expr_without_short_closure) and I guess there's also the root of my problem, though I don't really understand why (searching yielded answers like "associativity rules do not pass through indirection" - but I don't really see how I could avoid indirection at all here).

I'm trying to keep the question self-contained, but in case I missed something - the actual grammar file can be found at https://github.com/bwoebi/php-src/blob/0d98d8060bde88ac2e5904cb55ecb13d15316053/Zend/zend_language_parser.y#L898 - I think it's obvious that one does not really want to duplicate all the rules from expr_without_short_closure into expr_with_short_closure (and I'm not even sure whether that's really going to help with the low-precedence left associative operators).

Thirsty answered 5/10, 2018 at 14:4 Comment(2)

Can you clarify what you meant by "must be grouped as" in the "yield inside an array list" example? My impression is that php will parse [yield $a => 2 + 2] as yielding a key/value pair and then creating a an array list with the value sent into the generator. When you wrote [(T_YIELD T_IDENTIFIER) => (T_IDENTIFIER + (T_IDENTIFIER => (T_YIELD T_IDENTIFIER)))], I interpreted that as meaning that the yield just yields the value of the first identifier, and the first fat arrow is part of the array list syntax, which doesn't seem to me to be the existing parse. – Gib 6/10, 2018 at 23:54

@Gib You are right, this would be grouped as [(T_YIELD T_IDENTIFIER => (T_IDENTIFIER + (T_IDENTIFIER => (T_YIELD T_IDENTIFIER))))]. I must have simplified the example too much, try [T_YIELD T_IDENTIFIER => T_IDENTIFIER => T_IDENTIFIER + T_IDENTIFIER => T_YIELD T_IDENTIFIER] and [(T_YIELD T_IDENTIFIER => T_IDENTIFIER) => (T_IDENTIFIER + (T_IDENTIFIER => (T_YIELD T_IDENTIFIER)))] - I wanted to emphasize that the righthand part of the leftmost yield must not be parsed as an "arrow function". – Thirsty 7/10, 2018 at 8:23

I suspect you're going to have to do this without so much reliance on precedence declarations. But I haven't completely given up hope, yet :-) So I'll start by just laying out how precedence works, by way of showing why it won't work in the way you are trying to use it.

1. A precedence comparison is always between a rule and a token

The basic mantra about precedence rules is very simple: a precedence comparison always involves a rule (or production, like expr: expr '+' expr) and an incoming token, called the lookahead token. There are no exceptions. Although the form of declaring precedence levels makes it look like it is a comparison between two tokens, that is a convenient fiction which makes it a little more convenient to use in common cases. But the reality is, as I said before (and which bears repeating): A precedence comparison is always between a rule and a token.

To understand what that implies, it's useful to understand the nature of the LR parsing algorithm. An LR parser is a finite-state pushdown automaton, which means that it is an ordinary finite-state machine augmented with a single stack. In the case of LR parsing, the stack consists entirely of state IDs. A state of the automaton corresponds to a set of "items"; an item consists of a production rule and position in the production rule. In effect, a state represents a set of possible parse positions, all of which are being explored in parallel.

Every time the parser makes an ordinary state transition (in which an input token is read and the rules are used to move to the next state), the target state is also pushed onto the stack. This is called a "shift" transition, because the input token is shifted onto the parser. This can only happen if there is one or more items in the state's item set in which the lookahead token is either the terminal immediately following the position or one of the tokens which could start the non-terminal immediately following the position.

But there is also another kind of transition: a "reduction" transition. A reduction transition is how the parser recognizes that a production rule has been identified. (It's called a reduction because it reduces the right-hand side of the production by replacing it with the non-terminal on its left-hand side.) To perform this reduction, the automaton does two things: first, it pops one state off of the stack for each symbol on the right-hand side of the rule. Second, it shifts the non-terminal by using the transition rule for that non-terminal (and, as with a shift, pushing the resulting state ID onto the stack.)

Although a reduction transition does not absorb the lookahead token, it does take it into account. For the reduction transition to be feasible, it must be possible to shift the lookahead token after the reduction (or reductions, since more than one might be possible). These lookahead sets are computed during the construction of the parsing automaton; the state machines are all static.

So a shift transition corresponds to the decision that it is not yet possible to recognise any right-hand side, whereas a reduction transition corresponds to the decision that some production has been recognised.

Sometimes it will occur that both a shift and a reduction are available: that is, the parser state is at the end of some production, but it is also at a point in a different production in which the lookahead token is one of the possible next tokens.

This is called a "shift-reduce conflict", because both a shift and a reduce are possible. To resolve this conflict, the parser generator (not the parser) eliminates one of the transitions from the state's transition table. If there are no applicable precedence relations, the reduce action is eliminated. (In other words, the parser prefers to shift.) But if a configured precedence is available, the parser generator uses it by comparing the precedence level of the available reduction with the precedence level of the lookahead token. Whichever is greater wins (ties are resolved with associativity).

You can actually see the precedence rules at work if you use a recent version of bison and provide the --report=all option, which shows a bit more information than the -v option. (In both cases, the report is written to <filename>.output unless you supply a custom report filename.) I encourage you to do this.

2. Precedence is not inherited

A consequence of the static nature of precedence decisions is that precedence is not inherited through a reduction. We can see that through a very simple example.

We start with this trivial grammar:

%token NUMBER
%left '+'
%%
expr: NUMBER
    | expr '+' expr

This leads to a machine with five states, of which the last one is of particular interest: (excerpt from the file precedence.output after bison --report=all precedence.y)

State 5

    2 expr: expr . '+' expr
    2     | expr '+' expr .  [$end, '+']

    $default  reduce using rule 2 (expr)

    Conflict between rule 2 and token '+' resolved as reduce (%left '+').

What we see here is that the parser has reached a state where it is possible to advance the . (which represents the parse progress) by shifting a +, or to wait until after reducing expr '+' expr. Because addition is left-associative, the reduction is correct; this will cause 2 + 3 · + 4 to immediately reduce to expr · + 4, which is equivalent to saying that the parse is effectively (2 + 3) + 4.

Now, let's add a level of indirection:

%token NUMBER
%left '+'
%%
expr : NUMBER
     | left '+' right
left : expr
right: expr

In the new machine, State 5 is a bit different:

State 5

    1 expr: . NUMBER
    2     | . left '+' right
    2     | left '+' . right
    3 left: . expr
    4 right: . expr

    NUMBER  shift, and go to state 1

    expr   go to state 6
    left   go to state 3
    right  go to state 7

Now there is no conflict at all, because left and right are different non-terminals. So there is no need for the precedence rule at all, and it turns out to be unused. But note that in this new machine's state 5, the parser recognizes that it might be about to parse a left or about to parse a right (in the last two rules numbered 3 and 4). And here is the rub, in state 6:

State 6

    3 left: expr .  ['+']
    4 right: expr .  [$end, '+']

    $end      reduce using rule 4 (right)
    '+'       reduce using rule 3 (left)
    '+'       [reduce using rule 4 (right)]
    $default  reduce using rule 3 (left)

Once it manages to parse an expr, it needs to decide whether it's a left or a right. Here the conflict is between two different reductions, so it is a reduce-reduce conflict. And since precedence always compares a rule with a terminal, it does not apply to a situation where two rules need to be compared. So the conflict is not resolved with precedence.

(The conflict is resolved using yacc/bison's default resolution algorithm for reduce/reduce conflicts: Choose the rule which comes first in the file.)

So if the left and right operands of the + operation really have different grammars which overlap, we're going to have a hard time resolving the ambiguity with precedence declarations.

At this point, we might just give up on precedence (which we might have to do in the end anyway) but I thought it would be worthwhile trying something which might work. It didn't work perfectly, but the attempt was interesting enough that I thought it worthwhile presenting.

3. A brief exploration of the issues

I'm sure you've gotten this far yourself, because your grammar seems to include some of the usually-suggested workarounds for LR(2) grammars. But it seemed useful to try to reduce the problem to a minimum in order to be clear about the possible solutions.

In reality, there are three distinct issues here:

the "short closure" syntax is LR(2); that is, it requires two tokens of lookahead to decide between two different reductions;
the => token is used in two mutually incompatible ways, making it necessary to define two different expr syntaxes depending on context;
the preface to the short closure -- the parameter list and following => -- has asymmetric precedence.

The third issue is not much different from the syntax of the yield prefix operator, for which the grammar already has a solution (whether or not it is what the language designers and/or users would have wished for), so I'm going to leave that for later (or perhaps for a different essay) and focus on the first two issues. [Note 2]

The essence of the problem lies in the following two snippets of code (actually, I'm only interested in the expr following the assignment operator, but it seemed more readable to provide a full context):

b = ( a ) + 2
b = ( a ) => 2

For the rest of this exposition, we'll assume that the parser has just read the token a.

These are both special cases, each of a different syntax, which are roughly:

expr : expr '+' expr

and

expr : parameter_list "=>" expr

For completeness, we also need to see:

expr           : '(' expr ')'
               | ID
parameter_list : '(' ')' | '(' parameters ')'
parameters     : parameters ',' ID
               | ID

Other instances of these two syntaxes are unproblematic:

b = ( a + 3 ) + 2
b = ( a , c ) => 2

Here ( a + 3 ) cannot be a parameter_list and ( a , c ) cannot be an expr, so only one of the rules applies in each case, and the + and , tokens are each sufficient to exclude the other alternative. But in the case of ( a ) (with the parser looking at the ) token), it is not yet possible to know which way to jump.

Unfortunately, the parser needs to know this, because it has to choose between:

expr       : ID
parameters : ID

It needs to proceed with one of the rules:

expr           : '(' expr ')'

parameter_list : '(' parameters ')'

but to do so it must choose between on of the two ID reductions. Since that decision cannot be made based only on one lookahead token, bison reports a reduce/reduce conflict, and as we've seen above reduce/reduce conflicts cannot be resolved with precedence level declarations.

4. A partial solution

If the parser could look one more token into the future, it would see the token following the ), which would be sufficient to make a decision: if the second next token is => then it must be in a parameter_list; otherwise, it must be in an expr. So the grammar (or a simplified version of it) is LR(2), not LR(1). It would be nice if bison could generate LR(2) grammars, but it can't. [Note 1] So we need to find a different solution.

There is a different solution, because there is no such thing as an LR(2) language. It's easy to prove that any language with an LR(k) grammar also has an equivalent LR(1) grammar -- equivalent in the sense that the original parse tree can be mechanically derived from the parse tree for the LR(1) grammar. This equivalent grammar can even be generated algorithmically, so a mathematician might say "a solution exists". Unfortunately it's not a particularly practical solution, because there are no tools (that I know of) which actually do the transform -- and certainly bison doesn't -- and because the transform makes the grammar much, much larger. Still, the fact that an LR(1) grammar must exist makes it worthwhile trying to find one.

The basic approach to turning an LR(2) grammar into an LR(1) grammar is to delay the decision. In the actual grammar, the problem is that the parser needs to decide between parameter_list and expr before it has enough information to do so; we can make it's job easier by rewriting the grammar so that the decision can be made later.

We can start with the following minimum grammar, as above:

%token ID
%right "=>"
%left '+'
%%
expr           : expr '+' expr
               | parameter_list "=>" expr
               | '(' expr ')'
               | ID
parameter_list : '(' ')' | '(' parameters ')'
parameters     : parameters ',' ID
               | ID

Rather similar to the "left/right" example above, this grammar has a reduce/reduce conflict in state 5:

State 5

    4 expr: ID .  ['+', ')']
    8 parameters: ID .  [')', ',']

    ')'       reduce using rule 4 (expr)
    ')'       [reduce using rule 8 (parameters)]
    ','       reduce using rule 8 (parameters)
    $default  reduce using rule 4 (expr)

As a first approximation to a solution, we can add a few simple redundant rules:

%token ID
%right "=>"
%left '+'
%%
expr           : paren_id_paren
parameter_list : paren_id_paren
paren_id_paren : '(' ID ')'
expr           : expr '+' expr
               | parameter_list "=>" expr
               | '(' expr ')'
               | ID
parameter_list : '(' ')' | '(' parameters ')'
parameters     : parameters ',' ID
               | ID

Running this through bison shows that we now have a state with a three-way conflict (shift/reduce/reduce):

State 6

    3 paren_id_paren: '(' ID . ')'
    7 expr: ID .  ['+', ')']
   11 parameters: ID .  [')', ',']

    ')'  shift, and go to state 13

    ')'       [reduce using rule 7 (expr)]
    ')'       [reduce using rule 11 (parameters)]
    ','       reduce using rule 11 (parameters)
    $default  reduce using rule 7 (expr)

This is the state where the parser has just read '(' ID and the lookahead token is ). Because the new grammar is ambiguous, every input containing this sequence can be parsed in two ways: either with the shift or with one of the two reductions. The parser still cannot tell which reduction to use. But the shift always works, and because bison/yacc's default conflict resolution algorithm is to prefer to shift, the shift is what has been baked into the parse automaton. And that's great, because it is exactly what we want. The only downside is that the parser generator will produce a warning every time it is run, and some people really hate seeing warnings during a production build.

I don't mean to disparage the distaste for warnings; I share it. But I also note that this sort of solution was actually contemplated by the original authors of yacc, and an example of it is even described in the Dragon book in the section about using yacc for ambiguous grammars. It's precisely why the yacc default conflict resolution algorithm works the way it does. Bison and yacc even implement a pair of directives whose purpose is to silence this warning when it is expected. So we could just leave it at that and go on to the other issue (the dual use of =>), but while I was thinking about this question, it occurred to me that it might be possible to use precedence levels to provide an explicit resolution, in accordance with the Bison manual's recommendation:

we don’t recommend the use of %expect (except ‘%expect 0’!), as an equal number of conflicts does not mean that they are the same. When possible, you should rather use precedence directives to fix the conflicts explicitly. (Emphasis in original.)

The precedence declaration needs to prefer shifting a ) to reducing ID. Put that way, the declaration is straight-forward:

%token ID
%precedence ID
%precedence ')'
%right "=>"
%left '+'
%%
expr           : paren_id_paren
parameter_list : paren_id_paren
paren_id_paren : '(' ID ')'
expr           : expr '+' expr
               | parameter_list "=>" expr
               | '(' expr ')'
               | ID
parameter_list : '(' ')' | '(' parameters ')'
parameters     : parameters ',' ID
               | ID

That works fine, so we can move on to the second issue and see if the solution holds up in context.

Still to be continued

Notes

In fact, yacc generates LALR(1) grammars, which are slightly more limited in their use of lookahead but the difference between LR(1) and LALR(1) need not trouble us here.

Bison is capable of generating GLR grammars, which will work with any unambiguous grammar, and this grammar is unambiguous. However, many projects are reluctant to use GLR grammars because of perceived inefficiencies and because of restrictions on actions. If that's not the case here, using a GLR grammar is far and away the simplest solution.
The pre-existing use of => has reasonably well-defined precedence, which is fully specified by the pre-existing precedence level declarations.

Gib answered 6/10, 2018 at 18:59 Comment(8)

I am aware of what shift and reduce means; I've even looked it up again right before asking the question. I haven't been aware of --report=all, using only -v until now. You're raising a good point with A precedence comparison is always between a rule and a token. (which pretty much describes the issue I'm currently having) - however I'm curious, how I can avoid the need of the extra rule? Waiting for your "to be continued" part :-) – Thirsty 6/10, 2018 at 21:24

@bwoebi: I'm sure you are but not everyone who will read this answer in the future is. Every time I answer a precedence-related question, I try to improve the presentation in the hopes that I won't have to write it again :) But the actual solution to your problem is not simple, so it will take me a while. I'll do it as soon as I can. – Gib 6/10, 2018 at 21:27

Yeah, no problem ;-) Anyway thanks for your effort - I've already tried for a few days to no avail. I just hope the verdict is not "impossible using a LALR(1) parser like bison" in the end. – Thirsty 6/10, 2018 at 21:39

@bwoebi: The language is still LALR(1), but the grammar is tricky. – Gib 6/10, 2018 at 21:40

I've been busy with other things last months, but I still have this issue - it'd be awesome to know a solution to this (type of) problem(s). – Thirsty 11/12, 2018 at 13:49

@bwoebi: i tried to start a chat with you but SO chat is useless. I'll try to get in touch later. – Gib 11/12, 2018 at 19:29

only checked my inbox right now - and yes, I agree, anything but => would've been easier, but that's what people seem to really want :-/ – Thirsty 12/12, 2018 at 12:28

@bwoebi: yeah, I get that. Certainly a grammar exists. The problem will be making the grammar scalable. A preprocessing step might be the best solution to that. – Gib 12/12, 2018 at 14:36

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++