I suspect you're going to have to do this without so much reliance on precedence declarations. But I haven't completely given up hope, yet :-) So I'll start by just laying out how precedence works, by way of showing why it won't work in the way you are trying to use it.
1. A precedence comparison is always between a rule and a token
The basic mantra about precedence rules is very simple: a precedence comparison always involves a rule (or production, like expr: expr '+' expr
) and an incoming token, called the lookahead token. There are no exceptions. Although the form of declaring precedence levels makes it look like it is a comparison between two tokens, that is a convenient fiction which makes it a little more convenient to use in common cases. But the reality is, as I said before (and which bears repeating): A precedence comparison is always between a rule and a token.
To understand what that implies, it's useful to understand the nature of the LR parsing algorithm. An LR parser is a finite-state pushdown automaton, which means that it is an ordinary finite-state machine augmented with a single stack. In the case of LR parsing, the stack consists entirely of state IDs. A state of the automaton corresponds to a set of "items"; an item consists of a production rule and position in the production rule. In effect, a state represents a set of possible parse positions, all of which are being explored in parallel.
Every time the parser makes an ordinary state transition (in which an input token is read and the rules are used to move to the next state), the target state is also pushed onto the stack. This is called a "shift" transition, because the input token is shifted onto the parser. This can only happen if there is one or more items in the state's item set in which the lookahead token is either the terminal immediately following the position or one of the tokens which could start the non-terminal immediately following the position.
But there is also another kind of transition: a "reduction" transition. A reduction transition is how the parser recognizes that a production rule has been identified. (It's called a reduction because it reduces the right-hand side of the production by replacing it with the non-terminal on its left-hand side.) To perform this reduction, the automaton does two things: first, it pops one state off of the stack for each symbol on the right-hand side of the rule. Second, it shifts the non-terminal by using the transition rule for that non-terminal (and, as with a shift, pushing the resulting state ID onto the stack.)
Although a reduction transition does not absorb the lookahead token, it does take it into account. For the reduction transition to be feasible, it must be possible to shift the lookahead token after the reduction (or reductions, since more than one might be possible). These lookahead sets are computed during the construction of the parsing automaton; the state machines are all static.
So a shift transition corresponds to the decision that it is not yet possible to recognise any right-hand side, whereas a reduction transition corresponds to the decision that some production has been recognised.
Sometimes it will occur that both a shift and a reduction are available: that is, the parser state is at the end of some production, but it is also at a point in a different production in which the lookahead token is one of the possible next tokens.
This is called a "shift-reduce conflict", because both a shift and a reduce are possible. To resolve this conflict, the parser generator (not the parser) eliminates one of the transitions from the state's transition table. If there are no applicable precedence relations, the reduce action is eliminated. (In other words, the parser prefers to shift.) But if a configured precedence is available, the parser generator uses it by comparing the precedence level of the available reduction with the precedence level of the lookahead token. Whichever is greater wins (ties are resolved with associativity).
You can actually see the precedence rules at work if you use a recent version of bison and provide the --report=all
option, which shows a bit more information than the -v
option. (In both cases, the report is written to <filename>.output
unless you supply a custom report filename.) I encourage you to do this.
2. Precedence is not inherited
A consequence of the static nature of precedence decisions is that precedence is not inherited through a reduction. We can see that through a very simple example.
We start with this trivial grammar:
%token NUMBER
%left '+'
%%
expr: NUMBER
| expr '+' expr
This leads to a machine with five states, of which the last one is of particular interest: (excerpt from the file precedence.output
after bison --report=all precedence.y
)
State 5
2 expr: expr . '+' expr
2 | expr '+' expr . [$end, '+']
$default reduce using rule 2 (expr)
Conflict between rule 2 and token '+' resolved as reduce (%left '+').
What we see here is that the parser has reached a state where it is possible to advance the .
(which represents the parse progress) by shifting a +
, or to wait until after reducing expr '+' expr
. Because addition is left-associative, the reduction is correct; this will cause 2 + 3 · + 4
to immediately reduce to expr · + 4
, which is equivalent to saying that the parse is effectively (2 + 3) + 4
.
Now, let's add a level of indirection:
%token NUMBER
%left '+'
%%
expr : NUMBER
| left '+' right
left : expr
right: expr
In the new machine, State 5 is a bit different:
State 5
1 expr: . NUMBER
2 | . left '+' right
2 | left '+' . right
3 left: . expr
4 right: . expr
NUMBER shift, and go to state 1
expr go to state 6
left go to state 3
right go to state 7
Now there is no conflict at all, because left
and right
are different non-terminals. So there is no need for the precedence rule at all, and it turns out to be unused. But note that in this new machine's state 5, the parser recognizes that it might be about to parse a left
or about to parse a right
(in the last two rules numbered 3 and 4). And here is the rub, in state 6:
State 6
3 left: expr . ['+']
4 right: expr . [$end, '+']
$end reduce using rule 4 (right)
'+' reduce using rule 3 (left)
'+' [reduce using rule 4 (right)]
$default reduce using rule 3 (left)
Once it manages to parse an expr
, it needs to decide whether it's a left
or a right
. Here the conflict is between two different reductions, so it is a reduce-reduce conflict. And since precedence always compares a rule with a terminal, it does not apply to a situation where two rules need to be compared. So the conflict is not resolved with precedence.
(The conflict is resolved using yacc/bison's default resolution algorithm for reduce/reduce conflicts: Choose the rule which comes first in the file.)
So if the left and right operands of the +
operation really have different grammars which overlap, we're going to have a hard time resolving the ambiguity with precedence declarations.
At this point, we might just give up on precedence (which we might have to do in the end anyway) but I thought it would be worthwhile trying something which might work. It didn't work perfectly, but the attempt was interesting enough that I thought it worthwhile presenting.
3. A brief exploration of the issues
I'm sure you've gotten this far yourself, because your grammar seems to include some of the usually-suggested workarounds for LR(2) grammars. But it seemed useful to try to reduce the problem to a minimum in order to be clear about the possible solutions.
In reality, there are three distinct issues here:
the "short closure" syntax is LR(2); that is, it requires two tokens of lookahead to decide between two different reductions;
the =>
token is used in two mutually incompatible ways, making it necessary to define two different expr
syntaxes depending on context;
the preface to the short closure -- the parameter list and following =>
-- has asymmetric precedence.
The third issue is not much different from the syntax of the yield
prefix operator, for which the grammar already has a solution (whether or not it is what the language designers and/or users would have wished for), so I'm going to leave that for later (or perhaps for a different essay) and focus on the first two issues. [Note 2]
The essence of the problem lies in the following two snippets of code (actually, I'm only interested in the expr
following the assignment operator, but it seemed more readable to provide a full context):
b = ( a ) + 2
b = ( a ) => 2
For the rest of this exposition, we'll assume that the parser has just read the token a
.
These are both special cases, each of a different syntax, which are roughly:
expr : expr '+' expr
and
expr : parameter_list "=>" expr
For completeness, we also need to see:
expr : '(' expr ')'
| ID
parameter_list : '(' ')' | '(' parameters ')'
parameters : parameters ',' ID
| ID
Other instances of these two syntaxes are unproblematic:
b = ( a + 3 ) + 2
b = ( a , c ) => 2
Here ( a + 3 )
cannot be a parameter_list
and ( a , c )
cannot be an expr
, so only one of the rules applies in each case, and the +
and ,
tokens are each sufficient to exclude the other alternative. But in the case of ( a )
(with the parser looking at the )
token), it is not yet possible to know which way to jump.
Unfortunately, the parser needs to know this, because it has to choose between:
expr : ID
parameters : ID
It needs to proceed with one of the rules:
expr : '(' expr ')'
or
parameter_list : '(' parameters ')'
but to do so it must choose between on of the two ID
reductions. Since that decision cannot be made based only on one lookahead token, bison reports a reduce/reduce conflict, and as we've seen above reduce/reduce conflicts cannot be resolved with precedence level declarations.
4. A partial solution
If the parser could look one more token into the future, it would see the token following the )
, which would be sufficient to make a decision: if the second next token is =>
then it must be in a parameter_list
; otherwise, it must be in an expr
. So the grammar (or a simplified version of it) is LR(2), not LR(1). It would be nice if bison could generate LR(2) grammars, but it can't. [Note 1] So we need to find a different solution.
There is a different solution, because there is no such thing as an LR(2) language. It's easy to prove that any language with an LR(k) grammar also has an equivalent LR(1) grammar -- equivalent in the sense that the original parse tree can be mechanically derived from the parse tree for the LR(1) grammar. This equivalent grammar can even be generated algorithmically, so a mathematician might say "a solution exists". Unfortunately it's not a particularly practical solution, because there are no tools (that I know of) which actually do the transform -- and certainly bison doesn't -- and because the transform makes the grammar much, much larger. Still, the fact that an LR(1) grammar must exist makes it worthwhile trying to find one.
The basic approach to turning an LR(2) grammar into an LR(1) grammar is to delay the decision. In the actual grammar, the problem is that the parser needs to decide between parameter_list
and expr
before it has enough information to do so; we can make it's job easier by rewriting the grammar so that the decision can be made later.
We can start with the following minimum grammar, as above:
%token ID
%right "=>"
%left '+'
%%
expr : expr '+' expr
| parameter_list "=>" expr
| '(' expr ')'
| ID
parameter_list : '(' ')' | '(' parameters ')'
parameters : parameters ',' ID
| ID
Rather similar to the "left/right" example above, this grammar has a reduce/reduce conflict in state 5:
State 5
4 expr: ID . ['+', ')']
8 parameters: ID . [')', ',']
')' reduce using rule 4 (expr)
')' [reduce using rule 8 (parameters)]
',' reduce using rule 8 (parameters)
$default reduce using rule 4 (expr)
As a first approximation to a solution, we can add a few simple redundant rules:
%token ID
%right "=>"
%left '+'
%%
expr : paren_id_paren
parameter_list : paren_id_paren
paren_id_paren : '(' ID ')'
expr : expr '+' expr
| parameter_list "=>" expr
| '(' expr ')'
| ID
parameter_list : '(' ')' | '(' parameters ')'
parameters : parameters ',' ID
| ID
Running this through bison shows that we now have a state with a three-way conflict (shift/reduce/reduce):
State 6
3 paren_id_paren: '(' ID . ')'
7 expr: ID . ['+', ')']
11 parameters: ID . [')', ',']
')' shift, and go to state 13
')' [reduce using rule 7 (expr)]
')' [reduce using rule 11 (parameters)]
',' reduce using rule 11 (parameters)
$default reduce using rule 7 (expr)
This is the state where the parser has just read '(' ID
and the lookahead token is )
. Because the new grammar is ambiguous, every input containing this sequence can be parsed in two ways: either with the shift or with one of the two reductions. The parser still cannot tell which reduction to use. But the shift always works, and because bison/yacc's default conflict resolution algorithm is to prefer to shift, the shift is what has been baked into the parse automaton. And that's great, because it is exactly what we want. The only downside is that the parser generator will produce a warning every time it is run, and some people really hate seeing warnings during a production build.
I don't mean to disparage the distaste for warnings; I share it. But I also note that this sort of solution was actually contemplated by the original authors of yacc, and an example of it is even described in the Dragon book in the section about using yacc for ambiguous grammars. It's precisely why the yacc default conflict resolution algorithm works the way it does. Bison and yacc even implement a pair of directives whose purpose is to silence this warning when it is expected. So we could just leave it at that and go on to the other issue (the dual use of =>
), but while I was thinking about this question, it occurred to me that it might be possible to use precedence levels to provide an explicit resolution, in accordance with the Bison manual's recommendation:
we don’t recommend the use of %expect
(except ‘%expect 0’
!), as an equal number of conflicts does not mean that they are the same. When possible, you should rather use precedence directives to fix the conflicts explicitly. (Emphasis in original.)
The precedence declaration needs to prefer shifting a )
to reducing ID
. Put that way, the declaration is straight-forward:
%token ID
%precedence ID
%precedence ')'
%right "=>"
%left '+'
%%
expr : paren_id_paren
parameter_list : paren_id_paren
paren_id_paren : '(' ID ')'
expr : expr '+' expr
| parameter_list "=>" expr
| '(' expr ')'
| ID
parameter_list : '(' ')' | '(' parameters ')'
parameters : parameters ',' ID
| ID
That works fine, so we can move on to the second issue and see if the solution holds up in context.
Still to be continued
Notes
In fact, yacc generates LALR(1) grammars, which are slightly more limited in their use of lookahead but the difference between LR(1) and LALR(1) need not trouble us here.
Bison is capable of generating GLR grammars, which will work with any unambiguous grammar, and this grammar is unambiguous. However, many projects are reluctant to use GLR grammars because of perceived inefficiencies and because of restrictions on actions. If that's not the case here, using a GLR grammar is far and away the simplest solution.
The pre-existing use of =>
has reasonably well-defined precedence, which is fully specified by the pre-existing precedence level declarations.
yield
inside an array list" example? My impression is that php will parse[yield $a => 2 + 2]
as yielding a key/value pair and then creating a an array list with the value sent into the generator. When you wrote[(T_YIELD T_IDENTIFIER) => (T_IDENTIFIER + (T_IDENTIFIER => (T_YIELD T_IDENTIFIER)))]
, I interpreted that as meaning that theyield
just yields the value of the first identifier, and the first fat arrow is part of the array list syntax, which doesn't seem to me to be the existing parse. – Gib[(T_YIELD T_IDENTIFIER => (T_IDENTIFIER + (T_IDENTIFIER => (T_YIELD T_IDENTIFIER))))]
. I must have simplified the example too much, try[T_YIELD T_IDENTIFIER => T_IDENTIFIER => T_IDENTIFIER + T_IDENTIFIER => T_YIELD T_IDENTIFIER]
and[(T_YIELD T_IDENTIFIER => T_IDENTIFIER) => (T_IDENTIFIER + (T_IDENTIFIER => (T_YIELD T_IDENTIFIER)))]
- I wanted to emphasize that the righthand part of the leftmost yield must not be parsed as an "arrow function". – Thirsty