OCaml + Menhir: How to parse OCaml like tuple-pattern?

Asked 28/10, 2015 at 16:7 Answered 29/10, 2015 at 3:2

I'm a quite beginner of menhir. I'm wondering how to parse OCaml like tuple-pattern in my own language, which is quite similar to OCaml.

For example, in the expression let a,b,c = ..., a, b, c should be parsed like Tuple (Var "a", Var "b", Var "c").

But, in following definition of parser, the above example is parsed as Tuple (Tuple (Var "a", Var "b"), Var "c"). I am wondering how to fix the following definition to parse patterns like ocaml.

I have checked OCaml's parser.mly, but I'm not sure how to implement to do that. I think my definition is similar to OCaml's one ... What a magic they use ?

%token LPAREN
%token RPAREN
%token EOF
%token COMMA
%left COMMA
%token <string> LIDENT
%token UNDERBAR
%nonassoc below_COMMA

%start <Token.token> toplevel
%%

toplevel:
| p = pattern EOF { p }

pattern:
| p = simple_pattern        { p }
| psec = pattern_tuple %prec below_COMMA
  { Ppat_tuple (List.rev psec) }

simple_pattern:
| UNDERBAR                  { Ppat_any }
| LPAREN RPAREN             { Ppat_unit }
| v = lident                { Ppat_var v }
| LPAREN p = pattern RPAREN { p }

pattern_tuple:
| seq = pattern_tuple; COMMA; p = pattern { p :: seq }
| p1 = pattern; COMMA; p2 = pattern { [p2; p1] }

lident:
| l = LIDENT { Pident l }

The result is the following:

[~/ocaml/error] menhir --interpret --interpret-show-cst ./parser.mly
File "./parser.mly", line 27, characters 2-42:
Warning: production pattern_tuple -> pattern_tuple COMMA pattern is never reduced.
Warning: in total, 1 productions are never reduced.
LIDENT COMMA LIDENT COMMA LIDENT
ACCEPT
[toplevel:
  [pattern:
    [pattern_tuple:
      [pattern:
        [pattern_tuple:
          [pattern: [simple_pattern: [lident: LIDENT]]]
          COMMA
          [pattern: [simple_pattern: [lident: LIDENT]]]
        ]
      ]
      COMMA
      [pattern: [simple_pattern: [lident: LIDENT]]]
    ]
  ]
  EOF
]

Album answered 28/10, 2015 at 16:7 Comment(4)

Why you do not give the definition of below_COMMA? Is it really below COMMA? – Threewheeler 28/10, 2015 at 16:21

I'm sorry, I added lacking information. – Album 28/10, 2015 at 16:29

Improve the example. Now my definition is a complete file. – Album 28/10, 2015 at 16:52

below_COMMA must come above COMMA to make its precedence below COMMA. Sounds crazy but the precedences must be listed from low to high. – Threewheeler 29/10, 2015 at 0:35

It contains a typical shift-reduce conflict, and you made a mistake at its resolution by specifying precedence. Please open any book about parsing by Yacc and check about shift-reduce conflicts and their resolution.

Let's see it using your rules. Suppose we have the following input and the parser is looking ahead at the second ,:

( p1 , p2 , ...
          ↑
         Yacc is looking at this second COMMA token

Their are two possibilities:

Reduce: take p1 , p2 as a pattern. It uses the second rule of pattern
Shift: consumes the token COMMA and shift the look ahead cursor right to try making the comma separated list longer.

You can see the conflict easily removing %prec below_COMMA from the pattern rule:

$ menhir z.mly   # the %prec thing is removed
...
File "z.mly", line 4, characters 0-9:
Warning: the precedence level assigned to below_COMMA is never useful.
Warning: one state has shift/reduce conflicts.
Warning: one shift/reduce conflict was arbitrarily resolved.

Many Yacc documents say in this case Yacc prefers shift, and this default usually matches with human intention, including your case. So one of the solutions is simply drop %prec below_COMMA thing and forget the warning.

If you do not like having shift reduce conflict warnings (That's the spirit!), you can explicitly state which rule should be chosen in this case using precedences, just like OCaml's parser.mly does. (BTW, OCaml's parser.mly is a jewel box of shift reduce resolutions. If you are not familiar, you should check one or two of them.)

Yacc chooses the rule with higher precedence at a shift reduce conflict. For shift, its precedence is the one of the token at the look ahead cursor, which is COMMA in this case. The precedence of reduce can be declared by %prec TOKEN suffix at the corresponding rule. If you do not specify it, I guess the precedence of a rule is undefined, and this is why the shift reduce conflict is warned if you remove %prec below_COMMA.

So now the question is: which has a higher precedence, COMMA or below_COMMA? This should be declared in the preamble of the mly file. (And this is why I have asked the questioner show that part.)

...
%left COMMA
...
%nonassoc below_COMMA

I skip what %left and %nonassoc mean, for all the Yacc books should explain them. Here below_COMMA pseudo token is below COMMA. It means below_COMMA has a higher precedence than COMMA. Therefore, the above example chooses Reduce, and gets ( (p1, p2), ... against the intention.

The correct precedence declaration is the opposite. To let Shift happen, below_COMMA must come above of COMMA, to have a lower precedence:

...
%nonassoc below_COMMA
%left COMMA
...

See OCaml's parser.mly. It exactly does like this. Putting "below thing above" sounds totally crazy, but it is not Menhir's fault. It is Yacc's unfortunate tradition. Blame it. OCaml's parser.mly has a comment already:

The precedences must be listed from low to high.

Threewheeler answered 29/10, 2015 at 3:2 Comment(0)

This is how I would rewrite it:

%token LPAREN
%token RPAREN
%token EOF
%token COMMA
%token  LIDENT
%token UNDERBAR

%start  toplevel
%%

toplevel:
| p = pattern EOF { p }

pattern:
| p = simple_pattern; tail = pattern_tuple_tail
      { match tail with
        | [] -> p
        | _ -> Ppat_tuple (p :: tail)
      }

pattern_tuple_tail:
| COMMA; p = simple_pattern; seq = pattern_tuple_tail { p :: seq }
|                                                     { [] }

simple_pattern:
| UNDERBAR                  { Ppat_any }
| LPAREN RPAREN             { Ppat_unit }
| v = lident                { Ppat_var v }
| LPAREN p = pattern RPAREN { p }

lident:
| l = LIDENT { Pident l }

A major change is that a tuple element is no longer a pattern but a simple_pattern. A pattern itself is a comma-separated sequence of one or more elements.

$ menhir --interpret --interpret-show-cst parser.mly 
LIDENT COMMA LIDENT COMMA LIDENT
ACCEPT
[toplevel:
  [pattern:
    [simple_pattern: [lident: LIDENT]]
    [pattern_tuple_tail:
      COMMA
      [simple_pattern: [lident: LIDENT]]
      [pattern_tuple_tail:
        COMMA
        [simple_pattern: [lident: LIDENT]]
        [pattern_tuple_tail:]
      ]
    ]
  ]
  EOF
]

Haws answered 28/10, 2015 at 20:20 Comment(0)

Recommended topics

Hot tags