How does backtracking work in peg.js (with example)?
Asked Answered
R

2

11

I've defined the following minimal Peg.js grammar:

start  =  "A1" / "A123"

which you can try in the sandbox.

I would have expected to match "A1" as well as "A123" (according to my notion of how backtracking works). But this is not the case: the grammar recognizes "A1" but not "A123".

Note: I am not looking for the advice "reverse the order of your terms" as in the related question How to transform a simple grammar into something which works in PEG.js (expected "a" but "a" found). Rather, I'm looking to understand the behavior I'm seeing, and why Peg.js's backtracking doesn't apply in this case. For an explanation of why reversing the order of my terms doesn't help, see the more realistic example below.


For a more realistic example, consider units parsing. A grammar should recognize metric units (like "m", "mol") with optional prefixes, like "mm", "mmol", as well as non-metric units like "yr", "week", or "mo".

The following Peg.js grammar won't recognize "mol" because it gets tripped up consuming "mo", and doesn't backtrack. (Changing the order of terms doesn't help; or rather, will cause "mo" to be recognized at the expense of "mol" or "mmol".)

start  =  nonmetric / metric / prefix metric
metric = "mol" / "l" / "m" / "g"
nonmetric = "yr" / "mo" / "week" / "day" / "hour"
prefix = "m" / "k" / "c"

I can do the analagous thing in Antlr with good success:

grammar units;
start  :  nonmetric | metric | prefix metric;
metric : 'mol' | 'l' | 'm' | 'g';
nonmetric : 'yr' | 'mo' | 'week' | 'day' | 'hour';
prefix : 'm' | 'k' | 'c';
Rankle answered 16/7, 2014 at 6:15 Comment(1)
Thanks for the nice examples to this problem when one tries to learn Peg.js coming from Antlr. It really helped me understanding what the hell was wrong with my grammar.Reyesreykjavik
D
17

The problem is with the concept of backtracking. PEG parsers don't backtrack like other recursive-descent parsers or Prolog do. Rather, when confronted with a choice, a PEG parser will try every option until one succeeds. Once one succeeds, it will commit to it no matter how the rule was invoked.

From the Wikipedia article:

Unlike in context-free grammars and regular expressions, however, these operators always behave greedily, consuming as much input as possible and never backtracking.

What you ask for in the complex case is the same that's asked in this question. The answer so far is Yes: you must tweak the rules in PEG grammars to make sure that the longest option always gets matched first, even if the result is a somewhat uglier grammar.

One way to tweak PEG grammars is to use lookaheads (that's one of the main reasons why lookaheads are featured in PEG):

start  =  nonmetric / metric / prefix metric
metric = "mol" / "l" / !"mo" "m" / "g"
nonmetric = "yr" / !"mol" "mo" / "week" / "day" / "hour"
prefix = !("mol"/"mo") "m" / "k" / "c"
Dolores answered 17/7, 2014 at 17:9 Comment(6)
Thanks for the background, clear explanation, and description of lookaheads w/ example!Rankle
Thank you for the explanation. For someone with little background in parsers, are there any alternatives you recommend that do offer backtracking? Antlr seems to be the next choiceGroup
ANTLR is predictive LL(*). It doesn't quite do backtracking, but it can handle a vast variety of parsing cases. antlr.org/papers/allstar-techreport.pdfDolores
@Dolores I've added a link to this comment in the PEG.js readmeLobectomy
@ChristofferBubach, see en.wikipedia.org/wiki/Parsing_expression_grammarDolores
Nice, @Futago-zaRyuu!Dolores
J
3

This is by design. It's up to you to specify right order or rules that will be used for matching.

The quote from the original white paper:

These tools do not make language syntax design easy, of course. In place of having to determine whether two possible alternatives in a CFG are ambiguous, PEGs present language designers with the analogous challenge of determining whether two alternatives in a ‘/’ expression can be reordered without affecting the language. This question is often obvious, but sometimes is not, and is undecid- able in general. As with discovering ambiguity in CFGs, however, we have the hope of finding automatic algorithms to identify order sensitivity or insensitivity conservatively in common situations.

In this simple case the PEG.js could be a little smarter and recognize that rules you specified are ambiguous. Probably worth to ask the author.

Jacquetta answered 6/8, 2014 at 5:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.