How does backtracking work in peg.js (with example)?

Asked 16/7, 2014 at 6:15 Answered 6/8, 2014 at 5:10

I've defined the following minimal Peg.js grammar:

start  =  "A1" / "A123"

which you can try in the sandbox.

I would have expected to match "A1" as well as "A123" (according to my notion of how backtracking works). But this is not the case: the grammar recognizes "A1" but not "A123".

Note: I am not looking for the advice "reverse the order of your terms" as in the related question How to transform a simple grammar into something which works in PEG.js (expected "a" but "a" found). Rather, I'm looking to understand the behavior I'm seeing, and why Peg.js's backtracking doesn't apply in this case. For an explanation of why reversing the order of my terms doesn't help, see the more realistic example below.

For a more realistic example, consider units parsing. A grammar should recognize metric units (like "m", "mol") with optional prefixes, like "mm", "mmol", as well as non-metric units like "yr", "week", or "mo".

The following Peg.js grammar won't recognize "mol" because it gets tripped up consuming "mo", and doesn't backtrack. (Changing the order of terms doesn't help; or rather, will cause "mo" to be recognized at the expense of "mol" or "mmol".)

start  =  nonmetric / metric / prefix metric
metric = "mol" / "l" / "m" / "g"
nonmetric = "yr" / "mo" / "week" / "day" / "hour"
prefix = "m" / "k" / "c"

I can do the analagous thing in Antlr with good success:

grammar units;
start  :  nonmetric | metric | prefix metric;
metric : 'mol' | 'l' | 'm' | 'g';
nonmetric : 'yr' | 'mo' | 'week' | 'day' | 'hour';
prefix : 'm' | 'k' | 'c';

Rankle answered 16/7, 2014 at 6:15 Comment(1)

Thanks for the nice examples to this problem when one tries to learn Peg.js coming from Antlr. It really helped me understanding what the hell was wrong with my grammar. – Reyesreykjavik 22/10, 2014 at 23:24

The problem is with the concept of backtracking. PEG parsers don't backtrack like other recursive-descent parsers or Prolog do. Rather, when confronted with a choice, a PEG parser will try every option until one succeeds. Once one succeeds, it will commit to it no matter how the rule was invoked.

From the Wikipedia article:

Unlike in context-free grammars and regular expressions, however, these operators always behave greedily, consuming as much input as possible and never backtracking.

What you ask for in the complex case is the same that's asked in this question. The answer so far is Yes: you must tweak the rules in PEG grammars to make sure that the longest option always gets matched first, even if the result is a somewhat uglier grammar.

One way to tweak PEG grammars is to use lookaheads (that's one of the main reasons why lookaheads are featured in PEG):

start  =  nonmetric / metric / prefix metric
metric = "mol" / "l" / !"mo" "m" / "g"
nonmetric = "yr" / !"mol" "mo" / "week" / "day" / "hour"
prefix = !("mol"/"mo") "m" / "k" / "c"

Dolores answered 17/7, 2014 at 17:9 Comment(6)

Thanks for the background, clear explanation, and description of lookaheads w/ example! – Rankle 19/7, 2014 at 18:28

Thank you for the explanation. For someone with little background in parsers, are there any alternatives you recommend that do offer backtracking? Antlr seems to be the next choice – Group 19/7, 2016 at 15:47

ANTLR is predictive LL(*). It doesn't quite do backtracking, but it can handle a vast variety of parsing cases. antlr.org/papers/allstar-techreport.pdf – Dolores 19/7, 2016 at 15:51

@Dolores I've added a link to this comment in the PEG.js readme – Lobectomy 20/8, 2017 at 12:14

@ChristofferBubach, see en.wikipedia.org/wiki/Parsing_expression_grammar – Dolores 7/1, 2020 at 1:10

Nice, @Futago-zaRyuu! – Dolores 7/1, 2020 at 1:11

This is by design. It's up to you to specify right order or rules that will be used for matching.

The quote from the original white paper:

These tools do not make language syntax design easy, of course. In place of having to determine whether two possible alternatives in a CFG are ambiguous, PEGs present language designers with the analogous challenge of determining whether two alternatives in a ‘/’ expression can be reordered without affecting the language. This question is often obvious, but sometimes is not, and is undecid- able in general. As with discovering ambiguity in CFGs, however, we have the hope of finding automatic algorithms to identify order sensitivity or insensitivity conservatively in common situations.

In this simple case the PEG.js could be a little smarter and recognize that rules you specified are ambiguous. Probably worth to ask the author.

Jacquetta answered 6/8, 2014 at 5:10 Comment(0)

Recommended topics

Hot tags