Limitations of PEG grammar & parser generators? [closed]

Asked 6/12, 2009 at 23:32 Answered 14/8, 2014 at 1:24

Solved parser-generator yard peg php-parser

I was enjoying using YARD a lot:

http://www.ootl.org/yard/

http://code.google.com/p/yardparser/

http://www.codeproject.com/KB/recipes/yard-tokenizer.aspx

I was able to construct fully functional calculator. I'm evaluating YARD to do PHP parser. Please kindly advise on limitations of PEG grammar & parser generators. Thank you very much!

Cythiacyto answered 6/12, 2009 at 23:32 Comment(1)

If parsing PHP, can I recommend using phc? – Frightfully 24/10, 2010 at 17:7

I think the big "problem" with PEGs is that they don't fit into the normal taxonomy of grammars as they operate in a fundamentally different way. Normal grammars are "backwards" in the sense that they describe all the possible sentences (programs) that can be generated. PEGs describe how to parse--they come at the problem from the other end.

In my view this is a more natural way to think about the problem, and certainly for any hand-written (recursive-descent) parser I wouldn't do anything else.

Cymry answered 6/12, 2009 at 23:45 Comment(10)

thanks DrPizza! I read that PEG cannot parse Python and C++ in context sensitive parts. Not sure this is true. I'm trying to write PHP parser and found PEG solutions very easy, compared to those Bison/Yacc. – Cythiacyto 7/12, 2009 at 3:42

Most parsers can't properly deal with context-sensitive grammars without hacks of some kind (e.g. for parsing C you make the parser feed back into the lexer so that it assigns the right symbol type to type names, so that they don't get treated as regular identifiers). PEGs are interesting because they can directly express the disambiguation rules that C and C++ use (I don't know about Python). Specifically, "if it looks like a declaration, it is". They can do this by ordering their rules so that the declaration rule is tried before the statement rule. – Cymry 7/12, 2009 at 4:53

That said, I don't know if PEGs can handle the entire C++ or Python grammars; I've not tried. – Cymry 7/12, 2009 at 4:54

Ordering rules doesn't help, if the meaning of the parse is determined by other information. C++ infamously allows "xy;" as a statement, with two interpretations: a declaration, or an arithmetic operation. No rule ordering will help you decide what this is. You need context information. C and C++ parsers often hack this by building a symbol table as they go; knowing that x is a type resolves the problem. But if the definition of x or y appears *after the statement, even this hack doesn't work. The safe bet is GLR parsers, which simply pick up both parses to resolve later. – Gooseflesh 8/12, 2009 at 10:38

+1 Nice example Ira! So I conclude that PEG are not suitable/optimal for C/C++ parsing. What about using it for PHP? – Cythiacyto 8/12, 2009 at 23:49

Ah yes, of course. The solution as ever is to have actions associated with rules; Boost's Spirit2 parser framework uses PEGs as its underlying model, and I believe it allows suitable actions--every successful type declaration should add its identifiers to a table of type names. In conjunction with PEG ordering (declaration rules are tried before expression rules), the PEG will do the right thing. The Spirit2 source is unfortunately the usual impenetrable stuff we expect from Boost. – Cymry 9/12, 2009 at 7:12

The real value of PEGs, btw, is in designing your own languages; using a PEG ensures that the language is unambiguously parsable, rather than the traditional approach to language design which is to either not care about parsing (and come up with abominable syntaxes like C and C++) or to design a syntax and then beat on it until it eventually becomes something that your tool (traditionally yacc) can actually parse. By making the fundamental operation parsing (rather than sentence-generation), PEGs make this aspect of language design much easier. – Cymry 9/12, 2009 at 7:15

(adding semantic actions that update a table of typenames does, however, cause a problem if you want to memoize your results, as is done in Packrat parsing) – Cymry 9/12, 2009 at 7:22

You shouldn't design your language around a parsing technology, unless you insist that your langauge will not evolve, ever, or you insist that any evolution use that same parsing technology. This is putting the cart before the horse. When you design the language, you want to go for expressiveness and readability. The back-room compiler guys may suffer a bit for that (I'm one of them), but there aren't very many of them, and there are presumably lots of lots of users of the language. Optimize for them, not for the compiler guys (or worse, just the parser guys). – Gooseflesh 14/8, 2014 at 14:41

I note nobody addressed the part of the question about parser generators in general. – Gooseflesh 14/8, 2014 at 16:44

The main limitation of PEG grammars is that they don't deal with ambiguity at all.

To be sure, this is also their strength since dealing with ambiguities is one of the most frustrating parts of using a CFG (context free grammar) tool.

With PEGs you deal with ambiguities explicitly by ordering the rule you want to match before another rule that would match ambiguously but which you don't want.

The problem is that you don't always even know about some or even any of the ambiguities in a language or a grammar and PEG generators, at least the ones I've tried, don't analyse the grammar for ambiguity to help you find them and then design and order your rules to deal with them the right way.

CFG parser generators like yacc and bison analyse your grammar and report all the ambiguities. Unfortunately they often report them in a pretty cryptic way that can be hard to make sense of. And of course it's often hard to fix the grammar to deal with them. But at least you will be aware that they exist.

With a PEG grammar you can be blissfully ignorant of the ambiguities in your conceptual grammar because once you make it a PEG it doesn't have ambiguities any more, it just has matching rules and maybe silently unreachable rules which would also match if they had higher precedence. These might not show up in your testing but may show up after release.

With CFG grammars you are forced to deal with ambiguities during development, but it won't be easy.

In the event I'm not making it clear, here's a six-year-old discussion by Joshua Haberman over on the Lambda the Ultimate programming languages blog: PEGs and Packrat Parsing are not the answer (archive.org).

Demakis answered 14/8, 2014 at 1:24 Comment(5)

once you make it PEG it doesn't have ambiguities anymore. True, you can take what PEG forces on you ("this preferred to that") as the answer. But in many cases, especially to support expressiveness, it is better for the language to be ambiguous and resolve that ambiguity using the non-context-free information from the code. While I won't claim that the ambiguities that C++ has everywhere are necessarily helpful, if you switch to GLR you can also be blissfully ignorant of ambiguities in the parsing process. (Does PEG do arbitrary lookahead?) See https://mcmap.net/q/81747/-why-can-39-t-c-be-parsed-with-a-lr-1-parser – Gooseflesh 14/8, 2014 at 14:46

I'm also not sure that PEG will do arbitrary lookahead. – Gooseflesh 14/8, 2014 at 14:49

I just started playing with PEG a few days ago so I'm no expert but I was pretty certain they to arbitrary lookahead, which is actually another thing I've seen people pick out as another mark against them. I've been hunting for a GLR tool for JavaScript too but haven't found one on a par with PEG.js and Jison. – Demakis 14/8, 2014 at 14:55

@IraBaxter what type of parsing technique do you recommend? – Celio 17/7, 2015 at 16:16

I've had very good experience with GLR. See my answer at Quora: qr.ae/RRQctF – Gooseflesh 17/7, 2015 at 19:7

Recommended topics

Hot tags