Is there any example of a language grammar that is possible for Yacc to express but impossible for Antlr4?

I've been trying to learn about language parsers recently, and I keep seeing comparisons of Yacc and Antlr (regarding LALR and LL). They always conclude with some wording like "LALR is more powerful", but I can't understand what that really means.

So could anyone please enlighten me as to what the word "powerful" means here?

I assume it means "Yacc can do something Antlr can't do". If that is so, I wish I could see an exact example of it.

Unlace answered 29/8, 2019 at 5:15 Comment(5)
It is a vague way of saying that the set of LALR(1) grammars is a superset of the set of LL(1) grammars, so the answer is 'yes'. You will find examples in any good compiler textbook.Normie
But ANTLR is an LL(*) (more precisely, ALL(*)) parser, not LL(1).Overskirt
Do you mean an example of a language? Or an example of a grammar? Or both? For instance, some grammars are not LL(1) but describe a language for which an LL(1) grammar exists. That happens when we "refactor" a grammar to make it LL(1) without changing the language.Genuflect
@Genuflect I mean both: an example of a grammar, and the language it produces, that Yacc could express but Antlr4 couldn'tUnlace
@IvanKochurkin There are theorems that state that any LALR(k) grammar can be rewritten as LALR(1), and similarly for LR(k) and LL(k), so that is a distinction without a difference.Normie

A language that is LR(1) but not LL(*)

The question Language theoretic comparison of LL and LR grammars has an answer with the following language that is LR(1) but not LL(*):

{ a^i b^j | i≥j }

That is, some number of a followed by an equal or lesser number of b.

The same language is cited by this answer to the similar question Example of an LR grammar that cannot be represented by LL?. However, the present question is different because that one says "LL", meaning LL(k), whereas here we're asking about LL(*) (and Antlr4).

Intuitive demonstration (not a proof)

Let's intuitively argue that this is LR(1) but not LL(*).

First, the LR(1) grammar (copied from the second linked answer):

S ::= a S | P
P ::= a P b | <empty>

Intuitively, this is LR(1) because an LR(1) parser can push any number of a symbols onto its stack, then when it gets to the first b, begin popping the corresponding a symbols in a,b pairs, using the first production for P. If it runs out of b symbols, it pops the remaining a symbols using the first production for S. If it runs out of a symbols while there are still b symbols left, it signals an error. (Remember, in this context, we're primarily concerned with recognition, so the output is either "yes" or "error".)
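
For concreteness, here is the same grammar rendered in Yacc notation (an untested sketch; it assumes a trivial lexer that returns the characters a and b as tokens). As far as I can tell the grammar is LALR(1) as well as LR(1), so yacc should accept it without conflicts:

    %%
    s : 'a' s            /* burn an excess a */
      | p
      ;
    p : 'a' p 'b'        /* a matched a...b pair */
      | /* empty */
      ;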

In contrast, this grammar is not LL(*). Intuitively, an LL(*) parser must decide, when it sees the first a, whether to use the first or second production of S. It would like to look ahead to see if there are as many b symbols as a symbols remaining, because if not, then it would know it has to use the first production to "burn" the excess a symbols. But LL(*) lookahead is limited to recognizing a regular language, and a regular language cannot recognize { a^i b^i } since it cannot "count".

Of course, the fact that one grammar is not LL(*) does not imply that the language is not LL(*), because there might be a more clever grammar. To prove the language is not LL(*), I would probably start with the formal definition, assume I had a grammar meeting those conditions, and then use a pumping-lemma-style argument to show it can't correctly recognize the language of interest. But I'll let the linked resources suffice as rigorous justification that the language is not LL(*).

Higher-level interpretation

The way I think about it, LL makes decisions on the way "down" the parse tree, while LR makes them on the way "up". To make a language that is not LL(k), we arrange it so a putative parser would have to commit to an interpretation of a symbol when the information it needs to do so is beyond the horizon of k symbols. To make it not LL(*), we need to put the crucial information beyond a horizon that can only be crossed by first recognizing a non-regular language.

In contrast, LR can push symbols onto its stack, delaying their interpretation until it has seen the end of the associated production and already constructed interpretations of everything in between.

To try to make this a bit more concrete, imagine a programming language that has two kinds of brace-enclosed things, say, code blocks and object literals (like JavaScript). Imagine they both can occur in the same context (unlike JavaScript):

  var x = { console.log("I am a code block"); /*result is*/ 6; };
  var x = { a:1, b:2 };

In that context, a parser encounters {. LL must decide immediately whether that's the start of a code block or an object literal. In JavaScript, object literal keys have to be identifiers or string literals, and the union of those two is a regular language, so an LL(*) parser can skip past the regex for "identifier or stringlit" to check for :, which would signal an object literal (otherwise a code block). A grammar sketch follows the trace below.

  {                    // hmmm, code or object?
  { a                  // possible object literal key
  { a :                // a-ha! definitely object literal
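
To make the hypothetical concrete, here is a small ANTLR-style grammar sketch for such a language (all rule names and token definitions are invented, and statements and values are drastically simplified). The point is that the key rule describes a regular language, so LL(*) lookahead can scan past it:

    grammar Braces;                       // hypothetical sketch
    braceThing    : codeBlock | objectLiteral ;
    codeBlock     : '{' stmt* '}' ;
    objectLiteral : '{' ( pair ( ',' pair )* )? '}' ;
    pair          : key ':' ID ;          // values simplified to a lone ID
    key           : ID | STRING ;         // regular: easy to skip past
    stmt          : ID ';' ;              // statements simplified likewise
    ID            : [a-zA-Z]+ ;
    STRING        : '"' ~'"'* '"' ;
    WS            : [ \t\r\n]+ -> skip ;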

If instead a key could be an arbitrary string-typed expression, then LL(*) is in trouble, because it has to balance parentheses to get past the putative key before it can check for the ':':

  {                    // start of object literal?
  { (                  // uh-oh ...
  { (a                 // I'm
  { (a ?               //     getting
  { (a ? b             //             lost
  { (a ? b :           // is this the ':' after a key? help!

In contrast, LR happily defers the interpretation of {, pushing it onto a stack, and proceeds, in effect, with two potential interpretations until some token disambiguates them.

Hopefully this provides some intuition for what sorts of things LR contains but LL(*) does not.

There are examples of the reverse (LL(*) but not LR), although I don't know what they look like ("not LR" is a hard class to think about); see the first linked question for details.

Antlr4 semantic predicates

Now, the question title actually asks about Antlr4. Antlr4 has semantic predicates, which effectively allow the programmer to insert arbitrary lookahead computation. So if you're willing to step outside the grammar formalism, there is in fact no limit (short of decidability) on what an Antlr4 parser can recognize.
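
For instance, here is an untested sketch (a combined grammar with Java actions; the counter name is mine) of how one might recognize { a^i b^j | i≥j } in Antlr4 by stepping outside the pure grammar: an embedded action counts the a symbols, and a semantic predicate permits each b only while the count is positive:

    grammar AB;                        // hypothetical sketch

    @parser::members { int n = 0; }

    s : ( 'a' { n++; } )*              // count the a's as they are consumed
        ( {n > 0}? 'b' { n--; } )*     // each b is gated by the predicate
        EOF ;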

Ferrochromium answered 7/9, 2019 at 17:50 Comment(5)
You can add arbitrary predicates to YACC reductions too, so that isn't a differentiating feature.Extravagant
@IraBaxter I wasn't intending to say it was, but I can see how it might appear that way. So that I can add it to my answer, do you have a reference for Yacc predicates? I don't see semantic predicates in classic Yacc, and it appears Bison only has them for the GLR parser.Ferrochromium
I haven't done this with bison or yacc, but I have done it with my own LALR parsers, and I assume you can do the same with their engines; after all, they are open source. You hack the parsing engine code: find the place where a reduction or shift is about to be proposed, and add your semantic check there. It isn't pretty or fun, but then no ad hoc change to a standard parsing engine is. Granted, this is easier when the designer of the parsing engine adds hooks for it.Extravagant
Same McPeak of Elkhound fame? You should have written my answer :-}Extravagant
@IraBaxter Yep, that was my MS thesis. :)Ferrochromium

Here's an example grammar for which there are inputs that neither ANTLR nor YACC can parse without hackery:

   stmt = declaration ';' ;
   stmt = expression ';' ;
   declaration = type  ID ;
   type = ID '*' ;
   expression = product ;
   product = ID ;
   product = product '*' ID ;

Here's an example input that neither ANTLR nor L(AL)R can parse:

x * y ;

using this grammar.

There are two possible parses: 1) as an expression statement (x multiplied by y), 2) as a declaration (y declared with type x*).
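
As a sketch, here is the same grammar in Yacc/Bison notation (the token declaration is mine). Feeding it to yacc or bison should report a shift/reduce conflict: after seeing ID with '*' as lookahead, the parser cannot choose between shifting (heading for type) and reducing product to ID. Yacc's default is to shift, so x * y ; would then always be parsed as a declaration:

    %token ID
    %%
    stmt        : declaration ';'
                | expression ';'
                ;
    declaration : type ID ;
    type        : ID '*' ;
    expression  : product ;
    product     : ID
                | product '*' ID
                ;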

This example is taken from the C language. (You could make a smaller grammar; I tried to leave it in a form whose realism is easy to see.)

L(AL)R parsers and ANTLR will provide you with at most one derivation, which means each of them will miss the alternative parse.

One can hack the parsing machinery to resolve this (as GCC famously used to do) by bringing in symbol table information. That tangles parsing with symbol table construction, which IMHO simply makes a mess of the parser. In particular, you can no longer decide what it accepts by just looking at the grammar.

The problem is that ANTLR and L(AL)R will not parse all context-free languages. They parse only (different) subsets; this means each can parse certain instances that the other cannot, and vice versa. That means neither is more powerful than the other.

If you want a more powerful parser, e.g., one that handles all context-free languages, you need to look at Earley parsers, or GLR or GLL parsers. Earley parsers aren't very efficient. GLR and GLL tend to be efficient on the parts of the grammar that don't introduce ambiguities (multiple parses) or require large amounts of lookahead. But most programming language constructs you might want to parse tend not to be confusing, because people have a hard time reading them in that case, too.
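
As a sketch of what that looks like in practice: Bison has a GLR mode. As I understand its documentation, declaring %glr-parser makes the generated parser fork at the conflict above and carry both parses forward; without further guidance it reports an ambiguity at runtime, and %dprec (or %merge) can be used to choose between (or combine) the surviving parses:

    %glr-parser
    %token ID
    %%
    stmt        : declaration ';'   %dprec 2   /* arbitrarily prefer the declaration reading */
                | expression ';'    %dprec 1
                ;
    declaration : type ID ;
    type        : ID '*' ;
    expression  : product ;
    product     : ID
                | product '*' ID
                ;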

Because of the limits of L(AL)R and ANTLR, my company (Semantic Designs) uses GLR parsers for some 40 languages. This includes full C++17, which has more ambiguous parses and long-lookahead cases than you can shake a stick at, including the above example; most of them are a lot more obscure. As far as I know, we have the only C++17 parser that uses a grammar directly; GCC and Clang use hand-coded recursive descent with lots of tangled symbol table checks.

[The author of the other answer, Scott McPeak, did build C++ parsers for older dialects of C++ using GLR, more or less at the same time we started to do this. Hats off to him.]

Extravagant answered 7/9, 2019 at 20:42 Comment(0)
