Why can't C++ be parsed with a LR(1) parser?
Asked Answered
L

6

162

I was reading about parsers and parser generators and found this statement in wikipedia's LR parsing -page:

Many programming languages can be parsed using some variation of an LR parser. One notable exception is C++.

Why is it so? What particular property of C++ causes it to be impossible to parse with LR parsers?

Using google, I only found that C can be perfectly parsed with LR(1) but C++ requires LR(∞).

Lyford answered 28/10, 2008 at 13:49 Comment(2)
Just like: you need to understand recursion to learn recursion ;-).Andonis
"You'll understand parsers once you'll parse this phrase."Niblick
T
99

There is an interesting thread on Lambda the Ultimate that discusses the LALR grammar for C++.

It includes a link to a PhD thesis that includes a discussion of C++ parsing, which states that:

"C++ grammar is ambiguous, context-dependent and potentially requires infinite lookahead to resolve some ambiguities".

It goes on to give a number of examples (see page 147 of the pdf).

The example is:

int(x), y, *const z;

meaning

int x;
int y;
int *const z;

Compare to:

int(x), y, new int;

meaning

(int(x)), (y), (new int));

(a comma-separated expression).

The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one.

Tacnaarica answered 28/10, 2008 at 14:1 Comment(4)
It'd be cool to have some summary about the page 147 on this page. I'm going to read that page though. (+1)Lyford
The example is: int(x), y, *const z; //meaning: int x; int y; int *const z; (a sequence of declarations) int(x), y, new int; //meaning: (int(x)), (y), (new int)); (a comma-separated expression) The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one.Spoilfive
Well, in that context ∞ means "arbitrarily many" because the lookahead will always be bounded by the input length.Termination
I'm quite puzzled by the citation extracted from a PhD Thesis. If there is an ambiguïty, then, by definition, NO lookahead may ever "resolve" the ambiguity (i.e. decide which parse is the correct oen, as at least 2 parses are considered correct by the grammar). Moreover, the citation mentions the ambiguity of C but the explanation, does not show an ambiguity, but only a non ambiguous example where parsing decision can only be taken after an arbitrary long look-ahead.Scrabble
D
249

LR parsers can't handle ambiguous grammar rules, by design. (Made the theory easier back in the 1970s when the ideas were being worked out).

C and C++ both allow the following statement:

x * y ;

It has two different parses:

  1. It can be the declaration of y, as pointer to type x
  2. It can be a multiply of x and y, throwing away the answer.

Now, you might think the latter is stupid and should be ignored. Most would agree with you; however, there are cases where it might have a side effect (e.g., if multiply is overloaded). but that isn't the point. The point is there are two different parses, and therefore a program can mean different things depending on how this should have been parsed.

The compiler must accept the appropriate one under the appropriate circumstances, and in the absence of any other information (e.g., knowledge of the type of x) must collect both in order to decide later what to do. Thus a grammar must allow this. And that makes the grammar ambiguous.

Thus pure LR parsing can't handle this. Nor can many other widely available parser generators, such as Antlr, JavaCC, YACC, or traditional Bison, or even PEG-style parsers, used in a "pure" way.

There are lots of more complicated cases (parsing template syntax requires arbitrary lookahead, whereas LALR(k) can look ahead at most k tokens), but only it only takes one counterexample to shoot down pure LR (or the others) parsing.

Most real C/C++ parsers handle this example by using some kind of deterministic parser with an extra hack: they intertwine parsing with symbol table collection... so that by the time "x" is encountered, the parser knows if x is a type or not, and can thus choose between the two potential parses. But a parser that does this isn't context free, and LR parsers (the pure ones, etc.) are (at best) context free.

One can cheat, and add per-rule reduction-time semantic checks in the to LR parsers to do this disambiguation. (This code often isn't simple). Most of the other parser types have some means to add semantic checks at various points in the parsing, that can be used to do this.

And if you cheat enough, you can make LR parsers work for C and C++. The GCC guys did for awhile, but gave it up for hand-coded parsing, I think because they wanted better error diagnostics.

There's another approach, though, which is nice and clean and parses C and C++ just fine without any symbol table hackery: GLR parsers. These are full context free parsers (having effectively infinite lookahead). GLR parsers simply accept both parses, producing a "tree" (actually a directed acyclic graph that is mostly tree like) that represents the ambiguous parse. A post-parsing pass can resolve the ambiguities.

We use this technique in the C and C++ front ends for our DMS Software Reengineering Tookit (as of June 2017 these handle full C++17 in MS and GNU dialects). They have been used to process millions of lines of large C and C++ systems, with complete, precise parses producing ASTs with complete details of the source code. (See the AST for C++'s most vexing parse.)

Doralynn answered 17/6, 2009 at 2:1 Comment(20)
While the 'x * y' example is interesting, the same can happen in C ('y' can be a typedef or a variable). But C can be parsed by a LR(1) parser, so what's the difference with C++?Trawl
My answser had already observed that C had the same problem, I think you missed that. No, it can't be parsed by LR(1), for the same reason. Er, what do you mean 'y' can be a typedef? Perhaps you meant 'x'? That doesn't change anything.Doralynn
Parse 2 is not necessarily stupid in C++, as * could be overridden to have side effects.Protease
In the provided example, it doesn't have any side effects. You are right, there's variations on this theme (say, overloaded "") which could have side effects. As stated, people *might think it was stupid. I've adjusted the text.Doralynn
I would count overloading a stateless operator to include side effects as a dumb use (flame bait but I'll elaborate). I always had that problem with operator overloading: none of the properties of an operator that would make it a reasonable analogy to the original use are guaranteed. For example, the comparison operators < > <= >= absolutely don't have to work as a partial order.Cosecant
@altie: agreed, when you have overloading you want something like the Liskov substitution principle, which says the key properties of the overloaded operator are preserved. Alas, none of the OO langauges I know about (except Liskov's CLU) insist on that. So, you can object that the overload in C++ is dumb, but it is legal.Doralynn
I looked at x * y and chuckled - it's amazing how little one thinks of nifty little ambiguities like this.Trattoria
The question is about the difference between C and C++ - C can be parsed by LALR(1) with the lexer hack (which you describe), while it seems (from other answers) that even that is impossible for C++.Spoilfive
@Blaisorblade: What people do is get some parser technology that works for most cases, and then hack it handle other cases, either by accepting "too much" and eliminating that later, or by bending the parsing machinery (lexer hack as one example, ad hoc look-ahead, etc.) So I think you can do it; its just awful.Doralynn
@IraBaxter: indeed, apparently the PhD thesis cited by Rob Walker, about Fog, implements a parser for C++ (in the middle of implementing some other goal) consisting of a LALR(1) superset grammar and defers significant processing to resolve ambiguities to subsequent stages. So in a way you're right - although this is much uglier than what's needed for C. In another sense, though, not just C but many other programming languages have some context-sensitive semantic processing after parsing, like for instance identifier resolution and type-checking (I don't know about scripting languages).Spoilfive
@Blaisorblad: you don't have a lot of choice. If your parser accepts "too little", you probably reject real programs. You can't make a parser accept exactly a valid string unless your parser is Turing capable. So, you build a context free parser that accepts too much, and hack away the excess that it accepts.Doralynn
@altie Surely no one would overload a bit-shift operator to make it write most variable types to a stream, right?Downstage
@Troy: How is that different than Java overloading the add operator "+" to concenate two strings?Doralynn
@IraBaxter There are similarities, although it seems like concatenation is a logical interpretation of "adding" when applied to strings, much more than bit-shifting an output stream. If I asked a non-programmer to "add 'abc' to 'def'", I feel I have a decent chance of getting 'defabc' as the answer. If I ask them to "shift the output by 'xyzzy' bits", I feel a very confused expression is about the best I can hope for.Downstage
@Troy: yes, I kind of agree about the intuition (although if you ask most non-C/C++ programmers what "<<" does, you won't get an answer). But in either case, the semantics of the different interpretations are simply... different. So its case of what arbitrary semantics, do you want to attach to what arbitrary syntax?Doralynn
(x)-x is also an example of ambiguity, where (x) may be interpreted as an expression or as a type casting. Moreover, this expression may be used in any meaningful context, avoiding the discussion about overloading *.Scrabble
Grasshopper: you are swimming against the current. You can argue that C's grammar is unnecessarily complex; if you follow that argument to the limit you can always define your language as S-expressions as LISP does, and then parsers are basically trivial. If you think that LISP S-expressions are clumsy, then you will design a language with lots of idiomatic forms. Over time, people graft extra stuff in. ....Doralynn
... The modern version of pretty much every language I'm familiar with, has things that make recursive descent parsing harder than you'd like. If you use the right parser generator, you stop caring because these variations are all handled by the parser generator without fuss. (Seriously, check out GLL or GLR Parsing). Regarding C, you'll find parsing it harder than you think due to the need to implement an accurate preprocessor, and (if you insist on recursive descent), the need to track types during parsing. That latter for me tangles parsing and symbol tables, and that's just stupid.Doralynn
If I understand things correctly, modern C++ allowing a type definition typedef int x; as part of a class definition before or after some method of said class containing x * y;, would ostensibly make the program unparsable for LR parsers with insufficient look-ahead, specifically even those that would employ a "hack" utilizing symbol collection. The problem lies in that they'd simply have to defer determining nature of x, until a later typedef, if there's one at all. Solving that would make for a far more complicated hack, wouldn't it?Locomobile
@ArmenMichaeli: (Yes). If you insist your parser needs type information for identifiers, and your language allows identifer use earlier in the source, with type declaration later, then LALR parsing will be a complete disaster precisely because it is left to right parsing. It will encounter the identifier without type information... now what does it do to proceed? You can hack the LALR parser to backtrack, and then perhaps you can assume a type on first encounter and verify it later. Even that will fail if an identifier use might be of many types because of too much backtracking.Doralynn
T
99

There is an interesting thread on Lambda the Ultimate that discusses the LALR grammar for C++.

It includes a link to a PhD thesis that includes a discussion of C++ parsing, which states that:

"C++ grammar is ambiguous, context-dependent and potentially requires infinite lookahead to resolve some ambiguities".

It goes on to give a number of examples (see page 147 of the pdf).

The example is:

int(x), y, *const z;

meaning

int x;
int y;
int *const z;

Compare to:

int(x), y, new int;

meaning

(int(x)), (y), (new int));

(a comma-separated expression).

The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one.

Tacnaarica answered 28/10, 2008 at 14:1 Comment(4)
It'd be cool to have some summary about the page 147 on this page. I'm going to read that page though. (+1)Lyford
The example is: int(x), y, *const z; //meaning: int x; int y; int *const z; (a sequence of declarations) int(x), y, new int; //meaning: (int(x)), (y), (new int)); (a comma-separated expression) The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one.Spoilfive
Well, in that context ∞ means "arbitrarily many" because the lookahead will always be bounded by the input length.Termination
I'm quite puzzled by the citation extracted from a PhD Thesis. If there is an ambiguïty, then, by definition, NO lookahead may ever "resolve" the ambiguity (i.e. decide which parse is the correct oen, as at least 2 parses are considered correct by the grammar). Moreover, the citation mentions the ambiguity of C but the explanation, does not show an ambiguity, but only a non ambiguous example where parsing decision can only be taken after an arbitrary long look-ahead.Scrabble
T
15

The problem is never defined like this, whereas it should be interesting :

what is the smallest set of modifications to C++ grammar that would be necessary so that this new grammar could be perfectly parsed by a "non-context-free" yacc parser ? (making use only of one 'hack' : the typename/identifier disambiguation, the parser informing the lexer of every typedef/class/struct)

I see a few ones:

  1. Type Type; is forbidden. An identifier declared as a typename cannot become a non-typename identifier (note that struct Type Type is not ambiguous and could be still allowed).

    There are 3 types of names tokens :

    • types : builtin-type or because of a typedef/class/struct
    • template-functions
    • identifiers : functions/methods and variables/objects

    Considering template-functions as different tokens solves the func< ambiguity. If func is a template-function name, then < must be the beginning of a template parameter list, otherwise func is a function pointer and < is the comparison operator.

  2. Type a(2); is an object instantiation. Type a(); and Type a(int) are function prototypes.

  3. int (k); is completely forbidden, should be written int k;

  4. typedef int func_type(); and typedef int (func_type)(); are forbidden.

    A function typedef must be a function pointer typedef : typedef int (*func_ptr_type)();

  5. template recursion is limited to 1024, otherwise an increased maximum could be passed as an option to the compiler.

  6. int a,b,c[9],*d,(*f)(), (*g)()[9], h(char); could be forbidden too, replaced by int a,b,c[9],*d; int (*f)();

    int (*g)()[9];

    int h(char);

    one line per function prototype or function pointer declaration.

    An highly preferred alternative would be to change the awful function pointer syntax,

    int (MyClass::*MethodPtr)(char*);

    being resyntaxed as:

    int (MyClass::*)(char*) MethodPtr;

    this being coherent with the cast operator (int (MyClass::*)(char*))

  7. typedef int type, *type_ptr; could be forbidden too : one line per typedef. Thus it would become

    typedef int type;

    typedef int *type_ptr;

  8. sizeof int, sizeof char, sizeof long long and co. could be declared in each source file. Thus, each source file making use of the type int should begin with

    #type int : signed_integer(4)

    and unsigned_integer(4) would be forbidden outside of that #type directive this would be a big step into the stupid sizeof int ambiguity present in so many C++ headers

The compiler implementing the resyntaxed C++ would, if encountering a C++ source making use of ambiguous syntax, move source.cpp too an ambiguous_syntax folder, and would create automatically an unambiguous translated source.cpp before compiling it.

Please add your ambiguous C++ syntaxes if you know some!

Twomey answered 13/2, 2013 at 11:37 Comment(3)
C++ is too well entrenched. Nobody will do this in practice. Those folks (like us) that build front ends simply bite the bullet and do the engineering to make the parsers work. And, as long as templates exist in the language, you're not going to get a pure context-free parser.Doralynn
@IraBaxter, and then along came good ol' Herb with cppfront. He even states he wants a CFG for C++.Miquelmiquela
@SO_fix_the_vote_sorting_bug: So, 10 years after my last remark, somebody tries to build such a tool. Herb is very smart guy, and he has built himself a toy (quote: "hilariously incomplete"). When it is production quality and widely accepted, we can re-discuss whether it will take over from standard C++.Doralynn
C
9

As you can see in my answer here, C++ contains syntax that cannot be deterministically parsed by an LL or LR parser due to the type resolution stage (typically post-parsing) changing the order of operations, and therefore the fundamental shape of the AST (typically expected to be provided by a first-stage parse).

Chopine answered 2/9, 2009 at 2:5 Comment(5)
Parsing technology that handles ambiguity simply produces both AST variants as they parse, and simply eliminate the incorrect one depending on the type information.Doralynn
@Ira: Yes, that is correct. The particular advantage of that is it allows you maintain the separation of the first-stage parse. While it's most commonly known in the GLR parser, there's no particular reason I see that you couldn't hit C++ with a "GLL?" parser as well.Chopine
"GLL"? Well, sure, but you'll have to go figure out the theory and write up a paper for the rest of to use. More likely, you could use a top down hand coded parser, or a backtracking LALR() parser (but keep the "rejected") parses, or run an Earley parser. GLR has the advantage of being a pretty damn good solution, is well documented and by now well proved. A GLL technology would have to have some pretty significant advantages to display GLR.Doralynn
The Rascal project (Netherlands) is claiming they are building a scannerless GLL parser. Work in progress, might be hard to find any online information. en.wikipedia.org/wiki/RascalMPLDoralynn
@IraBaxter There seems to be new developments on GLL: see this 2010 paper about GLL dotat.at/tmp/gll.pdfDecelerate
E
6

I think you are pretty close to the answer.

LR(1) means that parsing from left to right needs only one token to look-ahead for the context, whereas LR(∞) means an infinite look-ahead. That is, the parser would have to know everything that was coming in order to figure out where it is now.

Ec answered 28/10, 2008 at 14:5 Comment(6)
I recall from my compilers class that LR(n) for n > 0 is mathematically reducable to LR(1). Is that not true for n = infinity?Demogorgon
No, there's an impassable mountain of a difference between n and infinity.Szymanowski
Isn't the answer: Yes, given an infinite amount of time? :)Sikes
Actually, by my vague recollection of how LR(n) -> LR(1) takes place, it involves creating new intermediate states, so the runtime is some non-constant function of 'n'. Translating LR(inf) -> LR(1) would take infinite time.Quadriceps
"Isn't the answer: Yes, given an infinite amount of time?" -- No: the phrase 'given an infinite amount of time' is just a non-sensical, short-hand way of saying "cannot be done given any finite amount of time". When you see "infinite", think: "not any finite".Kodak
Infinite lookahead may be necessary, but it isn't sufficient. See my answer on the necessity to use context to resolve the choice.Doralynn
U
6

The "typedef" problem in C++ can be parsed with an LALR(1) parser that builds a symbol table while parsing (not a pure LALR parser). The "template" problem probably cannot be solved with this method. The advantage of this kind of LALR(1) parser is that the grammar (shown below) is an LALR(1) grammar (no ambiguity).

/* C Typedef Solution. */

/* Terminal Declarations. */

   <identifier> => lookup();  /* Symbol table lookup. */

/* Rules. */

   Goal        -> [Declaration]... <eof>               +> goal_

   Declaration -> Type... VarList ';'                  +> decl_
               -> typedef Type... TypeVarList ';'      +> typedecl_

   VarList     -> Var /','...     
   TypeVarList -> TypeVar /','...

   Var         -> [Ptr]... Identifier 
   TypeVar     -> [Ptr]... TypeIdentifier                               

   Identifier     -> <identifier>       +> identifier_(1)      
   TypeIdentifier -> <identifier>      =+> typedefidentifier_(1,{typedef})

// The above line will assign {typedef} to the <identifier>,  
// because {typedef} is the second argument of the action typeidentifier_(). 
// This handles the context-sensitive feature of the C++ language.

   Ptr          -> '*'                  +> ptr_

   Type         -> char                 +> type_(1)
                -> int                  +> type_(1)
                -> short                +> type_(1)
                -> unsigned             +> type_(1)
                -> {typedef}            +> type_(1)

/* End Of Grammar. */

The following input can be parsed without a problem:

 typedef int x;
 x * y;

 typedef unsigned int uint, *uintptr;
 uint    a, b, c;
 uintptr p, q, r;

The LRSTAR parser generator reads the above grammar notation and generates a parser that handles the "typedef" problem without ambiguity in the parse tree or AST. (Disclosure: I am the guy who created LRSTAR.)

Unfurl answered 17/8, 2018 at 14:59 Comment(3)
That's the standard hack used by GCC with its former LR parser to handle the ambiguity of things like "x*y;" Alas, there's still the arbitrarily large lookahead requirement to parse other constructs, so LR(k) fails to be a solution any fixed k. (GCC switched to recursive descent with more ad hockery).Doralynn
LRSTAR is at sourceforge.net/projects/lrstarLaktasic
"LRSTAR is at sourceforge.net/projects/lrstar" seems to be a link that is no longer valid !Discredit

© 2022 - 2024 — McMap. All rights reserved.