Why is bottom-up parsing more common than top-down parsing?

It seems that recursive-descent parsers are not only the simplest to explain, but also the simplest to design and maintain. They aren't limited to LALR(1) grammars, and the code itself can be understood by mere mortals. In contrast, bottom-up parsers have limits on the grammars they are able to recognize, and need to be generated by special tools (because the tables that drive them are next-to-impossible to generate by hand).
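For instance, a complete expression parser fits in a screenful of ordinary code. Here is a minimal sketch (a toy grammar of my own, not taken from any real compiler), with one function per grammar rule:

    # Toy grammar (hypothetical):
    #   expr   -> term (('+' | '-') term)*
    #   term   -> factor (('*' | '/') factor)*
    #   factor -> NUMBER | '(' expr ')'
    class Parser:
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def eat(self, expected):
            if self.peek() != expected:
                raise SyntaxError(f"expected {expected!r}, got {self.peek()!r}")
            self.pos += 1

        def expr(self):
            value = self.term()
            while self.peek() in ('+', '-'):
                op = self.peek()
                self.eat(op)
                value = value + self.term() if op == '+' else value - self.term()
            return value

        def term(self):
            value = self.factor()
            while self.peek() in ('*', '/'):
                op = self.peek()
                self.eat(op)
                value = value * self.factor() if op == '*' else value / self.factor()
            return value

        def factor(self):
            tok = self.peek()
            if tok == '(':
                self.eat('(')
                value = self.expr()
                self.eat(')')
                return value
            if isinstance(tok, (int, float)):
                self.pos += 1
                return tok
            raise SyntaxError(f"unexpected token {tok!r}")

    print(Parser([2, '+', 3, '*', 4]).expr())   # 14: precedence falls out of the call structure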

Why, then, is bottom-up (i.e. shift-reduce) parsing more common than top-down (i.e. recursive-descent) parsing?

Stellate asked 30/11, 2010 at 17:5 Comment(2)
Pure recursive descent is roughly the same as LL(1), which is strictly less powerful than LALR(1). You can build a recursive descent parser and cheat with lookahead hacks, context checks (as was done by GCC's original C++ parser), etc., to handle more complicated languages. But if you are going to cheat, you can cheat with LALR parsers too. So it gets hard to compare techniques when arbitrary cheating is allowed. What you can say is cheating is easier to hack into a hand-coded parser. But, better not to cheat, and simply use a very strong parsing engine. See my answer.Bromism
It's much more difficult for a bottom-up parser to "cheat" than for a top-down one: "And bottom-up parsing was very unfriendly to custom hacks, which made every shortcoming loom large. It is much harder to work around a problem in a bottom-up parser than it is to deal with a similar shortcoming in a top-down parser."Cia

If you choose a powerful parser generator, you can code your grammar without worrying about peculiar properties. (LA)LR means you don't have to worry about left recursion, one less headache. GLR means you don't have to worry about local ambiguity or lookahead.
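As a hedged illustration of the left-recursion headache (toy grammar and names mine): an LR parser consumes the rule exactly as written, while a naive recursive-descent transcription recurses before consuming any token, so top-down code must rewrite the rule as iteration:

    # Left-recursive rule (fine for LR, fatal for naive recursive descent):
    #   expr -> expr '-' NUM | NUM
    # A literal top-down transcription, def expr(): left = expr(); ...,
    # loops forever. The usual LL fix: expr -> NUM ('-' NUM)*
    def expr(tokens):
        it = iter(tokens)
        value = next(it)               # first NUM
        for tok in it:
            if tok != '-':
                raise SyntaxError(f'unexpected token {tok!r}')
            value -= next(it)          # left-associative, as the LR rule gives for free
        return value

    print(expr([10, '-', 3, '-', 2]))  # 5, i.e. (10 - 3) - 2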

And the bottom-up parsers tend to be pretty efficient. So, once you've paid the price of a bit of complicated machinery, it is easier to write grammars and the parsers perform well.

You should expect to see this kind of choice wherever there is some programming construct that commonly occurs: if it is easier to specify, and it performs pretty well, even if the machinery is complicated, complex machinery will win. As another example, the database world has gone to relational tools, in spite of the fact that you can hand-build an indexed file yourself. It's easier to write the data schemas, it's easier to specify the indexes, and with complicated enough machinery behind the scenes (you don't have to look at the gears, you just use them), they can be pretty fast with almost no effort. Same reasons.

Bromism answered 30/11, 2010 at 18:36 Comment(5)
Well, you don't have to worry about left recursion, you just have to worry about right recursion instead. Otherwise +1 :)Stellate
Why do you have to worry about right recursion? The bottom-up generators handle that just fine.Bromism
At least in lex/yacc or flex/bison land, right recursive rules blow out the stack used in shift-reduce parsing. Sure, it can still recognize the grammar, but it will perform badly. (At least according to this). Perhaps other parser generators handle this better but I've not used any other such tools.Stellate
They only blow out the stack if the list is extremely long, which is pretty rare in practice. A recursive descent parser will have exactly the same problem with very long recursive constructs. Some would say this isn't a defect of the parser, but rather a defect of using a fixed-size execution stack, as offered by most programming languages on Windows or Linux platforms. (Some of us use languages with heap-allocated activation records; you "blow out the stack" only when you run out of virtual memory).Bromism
Sorry, I used "blow out the stack" incorrectly -- I basically meant consume lots of stack space. I believe Yacc and Bison both use heap-allocated stacks for shift-reduce parsing anyway (though I could be wrong).Stellate
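For what it's worth, the stack-depth issue is easy to reproduce in miniature. A hedged sketch (toy right-recursive list rule, names mine): recursion depth grows with list length, just as right recursion piles symbols onto a shift-reduce parser's stack before the first reduction can fire.

    def items(tokens, i=0):
        # Grammar (hypothetical): items -> NUM items | NUM   (right recursive)
        # One stack frame per element, so a very long list overflows
        # CPython's default recursion limit.
        if i + 1 < len(tokens):
            return [tokens[i]] + items(tokens, i + 1)
        return [tokens[i]]

    print(items([1, 2, 3]))        # [1, 2, 3] -- fine
    # items(list(range(10**6)))    # RecursionError on a default CPython stack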

It stems from a couple different things.

BNF (and the theory of grammars and such) comes from computational linguistics: folks researching natural-language parsing. BNF is a very attractive way of describing a grammar, and so it's natural to want to consume this notation to produce a parser.

Unfortunately, top-down parsing techniques tend to fall over when applied to such notations, because they cannot handle many common cases (e.g., left recursion). This leaves you with the LR family, which performs well and can handle those grammars, and since the parsers are produced by a machine, who cares what the code looks like?

You're right, though: top-down parsers work more "intuitively," so they're easier to debug and maintain, and once you have a little practice they're just as easy to write as those generated by tools. (Especially when you get into shift/reduce conflict hell.) Many of the answers talk about parsing performance, but in practice top-down parsers can often be optimized to be as fast as machine-generated parsers.

That's why many production compilers use hand-written lexers and parsers.

Hadwin answered 30/11, 2010 at 22:8 Comment(6)
What cases other than left recursion are there? (I'm not aware of any...)Stellate
It doesn't come up in programming languages, but a production like "P = ('x' P 'x') | 'x'" is problematic for all but the most robust CFG machines, even though it is completely unambiguous.Hadwin
I don't see how. That's parsable with no problem using recursive descent. I.e.: P : if (next item == 'x') { if (P()) then { do first branch} else {do second branch} return true;} else { return false; } It's even simpler if you're allowed more tokens of lookahead. (To be fair, I think most LALR(1) parser generators take a long time to work on such a production too). Most importantly though is that nobody would ever write a grammar like that when they could write P: x+ instead.Stellate
Actually, that grammar is just not in LR(k) or LL(k) for any k. The problem is that you don't know whether to execute the left or the right at any point, because you don't know whether you've reached the middle of the string. (Try actually running your algorithm on some test input to get a feel for the problem: "xxxxx" should parse and "xxxxxx" should not.) Of course, when you want to say things like "nobody would ever write a grammar like that" then you're right; practically, most programming languages fit well into LR(k), which is why bottom-up parsing is so popular, as you asked.Hadwin
I'm afraid I don't understand. LL(k) parsers cannot recognize that, but recursive descent parsers can without a problem. Which again brings me to my question of why the bottom-up method is more popular. Just because it's not often that you need that kind of power doesn't mean that it's not useful. (Besides, it's easy for a recursive descent parser to simply look at that, realize one can count xs and assert that the number of Xs is odd, and do the whole thing simply and in linear time)Stellate
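To make the preceding exchange concrete, here is a hedged sketch (names and structure mine) of the kind of backtracking-style recognizer the comment above describes: it tracks every position where P could end, so it accepts P -> ('x' P 'x') | 'x' even though no fixed lookahead suffices, and memoization keeps it polynomial.

    def parse_P(s, i, memo=None):
        # Returns the set of end positions j such that s[i:j] derives P,
        # for the grammar P -> 'x' P 'x' | 'x'.
        if memo is None:
            memo = {}
        if i in memo:
            return memo[i]
        ends = set()
        if i < len(s) and s[i] == 'x':
            ends.add(i + 1)                      # P -> 'x'
            for j in parse_P(s, i + 1, memo):    # P -> 'x' P 'x'
                if j < len(s) and s[j] == 'x':
                    ends.add(j + 1)
        memo[i] = ends
        return ends

    def accepts(s):
        return len(s) in parse_P(s, 0)

    print(accepts('xxxxx'), accepts('xxxxxx'))   # True False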
I think bottom up parsers are more intuitive because top down is like you are trying to create the input, when it is already provided, but bottom up is like you have the input and you just need to figure out what parts are what nonterminals.Tocsin

Recursive-descent parsers try to hypothesize the general structure of the input string, which can mean a lot of trial and error (backtracking) before the end of the string is reached. This makes them less efficient than bottom-up parsers, which need no such inference engines.

The performance difference becomes magnified as the complexity of the grammar increases.
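For contrast with the recursive-descent style, here is a hedged sketch of the shift-reduce loop itself, with hand-rolled handle checks for a toy grammar (a real LALR parser drives the same loop from generated ACTION/GOTO tables). Every step is either a shift or a reduce; nothing is guessed and later undone:

    def shift_reduce(tokens):
        # Toy grammar (hypothetical): E -> E '+' E | NUM;
        # eager reduction makes '+' left-associative.
        stack, buf = [], list(tokens) + ['$']
        while True:
            if stack and isinstance(stack[-1], (int, float)):
                stack[-1] = ('E', stack[-1])                   # reduce: E -> NUM
            elif (len(stack) >= 3 and stack[-2] == '+'
                  and stack[-1][0] == 'E' and stack[-3][0] == 'E'):
                rhs = stack.pop(); stack.pop(); lhs = stack.pop()
                stack.append(('E', lhs[1] + rhs[1]))           # reduce: E -> E '+' E
            elif buf[0] == '$':
                break                                          # nothing left to shift
            else:
                stack.append(buf.pop(0))                       # shift
        if len(stack) == 1 and stack[0][0] == 'E':
            return stack[0][1]
        raise SyntaxError('input does not match the grammar')

    print(shift_reduce([2, '+', 3, '+', 4]))   # 9, reduced left to right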

Clydeclydebank answered 30/11, 2010 at 17:25 Comment(1)
+1 -- never considered that constant factors for recursive descent might be poor.Stellate

To add to the other answers, it is important to realize that besides efficiency, bottom-up parsers can accept significantly more grammars than recursive-descent parsers can. Top-down parsers (whether predictive or not) typically use a single token of lookahead and fail if the current token and whatever immediately follows it can be derived by two different rules. Of course, you could give the parser more lookahead (e.g. LL(3)), but how far are you willing to push that before it becomes as complex as a bottom-up parser? Bottom-up parsers (LALR specifically), on the other hand, use FIRST and FOLLOW sets in their construction and can handle cases where top-down parsers can't.

Of course, computer science is about trade-offs. If your grammar is simple enough, it makes sense to write a top-down parser. If it's complex (such as the grammars of most programming languages), then you might have to use a bottom-up parser to accept the input successfully.
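As an illustration of the "firsts" bookkeeping mentioned above, here is a hedged sketch of the standard FIRST-set fixpoint computation (the grammar encoding is my own; both LL and LR generators compute these sets when building their tables):

    EPSILON = ''

    def first_sets(grammar):
        # grammar: dict mapping nonterminal -> list of productions,
        # each production a tuple of symbols; () is an empty production.
        first = {nt: set() for nt in grammar}
        changed = True
        while changed:                            # iterate to a fixpoint
            changed = False
            for nt, productions in grammar.items():
                for prod in productions:
                    before = len(first[nt])
                    for sym in prod:
                        if sym in grammar:                    # nonterminal
                            first[nt] |= first[sym] - {EPSILON}
                            if EPSILON not in first[sym]:
                                break
                        else:                                 # terminal
                            first[nt].add(sym)
                            break
                    else:                # every symbol could derive empty
                        first[nt].add(EPSILON)
                    if len(first[nt]) != before:
                        changed = True
        return first

    # FIRST(E) == {'n', '('} for this toy grammar:
    print(first_sets({'E': [('T', 'Eprime')],
                      'Eprime': [('+', 'T', 'Eprime'), ()],
                      'T': [('n',), ('(', 'E', ')')]}))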

Southport answered 8/8, 2014 at 18:48 Comment(6)
Recursive-descent parsers can have arbitrary lookahead -- case in point, C++ requires arbitrary lookahead and is parsed by GCC and Clang -- both recursive-descent based systems -- just fine. LALR is much less capable than other parser technologies in the general case.Stellate
Yes, I did mention that they can have any number of lookaheads, but that just adds to their complexity. And how is LALR(1) LESS capable than, say, LL(1)??Southport
LL is allowed to backtrack, LALR is not. en.wikipedia.org/wiki/LALR_parser See "Relation to other parsers" at the bottom of the articleStellate
Yes, top-down parsers backtrack because they have to "make a guess" if there's more than one rule that matches the token. Bottom-up parsers on the other hand, don't need this since at each stage, the parser can either reduce, or if that's not possible, shift. There's no guess or trial and error. And that's exactly why they are also more efficient.Southport
Except there are grammars that cannot be matched even with arbitrary look ahead without backtracking. That's why bison added GLR support, for example.Stellate
the GCC C++ parser is recursive descent. I don't think complex language => bottom-up.Talca

I have two guesses, though I doubt that either fully explains it:

  1. Top-down parsing can be slow. Recursive-descent parsers that backtrack may require exponential time to complete their job (see the sketch after this list). That would put severe limitations on the scalability of a compiler which uses a top-down parser.

  2. Better tools. If you can express the language in some variant of EBNF, then chances are you can Lex/Yacc your way past a whole lot of tedious code. There don't seem to be as many tools to help automate the task of putting together a top-down parser. And let's face it, grinding out parser code just isn't the fun part of toying with languages.
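To illustrate point 1, here is a hedged sketch (toy grammar and names mine) of a naive backtracking parser that recomputes shared subproblems and goes exponential, and of how caching results by input position (packrat-style memoization) tames it:

    def parse(s, i, calls, memo=None):
        # Toy grammar (hypothetical): S -> 'a' S 'a' | 'a' S | 'a'
        # Returns the set of positions where S can end, starting at i.
        if memo is not None and i in memo:
            return memo[i]
        calls[0] += 1
        ends = set()
        if i < len(s) and s[i] == 'a':
            ends.add(i + 1)                            # S -> 'a'
            ends |= parse(s, i + 1, calls, memo)       # S -> 'a' S
            for j in parse(s, i + 1, calls, memo):     # S -> 'a' S 'a' (re-parses!)
                if j < len(s) and s[j] == 'a':
                    ends.add(j + 1)
        if memo is not None:
            memo[i] = ends
        return ends

    naive, packrat = [0], [0]
    parse('a' * 20, 0, naive)            # ~2 million calls: exponential blowup
    parse('a' * 20, 0, packrat, {})      # 21 calls: linear with memoization
    print(naive[0], packrat[0])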

Interne answered 30/11, 2010 at 17:22 Comment(4)
Well, they're not going to require exponential time on the kinds of grammars that an LALR(1) parser could support. Sure, they could take longer theoretical time, but that's the result of being more powerful. +1 for the second point... didn't think of that.Stellate
Fair point - but if you're sticking with a LALR(1) grammar, why not just stay on the well-beaten path and use a LALR parser as well?Interne
Concerning #2: There are a few parser generators that output recursive descent parsers (never used any, just referring to wiki) - e.g. ANTLR.Dogtired
Often because I don't want to have to deal with having to run a parser generator all the time. (And because most parser generators cannot build parsers that correctly handle Unicode)Stellate

I have never seen a real comparison between top-down and shift-reduce parsers:

just two small programs run side by side, one using the top-down approach and the other the bottom-up approach, each about ~200 lines of code,

each able to parse any kind of custom binary operator and mathematical expression, the two sharing the same grammar declaration format, and then perhaps adding variable declarations and assignments to show how 'hacks' (non-context-free constructs) can be implemented.

So, how can we speak honestly about something we have never done: rigorously comparing the two approaches?

Initiatory answered 12/11, 2015 at 5:40 Comment(0)
