Recursive Descent Parser for something simple?

I'm writing a parser for a templating language which compiles into JS (if that's relevant). I started out with a few simple regexes, which seemed to work, but regexes are very fragile, so I decided to write a parser instead. I started by writing a simple parser that remembered state by pushing/popping off of a stack, but things kept escalating until I had a recursive descent parser on my hands.

Soon after, I compared the performance of all my previous parsing methods. The recursive descent parser was by far the slowest. I'm stuck: Is it worth using a recursive descent parser for something simple, or am I justified in taking shortcuts? I would love to go the pure regex route, which is insanely fast (almost 3 times faster than the RD parser), but it's very hacky and, to a degree, unmaintainable. I suppose performance isn't terribly important because compiled templates are cached, but is a recursive descent parser the right tool for every task? I guess my question could be viewed as more of a philosophical one: to what degree is it worth sacrificing maintainability/flexibility for performance?

Unhallowed answered 3/4, 2011 at 19:35 Comment(2)
Take a look at the EJS source. That's a templating tool that seems to use a regex (presumably as a lexer) and a parser. The source is only about 500 lines, so it shouldn't be too hard to get to grips with how this can be done.Quadruplex
I'd be interested in helping figure out why your recursive descent parser is slow. With a good lexer (and that's pretty important ...) a recursive descent parser can be very fast. In a language like JavaScript, you can use constructor functions to represent the non-terminal symbols, and build the AST while parsing. It's an effective technique and I've written blazing fast parsers like that. Now making a good lexer in JavaScript might be hard ...Pemberton

Recursive descent parsers can be extremely fast.

These are usually organized with a lexer that uses regular expressions to recognize language tokens, which are then fed to the parser. Most of the work in processing the source text is done character by character by the lexer, using the insanely fast FSAs that the REs are often compiled into.

The parser sees tokens only occasionally compared to the rate at which the lexer sees characters, so its speed often doesn't matter. However, when comparing parser-to-parser speeds (ignoring the time required to lex the tokens), recursive descent parsers can be very fast, because they implement the parser stack with function calls, which are already very efficient compared to a general parser's push-current-state-onto-a-simulated-stack approach.

So, you can have your cake and eat it, too. Use regexps for the lexemes. Use the parser (any kind; recursive descent is just fine) to process the lexemes. You should be pleased with the performance.
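
To make that concrete, here is a minimal sketch in JavaScript. The grammar (toy arithmetic) and the token names are invented for illustration; they are not the asker's templating language:

```js
// Illustrative token rules. Each regex is anchored so it only matches at the
// start of the remaining input.
const RULES = [
  [/^\s+/, null],        // whitespace: matched and discarded
  [/^\d+/, 'NUM'],
  [/^[+*]/, 'OP'],
  [/^\(/, 'LPAREN'],
  [/^\)/, 'RPAREN'],
];

function lex(src) {
  const tokens = [];
  let rest = src;
  while (rest.length > 0) {
    let matched = false;
    for (const [re, type] of RULES) {
      const m = re.exec(rest);
      if (m) {
        if (type) tokens.push({ type, text: m[0] });
        rest = rest.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) throw new SyntaxError('Unexpected character: ' + rest[0]);
  }
  return tokens;
}

// Recursive descent over the token stream: one function per non-terminal,
// so the JS call stack serves as the parser stack.
function parse(tokens) {
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  function expr() {   // expr := term ('+' term)*
    let node = term();
    while (peek() && peek().text === '+') {
      next();
      node = { type: 'Add', left: node, right: term() };
    }
    return node;
  }

  function term() {   // term := factor ('*' factor)*
    let node = factor();
    while (peek() && peek().text === '*') {
      next();
      node = { type: 'Mul', left: node, right: factor() };
    }
    return node;
  }

  function factor() { // factor := NUM | '(' expr ')'
    const t = next();
    if (t.type === 'NUM') return { type: 'Num', value: Number(t.text) };
    if (t.type === 'LPAREN') {
      const node = expr();
      next();         // consume ')'
      return node;
    }
    throw new SyntaxError('Unexpected token: ' + t.text);
  }

  return expr();
}

console.log(JSON.stringify(parse(lex('1 + 2 * (3 + 4)'))));
```

Each non-terminal is just a function, so the JS call stack doubles as the parser stack, which is exactly why recursive descent stays cheap.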

This approach also satisfies the observation made by other answers: write it in a way that keeps it maintainable. Lexer/parser separation does this very nicely, I assure you.

Cathern answered 4/4, 2011 at 20:42 Comment(0)

Readability first, performance later...

So if your parser makes the code more readable, then it is the right tool.

Enjoyment answered 3/4, 2011 at 19:45 Comment(1)
A parser generator + EBNF is probably the most straightforward and readable way to create a parser. It's rarely used for 'production' parsers because the equivalent recursive descent parser is significantly more efficient. While your advice is sound for general-purpose programming, a parser's sole purpose is to consume and produce large quantities of data. Therefore, performance takes higher precedence than readability.Duiker

to what degree is it worth sacrificing maintainability/flexibility for performance?

I think it's very important to write clear, maintainable code as a first priority. Until profiling shows not only that this code is a bottleneck but also that your application's performance actually suffers from it, you should consider clear code to be the best code.

It's also important not to reinvent the wheel. The comment about taking a look at another parser is a very good one: existing projects often contain well-worn solutions for routines like this.

Recursion is very elegant when applied to a problem that suits it. In my own experience, slow code due to recursion is the exception, not the norm.

Unto answered 6/4, 2011 at 0:41 Comment(0)

A Recursive Descent Parser should be faster

...or you're doing something wrong.

First off, your code should be broken into two distinct stages: a lexer and a parser.

Some reference examples online first tokenize the entire input into a large intermediate data structure, then pass that along to the parser. While that's good for demonstration, don't do it: it roughly doubles the time and memory cost. Instead, as soon as a match is determined by the lexer, notify the parser of either a state transition or a state transition plus data.
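
A rough sketch of that streaming style (the token kinds and the callback shape here are invented, just to show the structure):

```js
// Instead of returning a token array, the lexer hands each token to the
// parser (here just a callback) the moment it is matched, so no intermediate
// token list is ever built.
function lex(src, onToken) {
  // One pass over the input; which capture group matched decides the kind.
  const tokenRe = /\s+|(\d+)|([+*()])/g;
  let m;
  while ((m = tokenRe.exec(src)) !== null) {
    if (m[1] !== undefined) onToken({ type: 'NUM', text: m[1] });
    else if (m[2] !== undefined) onToken({ type: 'PUNCT', text: m[2] });
    // whitespace: matched and consumed, never materialized as a token
  }
}

// Stand-in for the parser: it only records the token kinds it was fed.
const kinds = [];
lex('1 + 2 * (3 + 4)', tok => kinds.push(tok.type));
console.log(kinds.join(' ')); // NUM PUNCT NUM PUNCT PUNCT NUM PUNCT NUM PUNCT
```

Note that with a /g regex, a character that matches none of the alternatives is silently skipped rather than rejected, so a real lexer would want explicit error handling for that case.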

As for the lexer: this is probably where you'll find your current bottleneck. If the lexer is cleanly separated from your parser, you can try swapping between Regex and non-Regex implementations to compare performance.

Regex isn't, by any means, faster than reading raw strings. It just avoids some common mistakes by default, specifically the unnecessary creation of string objects. Ideally, your lexer should scan your code and produce an output with zero intermediate data except the bare minimum required to track state within your parser. Memory-wise you should have:

  • Raw input (i.e. the source text)
  • Parser state (e.g. isExpression, isStatement, row, col)
  • Output data (e.g. AST, tree, 2D array, etc.)

For instance, if your current lexer matches a non-terminal and copies every char over one by one until it reaches the next terminal, you're essentially recreating that string for every letter matched. Keep in mind that strings are immutable: concatenation always creates a new string. You should be scanning the text using pointer arithmetic or some equivalent.

To fix this problem, you need to scan from the startPos of the non-terminal to the end of the non-terminal and copy only when a match is complete.
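
For example (the function names and the identifier rule below are made up, purely to contrast the two styles):

```js
// Anti-pattern: building the lexeme one character at a time. Every `+=`
// allocates a fresh string, so an n-character identifier costs O(n^2) copying.
function readIdentSlow(src, start) {
  let ident = '';
  let i = start;
  while (i < src.length && /[A-Za-z0-9_]/.test(src[i])) {
    ident += src[i];   // new string object on every iteration
    i++;
  }
  return { ident, end: i };
}

// Better: only advance an index while scanning, and slice once at the end.
function readIdentFast(src, start) {
  let i = start;
  while (i < src.length && /[A-Za-z0-9_]/.test(src[i])) i++;
  return { ident: src.slice(start, i), end: i };  // single copy, on match completion
}

console.log(readIdentFast('userName}}rest', 0)); // { ident: 'userName', end: 8 }
```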

Regex supports all of this out of the box, which is why it's a preferred tool for writing lexers. Instead of trying to write a Regex that parses your entire grammar, write one that focuses only on matching terminals and non-terminals as capture groups. Skip tokenization, and pass the results directly into your parser/state machine.
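
Something along these lines, assuming a {{ expression }} delimiter syntax purely for the sake of the example (the question doesn't show the actual delimiters):

```js
// One sticky regex matches either an expression or a run of literal text,
// and each match is handed straight to a handler instead of being collected
// into a token list first.
const piece = /\{\{\s*([^}]*?)\s*\}\}|([^{]+|\{)/y;

function scan(template, handlers) {
  piece.lastIndex = 0;
  let m;
  while (piece.lastIndex < template.length && (m = piece.exec(template)) !== null) {
    if (m[1] !== undefined) handlers.expression(m[1]); // captured expression body
    else handlers.text(m[2]);                          // raw literal text
  }
}

// Usage: a trivial "parser" that just compiles to a JS string concatenation.
const parts = [];
scan('Hello {{ user.name }}, you have {{ count }} messages.', {
  text: t => parts.push(JSON.stringify(t)),
  expression: e => parts.push('(' + e + ')'),
});
console.log(parts.join(' + '));
// "Hello " + (user.name) + ", you have " + (count) + " messages."
```

The sticky y flag keeps lastIndex moving forward so each character is examined once, and because every position matches one of the alternatives, the loop always makes progress.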

The key here is: don't try to use Regex as a state machine. At best it will only work for regular (i.e. Chomsky Type III, no stack) declarative syntaxes -- hence the name regular expression. For example, HTML is a context-free (i.e. Chomsky Type II, stack-based) declarative syntax, which is why a Regex alone is never enough to parse it. Your grammar, and generally all templating syntaxes, fall into this category. You've clearly hit the limit of Regex already, so you're on the right track.

Use Regex for tokenization only. If you're really concerned with performance, rewrite your lexer to eliminate any and all unnecessary string copying and/or intermediate data. See if you can outperform the Regex version.

The key point: the Regex version is easier to understand and maintain, whereas your hand-rolled lexer will likely be just a tinge faster if written correctly. Conventional wisdom says do yourself a favor and prefer the former. In terms of Big-O complexity, there shouldn't be any difference between the two; they're two forms of the same thing.

Duiker answered 30/12, 2017 at 17:34 Comment(0)
