A Recursive Descent Parser should be faster
...or you're doing something wrong.
First off, your code should be broken into 2 distinct steps. Lexer + Parser.
Some reference examples online will first tokenize the entire syntax first into a large intermediate data structure, then pass that along to the parser. While good for demonstration; don't do this, it doubles time and memory complexity. Instead, as soon as a match is determined by the lexer, notify the parser of either a state transition or state transition + data.
As for the lexer. This is probably where you'll find your current bottleneck. If the lexer is cleanly separated from your parser, you can try wapping between Regex and non-Regex implementations to compare performance.
Regex isn't, by any means, faster than reading raw strings. It just avoids some common mistakes by default. Specifically, the unnecessary creation of string objects. Ideally, your lexer should scan your code and produce an output with zero intermediate data except the bare minimum required to track state within your parser. Memory-wise you should have:
- Raw input (ie source)
- Parser state (ex isExpression, isSatement, row, col)
- Data (Ex AST, Tree, 2D Array, etc).
For instance, if your current lexer matches a non-terminal and copies every char over one-by one until it reaches the next terminal; you're essentially recreating that string for every letter matched. Keep in mind that string data types are immutable, concat will always create a new string. You should be scanning the text using pointer arithmetic or some equivalent.
To fix this problem, you need to scan from the startPos of the non-terminal to the end of the non-terminal and copy only when a match is complete.
Regex supports all of this by default out of the box, which is why it's a preferred tool for writing lexers. Instead of trying to write a Regex that parses your entire grammer, write one that only focuses on matching terminals & non-terminals as capture groups. Skip tokenization, and pass the results directly into your parser/state machine.
The key here is, don't try to use Regex as a state machine. At best it will only work for Regular (ie Chomsky Type III, no stack) declarative syntaxes -- hence the name Regular Expression. For example, HTML is a Context-Free (ie Chomsky Type II, stack based) declarative syntax, which is why a Rexeg alone is never enough to parse it. Your grammar, and generally all templating syntaxes, fall into this category. You've clearly hit the limit of Regex already, so you're on the right track.
Use Regex for tokenization only. If you're really concerned with performance, rewrite your lexer to eliminate any and all unnecessary string copying and/or intermediate data. See if you can outperform the Regex version.
The key being. The Regex version is easier to understand and maintain, whereas your hand-rolled lexer will likely be just a tinge faster if written correctly. Conventional wisdom says, do yourself a favor and prefer the former. In terms of Big-O complexity, there shouldn't be any difference between the two. They're two forms of the same thing.