Why is my recursive descent parser right-associative
Asked Answered
C

1

5

I'm writing my own programming language, and I have the tokenizer (lexer) all done. But for parsing, I'm having trouble writing a recursive descent parser. It seems to be right associative, when it should be left, and I don’t know why. For example, it parses 1-2-3 as 1-(2-3), not the correct (1-2)-3.

I've trimmed off most of the other code leaving just what's reproducible:

using System.Collections.Generic;

namespace Phi
{
    public enum TokenType
    {
        Plus, // '+'
        Minus, // '-'
        IntegerLiteral,
    }

    public interface INode
    {
        // Commented out as they aren't relevant
        //NodeType GetNodeType();
        //void Print(string indent, bool last);
    }

    class Program
    {
        static void Main(string[] args)
        {
            List<Token> tokens = new List<Token>()
            {
                new Token(TokenType.IntegerLiteral, "1"),
                new Token(TokenType.Minus, ""),
                new Token(TokenType.IntegerLiteral, "2"),
                new Token(TokenType.Minus, ""),
                new Token(TokenType.IntegerLiteral, "3"),
            };

            int consumed = ParseAdditiveExpression(tokens, out INode root);
        }

        private static int ParseAdditiveExpression(List<Token> block, out INode node)
        {
            // <additiveExpr> ::= <multiplicativeExpr> <additiveExprPrime>
            int consumed = ParseMultiplicataveExpression(block, out INode left);
            consumed += ParseAdditiveExpressionPrime(GetListSubset(block, consumed), out INode right);

            if (block[1].Type == TokenType.Plus)
                node = (right == null) ? left : new AdditionNode(left, right);
            else
                node = (right == null) ? left : new SubtractionNode(left, right);
            return consumed;
        }
        private static int ParseAdditiveExpressionPrime(List<Token> block, out INode node)
        {
            // <additiveExprPrime> ::= "+" <multiplicataveExpr> <additiveExprPrime>
            //                     ::= "-" <multiplicativeExpr> <additiveExprPrime>
            //                     ::= epsilon
            node = null;
            if (block.Count == 0)
                return 0;
            if (block[0].Type != TokenType.Plus && block[0].Type != TokenType.Minus)
                return 0;

            int consumed = 1 + ParseMultiplicataveExpression(GetListSubset(block, 1), out INode left);
            consumed += ParseAdditiveExpressionPrime(GetListSubset(block, consumed), out INode right);

            if (block[0].Type == TokenType.Plus)
                node = (right == null) ? left : new AdditionNode(left, right);
            else
                node = (right == null) ? left : new SubtractionNode(left, right);
            return consumed;
        }

        private static int ParseMultiplicataveExpression(List<Token> block, out INode node)
        {
            // <multiplicativeExpr> ::= <castExpr> <multiplicativeExprPrime>
            // unimplemented; all blocks are `Count == 1` with an integer
            node = new IntegerLiteralNode(block[0].Value);
            return 1;
        }

        private static List<T> GetListSubset<T>(List<T> list, int start)
        {
            return list.GetRange(start, list.Count - start);
        }
    }
}

As for the definition of AdditionNode, SubtractionNode, and MultiplicationNode, they are all the same and only serve semantic purposes. For brevity, here's just AdditionNode:

namespace Phi
{
    public class AdditionNode : INode
    {
        public AdditionNode(INode left, INode right)
        {
            Left = left;
            Right = right;
        }

        public INode Left { get; }
        public INode Right { get; }

        // Print and GetNodeType have been removed as they aren't relevant
    }
}

As for my problem, when I run Phi.Program, as said at the beginning, it is parsing with the wrong associativity. Here is root after ParseAdditiveExpression completes:

enter image description here enter image description here enter image description here

As you can see, it groups the 2 with the 3, not the 1. Why is it doing this?

Chuu answered 13/6, 2018 at 21:42 Comment(16)
The question is confusing. You wrote the code so asking why it does something is because that's what you wrote it to do. If you didn't want it to be right associative then why did you write a parser that parses binary operators as right associative?Phonetician
@EricLippert The thing is is that I don't know why it's right associativeChuu
Solve the problem by solving a simpler problem. Here's a language { "1", "1+1", "1+1+1", "1+1+1+1", ...}. You already have the lexer. Now write me two parsers, one which parses to the right and one which parses to the left. Then you will know the difference between a parser that parses right and a parser that parses left.Phonetician
Another technique you can use is one of preconditions and postconditions. You have only four relevant methods. You believe something about what the state of the token consumption is before they are called and after they are called. Say what those beliefs are, and write Debug.Assert calls that detect whether those beliefs are false. One of your beliefs is wrong, and that's where the bug is. The assertion will tell you where it is.Phonetician
Briefly though: to parse to the right, you parse an expression by parsing a term -- the smallest possible thing that can be on the left side of an operator. Then you check to see if there is an operator. If there is not, then the term is the expression. If there is, then you parse the expression on the right of the operator, and the resulting expression is (the term on the left, the operator, the expression on the right). That's the algorithm for parsing a right-associative operator. Now, can you say what is the algorithm for parsing a left-associative operator?Phonetician
If you look at your debug the first entry in a Node is always LEFT. It is reversing the order of the terms so you are going from last to first.Mamiemamma
Put another way: T ::= 1, E ::= T | T + E is the grammar for a right-associative operator. T ::= 1, E ::= T | E + T is the grammar for a left-associative operator. Can you write parsers for both those grammars?Phonetician
@EricLippert the thing with E ::= E + T is it’s left recursive. From my understanding, recursive descent parsers don’t support left recursion. So I changed it to E ::= T E’ and E’ ::= + T E’Chuu
That's the correct grammar transformation. (Though don't forget E' ::= nil needs to be in there.) But now how are you going to build your desired parse tree from a parser for that grammar? You want a parse tree where there are expressions on the left of every + and terms on the right, but what you've got is an expression is a pair of (term, suffix), and a suffix is either nil, or a triple of (+, term, suffix). How are you going to transform that thing into the tree you want?Phonetician
Hint: do you know the shunting yard algorithm?Phonetician
@EricLippert yesChuu
@ColeJohnson: Have you implemented it correctly?Phonetician
@EricLippert yesChuu
@ColeJohnson: Then I still don't know what your question is. You're familiar with the algorithm for parsing left-associative binary operators, you say you've implemented it correctly, so what is your specific question? I'm happy to help people who are designing languages because that's an awesome thing to do, but help me help you here. What's your question?Phonetician
I’ve tried implementing the parser as the grammar says, but I’ve been pulling my hair out trying to make it parse correctly. I’m sure I’ve implemented the grammar wrong, but I can’t see whereChuu
The problem is in this line: node = (right == null) ? left : new AdditionNode(left, right); That is not at all the right way to turn an AdditivePrime into an AdditionaNode. Like I said in a comment above: you have to transform this into the thing you want, but you have not done so. I'll write up an answer with some good advice.Phonetician
P
23

As I noted in a comment, the problem is that you have confused the rightmost child of a binary operator with the rightmost child of an additiveprime. The rightmost child of a binary operator is an expression. The rightmost side of an additiveprime is an additiveprime, so on "tree node type" grounds alone we must conclude that you have built a wrong parse tree.

Keeping track of the "logical type" of every parse artifact is a powerful technique for finding bugs in parsers. Another that I like, which is tragically underused, is attribute every token in the program to exactly one parse tree node. If you did that then you would quickly realize that the token for the operator is logically in two places: in the binary operator, and in its rightmost child. This also tells us that something is wrong.

What doesn't help is that your infrastructure for parsing is a godawful mess of passing around numbers and out parameters. Your parser lacks discipline. Your parser code looks like counting tokens is the most important thing that a parser does, and that everything else is incidental.

Parsing is a very crisp problem, and parser methods should do one thing, one thing only, and do it perfectly. The structure of the parser, and the structure of each method, should directly reflect the grammar being parsed. There should be almost no arithmetic on integers in the parser, since parsing is about building a parse tree, not about counting tokens.

I build recursive descent parsers for a living. Let me show you how I would build this parser, were I building it quickly for my own purposes. (Were I building it for a production application it would be different in many respects, but we're going for easy to understand here.)


All right, let's get started. First thing is: when you are stuck on a problem, solve a simpler problem. Let's simplify the problem in the following manner:

  • Assume that the token stream is a well-formed program. No error detection.
  • Tokens are strings.
  • The grammar is : E ::= T E', E' ::= + T E' | nil, and T is a term consisting of a single token.

All right. Start by creating types that represent each of those things.

sealed class Term : ParseTree 
{
    public string Value { get; private set; }
    public Term(string value) { this.Value = value; }
    public override string ToString() { return this.Value; }
}
sealed class Additive : ParseTree 
{ 
    public ParseTree Term { get; private set; }
    public ParseTree Prime { get; private set; }
    public Additive(ParseTree term, ParseTree prime) {
        this.Term = term;
        this.Prime = prime;
    }
    public override string ToString() { return "" + this.Term + this.Prime; }
}
sealed class AdditivePrime : ParseTree 
{ 
    public string Operator { get; private set; }
    public ParseTree Term { get; private set; }
    public ParseTree Prime { get; private set; }
    public AdditivePrime(string op, ParseTree term, ParseTree prime) {
        this.Operator = op;
        this.Term = term;
        this.Prime = prime;
    }
    public override string ToString() { return this.Operator + this.Term + this.Prime; }
}
sealed class Nil : ParseTree 
{
    public override string ToString() { return ""; }
}

Notice a few things:

  • Abstract classes are abstract.
  • Concrete classes are sealed.
  • Everything is immutable.
  • Everything knows how to print itself.
  • No nulls! NO NULLS. Nulls cause crashes. You have a production called nil, so make a type called Nil to represent it.

Next: What do we want the parser to look like from the user's perspective? We want a sequence of tokens to go in, and we want a parse tree to come out. Great. So the public surface should be:

sealed class Parser
{
    public Parser(List<string> tokens) { ... }
    public ParseTree Parse() { ... }
}

And if we've done everything right then the call site looks like this:

public static void Main()
{
    var tokens = new List<string>() { "1" , "+" , "2" , "+" , "3" , "+" , "4"};
    var parser = new Parser(tokens);
    var result = parser.Parse();
    System.Console.WriteLine(result);
}

Super. Now all we have to do is implement it.

A parser keeps track of the list of tokens and the current token under consideration. Do not tramp that information around from method to method. It is logically part of the parser, so keep it in the parser.

public sealed class Parser
{
    private List<string> tokens;
    private int current;    
    public Parser(List<string> tokens)
    {
        this.tokens = tokens;
        this.current = 0;
    }

The language consists right now only of additive expressions, so:

    public ParseTree Parse()
    {
        return ParseAdditive();
    }

Great, we are done two members of the parser already. Now, what does ParseAdditive do? It does what it says on the tin. It parses an additive expression, which has the grammar E:: T E', so that is what it does and ALL that it does, for now.

private ParseTree ParseAdditive()
{
    var term = ParseTerm();
    var prime = ParseAdditivePrime();
    return new Additive(term, prime);
}

If your parser methods do not look incredibly simple then you are doing something wrong. The whole point of recursive descent parsers is that they are easy to understand and easy to implement.

Now we can see how to implement ParseTerm(); it just consumes a token:

private string Consume() 
{
  var t = this.tokens[this.current];
  this.current += 1;
  return t;
}
private ParseTree ParseTerm() {
  return new Term(Consume());
}

Again, we are assuming that the token stream is well formed. Of course this would crash if it was ill-formed, but that's a problem for another day.

And finally, the last one is a little harder because there are two cases.

private bool OutOfTokens() 
{
  return this.current >= this.tokens.Count;
}
private ParseTree ParseAdditivePrime()
{
    if (OutOfTokens())
        return new Nil();
    var op = Consume();
    var term = ParseTerm();
    var prime = ParseAdditivePrime();
    return new AdditivePrime(op, term, prime);
}

So simple. Again, all your methods should look like exactly what they do.

Notice that I did NOT write

private ParseTree ParseAdditivePrime()
{
    if (this.current >= this.tokens.Count)
        return new Nil();

Keep the text of the program reading like the operations it is implementing. What do we want to know? Are we out of tokens? So say that. Don't make the reader -- yourself -- have to think about oh, did I mean > or < or <= or... just don't. Solve the problem once, put the solution in a method with a good name, and then call the method. Your future self will thank your past self for taking that care.

Notice also that I did not write the C# 7 super slick:

private ParseTree ParseAdditivePrime() => 
  OutOfTokens() ? new Nil() : new AdditivePrime(Consume(), ParseTerm(), ParseAdditivePrime());

That you can write your parser methods as one-liners is a sign that you've designed a good parser, but that doesn't mean that you should. It's often easier to understand and debug a parser if you keep it in sequential statement form rather than slick little one-liners. Exercise good judgment.


OK, we have solved the easier problem! Now let's solve a slightly harder problem. We have solved parsing the grammar E ::= T E', E' ::= + T E' | nil, but the grammar we want to parse is B :: T | B + T.

Notice that I am not confusing E, which is a term-and-suffix, with B, which is either a T or a B, a +, and a T. Since B and E are different, represent them by different types.

Let's make a type for B:

sealed class Binary : ParseTree 
{
    public ParseTree Left { get; private set; }
    public string Operator { get; private set; }
    public ParseTree Right { get; private set; }
    public Binary(ParseTree left, string op, ParseTree right) 
    {
        this.Left = left; 
        this.Operator = op;
        this.Right = right;
    }
    public override string ToString() 
    {
        return "(" + Left + Operator + Right + ")";
    }
}

Note that I have added parentheses in the output as a visual aid to help us see that it is left associative.

Now, suppose we have an Additive in hand and we need a Binary. How are we going to do that?

An additive is always a term and a prime. So there are two cases. Either the prime is nil, or it is not.

If the prime is nil then we're done: a Term is acceptable where a Binary is required, so we can just pass back the term.

If the prime is not nil then the prime is op, term, prime. Somehow we have to get a Binary out of that. A binary needs three things. Remember, we are attributing every token to exactly one node, so this will help us figure it out.

  • We have our left term from the additive.
  • We have our op from the prime.
  • We have our right term from the prime.

But that leaves the prime's prime behind! We need to do something with that. Let's reason it out. Again, there are two cases:

  • If the prime's prime is nil, then the result is the binary.
  • If the prime's prime is not nil, then the result is a new binary with the old binary on the left, and... wait a minute, this is the same algorithm that we just described for transforming an additive into a binary.

We have just discovered that this algorithm is recursive, and once you realize that it is trivial to write:

private static ParseTree AdditiveToBinary(ParseTree left, ParseTree prime) 
{
    if (prime is Nil) return left;
    var reallyprime = (AdditivePrime) prime;
    var binary = new Binary(left, reallyprime.Operator, reallyprime.Term);
    return AdditiveToBinary(binary, reallyprime.Prime);
}

And now we modify our ParseAdditive:

private ParseTree ParseAdditive()
{
    var term = ParseTerm();
    var prime = ParseAdditivePrime();
    return AdditiveToBinary(term, prime);       
}     

And run it:

(((1+2)+3)+4)

And we're done.

Well, not quite. ParseAdditive no longer does what it says on the tin! It says ParseAdditive but it returns a Binary.

In fact... do we need Additive at all? Could we eliminate it entirely from the parser? In fact we could; we now never create an instance of Additive, so it can be deleted, and ParseAdditive can be renamed ParseBinary.

This often happens when constructing programs using the technique of "solve a simpler problem". You end up being able to throw away your earlier work, which is great. Deleted code has no bugs.

  • Exercise: Representing operators as strings is gross. Make an Operator class analogous to Term, and a method that parses operators. Keep raising the level of abstraction of the parser away from concrete strings and into the business domain of the parser. Similarly with tokens; they ought not to be strings.
  • Exercise: We've solved a slightly harder problem, so keep going. Can you now add multiplication?
  • Exercise: Can you parse a language that has a mix of left and right associative operators?

Some additional thoughts:

  • I presume you are doing this either for your own fun, or for a school assignment. Do not copy paste my work into your assignment. That's academic fraud. Remember to properly attribute all work when you submit it if that work is not entirely your own.

  • If you're doing it for fun, have fun designing a language! It's a nice hobby, and if you're really lucky, someday someone will pay you to do it.

  • You're designing your own language, so you do not have to repeat the mistakes of the past. I notice for example that your comments suggest that you're going to add cast expressions. Welcome to a world of pain if you do it like C, C++, C#, Java, and so on. All those languages must have their parsers disambiguate between (x)+y meaning "apply unary plus to y and cast the thing to x", and "add quantity (x) to y". It is a major pain. Consider a better syntax for casts, like an as operator. Also, examine the meaning of a cast; in C# a cast means both "produce an instance of a different type that represents the same value" and "I am asserting that the runtime type of this thing differs from its compile time type; throw if I am wrong". Those operations are completely different but they have the same syntax. All languages are responses to previous languages, so think hard about what you like because it is familiar vs because it is good.

Phonetician answered 15/6, 2018 at 18:39 Comment(9)
Wow! That makes so much more sense! Thanks so much for the help! And for your 3rd to last bullet point, this is just a personal projectChuu
@ColeJohnson: You're welcome! What are the characteristics of your new language? Are you building it for a particular purpose, or as a learning exercise, or both?Phonetician
I never really liked PHP's dynamic typing, so it's basically a strongly typed alternative to PHP. Just for fun and to learnChuu
You know that's what I do for a living, right? I spent most of 2016 and 2017 rewriting the parser for Hack, a language that is an optionally-statically-typed / gradually-typed alternative to PHP. If you are interested to see how I built the parser, well, you'll have to learn to read OCAML, but it is not a hard language once you get past the module system. The code is here: github.com/facebook/hhvm/tree/master/hphp/hack/src/parserPhonetician
Note that the Hack parser is slightly unusual in a couple of ways. First, it is full fidelity. The syntax tree it produces attributes every single character in the file to a unique place in the parse tree. This facilitates writing code formatters and so on, because we do not lose track of comments and whitespace.Phonetician
Second, it is written in pure functional style. Every parse operation returns a parse tree and a new parser that can parse the next production. That way you can recover from error situations by saying "OK, what happens if we parse this as a statement? What if we parse it as a class?..." and we don't have to worry about having to "undo" the changes to the parser because there are no changes to a parser. You just get a new parser and the old one still works.Phonetician
If you are interested in type systems, Hack uses flow based implicit static typing with union and intersection types. The type system code is rather harder to understand than the parser though.Phonetician
Also FYI the operator precedence in PHP is insane. I would not attempt to replicate it. In the Hack parser I broke with PHP and implemented sensible operator precedence.Phonetician
Oh I know that. I want to implement the precedence sanely like C# doesChuu

© 2022 - 2024 — McMap. All rights reserved.