Syntax tree empty nodes issue with Spirit Qi MiniC example

Asked 31/10, 2013 at 13:2 Answered 31/10, 2013 at 19:29

Solved c++parsing expression boost-spirit boost-spirit-qi

Dear Spirit Qi experts.

I have played around with the MiniC example in Spirit Qi, and have noticed an issue with "empty" AST nodes in the expression grammar. It will generate "expression" nodes that hold nothing but a single operand of type "expression" again.

I think the issue is due to the recursive rule definitions for the operator precedence:

// expression.hpp
    qi::rule<Iterator, ast::expression(), skipper<Iterator> >
        expr, equality_expr, relational_expr,
        logical_or_expr, logical_and_expr,
        additive_expr, multiplicative_expr
        ;

// expression_def.hpp
    expr =
        logical_or_expr.alias()
        ;

    logical_or_expr =
            logical_and_expr
        >> *(logical_or_op > logical_and_expr)
        ;

    logical_and_expr =
            equality_expr
        >> *(logical_and_op > equality_expr)
        ;

// ast.hpp

typedef boost::variant<
        nil
      , bool
      , unsigned int
      , identifier
      , boost::recursive_wrapper<unary>
      , boost::recursive_wrapper<function_call>
      , boost::recursive_wrapper<expression>
    >
operand;
/*...*/
struct expression
{
    operand first;
    std::list<operation> rest;
};

When logical_or_expr recurses into logical_and_expr, logical_and_expr will return an expression(). Due to the attribute propagation is Spirit, logical_or_expr then assigns this expression to it's expression's "first" member.

I think this is what generates all these "empty" expression nodes. I find that, however, nasty, and would like to get rid of them, but have not yet been successful. Has anyone found a decent solution to this before?

I'm thinking it would be somehow possible if an expression consisted of binary ops and unary ops. This would then also have the advantage of keeping the operation and operands in the same structure (pseudo-code):

struct binary_op
{
    optype type;
    operand op1;
    operand op2;
}

struct unary_op
{
    optype type;
    operand op1;
}

struct eval_op
{
    operand op1;
}

typedef boost::variant<binary_op,
                       unary_op,
                       eval_op>
expression;

typedef boost::variant<int,
                       bool,
                       boost::recursive_wrapper<expression> >
operand;

However, I believe I would still encounter this "empty node" issue after playing this out on paper. It seems like I'm chasing my own tail.

Does anyone have an idea on how to approach this issue?

EDIT

a > b

Will parse to:

    expression (from logical_or_op) // Empty expression (forward only)
(rest)  -/  \- (first) expression (from logical_and_op) // Empty expression (forward only)
            (rest)  -/  \- (first) expression (from equality_op) // Empty expression (forward only)
                    (rest)  -/ \- (first) expression (from relational_op) // "Correct" expression
                            (first) a   -/      \- (rest)
                                                    [0] operator_: op_greater
                                                        operand_: b

The desired output would be:

            expression (from relational_op)
(first) a   -/      \- (rest)
                        [0] operator_: op_greater
                            operand_: b

EDIT2

I have uploaded a modified mini-c version that prints the generated expression tree for an expression:

Link

If you look at the included A.avxh file, it contains an expression to parse. Set a breakpoint in main.cpp at line 67, and observe how often it recursively steps in there.

Veii answered 31/10, 2013 at 13:2 Comment(3)

why don't you name an example of what expression results in 'spurious' ast nodes, and what you'd expect? – Andriette 31/10, 2013 at 15:57

Fair enough :] I have tried to draw a simple tree to demonstrate what is happening to me. I hope this makes things clearer. I had noticed this when printing the tree indented, that higher-precedence operations are indented way to far and detected this. – Veii 31/10, 2013 at 16:16

In other words: Every evaluation in the grammar succeeds. For example, logical_and_op. This grammar always succeeds, and outputs an expression with "first" set to another expression and "rest" being an empty list. So for an int literal, there will be as many "empty" expressions as there are precedence levels. – Veii 31/10, 2013 at 16:25

This is because all the rules expose a "raw" ast::expression and this is a fixed type.

It has clearly been a choice for simplicity in this sample: the benefits are

By making all expression nodes the same type, the tree visitation code can be simple and uniform.
All rules have the same 'signature' and follow the same pattern. It's easy to reason about.
In this particular example (mini_c) the code-gen phase benefits by inheriting the same simplicity.

The usual way to have more flexible AST that follows the semantics more closely would be by making expression a variant instead: that way each expression can directly contain the actual "concrete" subexpression type instead of "wading" through intermediate levels of node types, just to mimic the grammar structure instead of the semantics.

I think I have several examples of such asts and corresponding grammars on this site, will see if I can link one later.

^{In fact, I think the typedef variant<...> statement in the same example (ast.hpp) is not a bad example of this approach.}

Relevant links:

Boost::Spirit Expression Parser

Boost::spirit how to parse and call c++ function-like expressions this being a question which I answered by providing a fully featured expression parser (with 'builtin function' evaluation), in two ways:

A 'bells and whistles' approach with a full AST (resembling what we're doing here with the mini_c sample expressions)

A 'pragmatic' approach which evaluates on-the-fly using just semantic actions - note this is optimal for storage, but has some downsides, which I address in that post

For now, if you don't wish to alter the grammar (so as not to "lose" the simplicity) you could instead do a transformation on the AST (a "simplify" pass on the expressions, so to speak).

I'm going to see what I can come up with in the next hour.

I've just refactored the grammar so that it doesn't result in such deep nesting.

First, let's reduce the test to a standalone test bed that just parses expressions and shows how a simple expression ("42") parses to a deeply nested AST: http://coliru.stacked-crooked.com/a/5467ca41b0ac1d03
```
<expr>
  ...
  <success></success>
  <attributes>[[[[[[[42, []], []], []], []], []], []]]</attributes>
</expr>
```
Next, let's remove the root problem: the grammar returns an invariant type (ast::expression) which is too heavy in many cases. Instead, we'd like to return an ast::operand (which is variant, and can contain the same ast::expression node): http://coliru.stacked-crooked.com/a/00e43b1f61db018c

Lastly we'd like all rules to become variant as well and returning either an expression iff there are trailing operations, or just a lone operand in the other case, in pseudo code:

logical_or_expr = 
      (logical_and_expr >> +(logical_or_op > logical_and_expr)
    | logical_and_expr
    ;

_{Note the subtle use if +(...) instead of *(...) to mandate at least one trailing logical_or operation}

Now, Spirit will have to be told how to assign the first branch to an operand attribute. qi::attr_cast<ast::expression>(...) should have been the fix here, but sadly I ran into undefined behaviour (this took the most time). I settled for a more verbose solution:

_logical_or_expr = logical_and_expr    >> +(logical_or_op     > logical_and_expr) ;
logical_or_expr  = _logical_or_expr | logical_and_expr;

As you can probably see, it's just the same, but with the first branch extracted to a separate rule, so we can just declare the rules to expose the desired attributes:

qi::rule<It, ast::operand(),    Skipper> logical_or_expr;
qi::rule<It, ast::expression(), Skipper> _logical_or_expr;

Indeed doing this for each precedence level of subexpressions, results in the following:

_logical_or_expr     = logical_and_expr    >> +(logical_or_op     > logical_and_expr) ;
_multiplicative_expr = unary_expr          >> *(multiplicative_op > unary_expr) ;
_additive_expr       = multiplicative_expr >> +(additive_op       > multiplicative_expr) ;
_relational_expr     = additive_expr       >> +(relational_op     > additive_expr) ;
_equality_expr       = relational_expr     >> +(equality_op       > relational_expr) ;
_logical_and_expr    = equality_expr       >> +(logical_and_op    > equality_expr) ;

logical_or_expr     = _logical_or_expr     | logical_and_expr        ;
logical_and_expr    = _logical_and_expr    | equality_expr           ;
equality_expr       = _equality_expr       | relational_expr         ;
relational_expr     = _relational_expr     | additive_expr           ;
additive_expr       = _additive_expr       | multiplicative_expr     ;
multiplicative_expr = _multiplicative_expr | attr_cast<ast::operand, ast::operand> (unary_expr) ;

The end result is here: http://coliru.stacked-crooked.com/a/8539757bb02fca34 (sadly it is just too much for Coliru), and it prints:

<expr>
  ...
  </logical_or_expr>
  <success></success>
  <attributes>[[42, []]]</attributes>
</expr>

CAVEAT Note that this adaptation will NOT make the parser more efficient! In fact, it will just result in a boatload of backtracking (the debug output counts 925 lines instead of just 45 in Step 1 (!!)).

Now, there will be some room to optimize using look-ahead assertions and/or semantic actions, but in general you will have to consider that optimizing for AST storage is going to cost CPU time.

Andriette answered 31/10, 2013 at 19:29 Comment(15)

I know it was hard to explain. Haha. Well my grammar will be (unfortunately) more complicated in the end, also with more precedence levels. I would much rather have binop_expr, unop_expr, and eval_expression/operands. However, I got stuck on that approach and hence looked at the minic and discovered this. – Veii 31/10, 2013 at 20:15

I also wanted to make an AST type for all expressions, but that doesn't seem necessary as they all fall into binop/unop categories. However, how would I incorporate the final "operand" type then? Is my understanding wrong that I would need recursive grammar expressions to support nesting? – Veii 31/10, 2013 at 20:17

I've just arrived at a refactoring of the grammar instead. Lemme add it to the answer. – Andriette 31/10, 2013 at 20:42

@Veii There. Phew. Putting that stuff on Coliru is always exceedingly painful. However, it's there now! I hope it helps you in some way. – Andriette 31/10, 2013 at 21:23

Wow. I'll have to read this a few times. Thank you so much for this idea! I've been sitting on this for a few days now. What sort of undefined behavior does attr_cast() invoke?? – Veii 31/10, 2013 at 21:30

@Veii attr_cast in itself likely doesn't invoke UB :) It must have been user error, I just didn't see it, so I 'bailed' and went for separate rules. Note I've just filled in the 'relevant links' with a comprehensive index to other posts that are doing basically the same thing. I have the suspicion you will actually find more helpful informations in the linked posts – Andriette 31/10, 2013 at 21:33

Haha "didn't see it" is so hard to believe with your spirit skill level. My solution was to iterate through and delete empty nodes (I couldn't attach a semantic action, as you did here: #8465469). Definitely not as elegant. I'll go through this, it may take me a little bit, figuring out how to make groups left or right associative now. I noticed I don't really need rules for all operators, just big rules for every precedence group. – Veii 31/10, 2013 at 21:41

Btw, what's with the "nil" struct in all examples? Is this only there to provide the variant with a default constructable type? – Veii 31/10, 2013 at 21:44

@Veii I too can get stuck sometimes, and my usual weapon of choice (valgrind --db-attach=yes) failed me because I have upgraded glibc probably a little too enthusiastically on my linux box :/ So I admitted defeat. Just this time :) – Andriette 31/10, 2013 at 21:45

let us continue this discussion in chat – Andriette 31/10, 2013 at 21:45

FWIW I have just discovered the source of the bug with attr_cast and here's a fixed version without the extra helper rules: coliru.stacked-crooked.com/a/debb5c3d7edd66ff. I know it's not exciting, but it solves the mystery :) And is 3 LoC shorter (actually 12, if you don't count the bug workaround as_expr) – Andriette 5/1, 2014 at 23:45

Thanks for the update, sehe! I hope to find this committed to the next/next-after boost version. I knew if someone could find it it'd be you or Hartmut Kaiser :) Did you just happen to stumble upon it or debugged for a while? – Veii 6/1, 2014 at 10:52

I debugged it for a while after someone else had a question about it on the mailing list. I have a (bad?) tendency to just workaround issues if it's for myself... :/ – Andriette 6/1, 2014 at 11:5

I suppose it just comes down to the "this if for principle now" mentality and the severity of design implications. I have a tendency to litter dev code with pragma messages of things that ought to be reworked because they're not optimal. – Veii 6/1, 2014 at 13:24

@Veii Yup. Me too. Then, at the end of the sprint, I weed out those that realistically aren't necessary – Andriette 6/1, 2014 at 14:0

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags