Handling extra operators in Shunting-yard
Asked Answered
B

1

7

Given an input like this: 3+4+ Algorithm turns it in to 3 4 + +

I can find the error when it's time to execute the postfix expression. But, is it possible to spot this during the conversion?

(Wikipedia article and internet articles I've read do not handle this case)

Thank you

Burkley answered 5/5, 2013 at 0:29 Comment(4)
which algorithm does this conversion?Gnarly
en.wikipedia.org/wiki/Shunting-yard_algorithmBurkley
If you are using 3+4+ as an input for converting to postfix, your code should first do the validation and return invalid formatGnarly
@Gnarly Validating an infix expression is just as hard as conversion to postfix. Especially since i have single-input operators like ^ ! ~. Its like having 2 separate parsers to handle expressions.Burkley
S
28

Valid expressions can be validated with a regular expression, aside from parenthesis mismatching. (Mismatched parentheses will be caught by the shunting-yard algorithm as indicated in the wikipedia page, so I'm ignoring those.)

The regular expression is as follows:

PRE* OP POST* (INF PRE* OP POST*)*

where:

PRE  is a prefix operator or (
POST is a postfix operator or )
INF  is an infix operator
OP   is an operand (a literal or a variable name)

The Wikipedia page you linked includes function calls but not array subscripting; if you want to handle these cases, then:

PRE  is a prefix operator or (
POST is a postfix operator or ) or ]
INF  is an infix operator or ( or ,
OP   is an operand, including function names

One thing to note in the above is that PRE and INF are in disjoint contexts; that is, if PRE is valid, then INF is not, and vice versa. This implies that using the same symbol for both a prefix operator and an infix operator is unambiguous. Also, PRE and POST are in disjoint contexts, so you can use the same symbol for prefix and postfix operators. However, you cannot use the same symbol for postfix and infix operators. As examples, consider the C/C++ operators:

-  prefix or infix
+  prefix or infix
-- prefix or postfix
++ prefix or postfix

I implicitly used this feature above to allow ( to be used either as an expression grouper (effectively prefix) and as a function call (infix, because it comes between the function name and the argument list; the operator is "call".)

It's most common to implement that regular expression as a state machine, because there are only two states:

 +-----+            +-----+
 |State|-----OP---->|State|
 |  1  |<----INF----|  2  |
 |     |---+        |     |---+
 +-----+   |        +-----+   |
    ^     PRE          ^     POST
    |      |           |      |
    +------+           +------+

We could call State 1 "want operand" and State 2 "have operand". A simple implementation would just split the shunting yard algorithm as presented in wikipedia into two loops, something like this (if you don't like goto, it can be eliminated, but it really is the clearest way to present a state machine).

want_operand:
  read a token. If there are no more tokens, announce an error.
  if the token is an prefix operator or an '(':
    mark it as prefix and push it onto the operator stack
    goto want_operand
  if the token is an operand (identifier or variable):
    add it to the output queue
    goto have_operand
  if the token is anything else, announce an error and stop. (**)

have_operand:
  read a token
  if there are no more tokens:
    pop all operators off the stack, adding each one to the output queue.
    if a `(` is found on the stack, announce an error and stop.
  if the token is a postfix operator:
    mark it as postfix and add it to the output queue
    goto have_operand.
  if the token is a ')':
    while the top of the stack is not '(':
      pop an operator off the stack and add it to the output queue
    if the stack becomes empty, announce an error and stop.
    if the '(' is marked infix, add a "call" operator to the output queue (*)
    pop the '(' off the top of the stack
    goto have_operand
  if the token is a ',':
    while the top of the stack is not '(':
      pop an operator off the stack and add it to the output queue
    if the stack becomes empty, announce an error
    goto want_operand
  if the token is an infix operator:
    (see the wikipeda entry for precedence handling)
    (normally, all prefix operators are considered to have higher precedence
    than infix operators.)
    got to want_operand
  otherwise, token is an operand. Announce an error

(*) The procedure as described above does not deal gracefully with parameter lists;
    when the postfix expression is being evaluated and a "call" operator is found, it's
    not clear how many arguments need to be evaluated. It might be that function names
    are clearly identifiable, so that the evaluator can just attach arguments to the
    "call" until a function name is found. But a cleaner solution is for the "," handler
    to increment the argument count of the "call" operator (that is, the open
    parenthesis marked as "infix"). The ")" handler also needs to increment the
    argument count.

(**) The state machine as presented does not correctly handle function calls with 0
    arguments. This call will show up as a ")" read in the want_operand state and with
    a "call" operator on top of the stack. If the "call" operator is marked with an
    argument count, as above, then the argument count must be 0 when the ")" is read,
    and in this case, unlike the have_operand case, the argument count must not be
    incremented.
Stephanistephania answered 6/5, 2013 at 4:46 Comment(4)
Such an awesome post :). Regex is an interesting idea. It's possible to call not start of the parser, because each function call/variable/constructor may include separate expressions. That makes any state machine testing turn basically into a LR table. As a preliminary error system, i am counting two operands (+ * - / %) and checking if( #of2oper + 1 == #ofValue). That's possible because values are parsed separately with their of own parsers (function call/variable parser/constructor etc...). So i am going hybrid stack/recursive on this one :) . Thank you!Burkley
@mikbal: you probably can use an operator-precedence grammar for all values. The Wikipedia page shows how to do function calls, and I expand on the subject a bit; that should also work for constructors. Operator-precedence is indeed an LR technique; LALR(1) parsers turned out to be more powerful, but many languages can be parsed just as well with op-prec/shunting-yard/Pratt or other variants.Stephanistephania
This answer goes beyond the specific error mentioned in the question. It presents a mechanism for validating the input to the algorithm is valid infix notation via the state machine.Cotangent
Thanks for such a beautiful and clear explanation. This should be linked to the official wikipedia page for its details and simplicity. 7 years on, and this is still the most informative post.Propinquity

© 2022 - 2024 — McMap. All rights reserved.