In lex, how do I differentiate between '-' (subtraction) operator and an integer '-3'?

Asked 13/2, 2016 at 23:44 Answered 14/2, 2016 at 3:19

I am writing lex for a specific language where operations are carried out in prefix notation :

(+ a b) --> (a + b)

An integer is defined as follows : An integer can have a negative sign (–) but no positive sign. It can be with or without space(s) between the sign and the digits. If the value is 0, only a single digit 0 is allowed. Otherwise, it is the same as common integer definitions (no leading 0’s).

Order of expressions in the lex is as follows ( first match rule) :

Regex for integer :[\-]?[ ]*((0)|([1-9][0-9]*))
Regex for subtraction operator : "-"

With these definitions, I would like to parse the string - 5 3 ie. (5-3)

Current output

Integer : - 5, 

Integer : 3

Desired output:

Operator : '-'

Integer : 5

Integer : 3

Helix answered 13/2, 2016 at 23:44 Comment(6)

Does your language also allow - as a unary operator? And is your language Lisp-like in that expressions must be parenthesized, or can you write + + 1 2 3? – Winou 13/2, 2016 at 23:52

In the language, - is not a unary operator. It is a simple language for now, so I dont think I will be encountering + + 1 2 3 like expressions. – Helix 13/2, 2016 at 23:56

If the parentheses are required around operators and forbidden around integers, then you can distinguish between unary minus and the minus which is part of an integer, because ( - 5) is not valid. In that case, the - which follows a ( is always an operator and the - which follows an operator or a value must be part of an integer. If those assumptions are not correct, then it's going to be a bit tricky. – Winou 14/2, 2016 at 0:2

I am afraid, those assumptions cannot be made in my context. I think, I will let the parser handle it – Helix 14/2, 2016 at 0:5

OK. I suspect you'll run into ambiguities, so good luck. For example, * - 5 - 3 1 could be (* (-5) (- 3 1)) or (* (- 5 (- 3)) 1) – Winou 14/2, 2016 at 0:9

@Helix It's a unary operator. Otherwise you don't have a problem. – Doerrer 14/2, 2016 at 3:33

You don't. You return - and INTEGER separately to the parser, and let the parser handle unary minus.

Doerrer answered 14/2, 2016 at 3:19 Comment(0)

The lexer does not have to do that: normally the parser (such as one written using yacc) gets tokens for the minus sign and the integer separately. The parser combines the two according to the rules which you provide.

For simple grammars, you can make the lexer do parsing, by using states (also known as start conditions). In your example, there would be states for the left/right parenthesis nesting. If your grammar allows a line-break between a minus sign and an integer, you would need a state to show that you have a minus sign.

Just to recognize an optionally-signed integer all on one line, you could do that with an expression like

[-]?[[:space:]]*[[:digit:]]+

However, your desired output does not combine the sign and the integer. So you would have separate regular expressions for those, e.g.,

[-]           { printf ("Operator: %s\n", yytext); }
0|([1-9][0-9]*)  { printf ("Integer: %s\n", yytext); }

Takara answered 13/2, 2016 at 23:52 Comment(1)

I figured that it is work of the parser, to differentiate the appropriate use of the "-" symbol. But, was just curious. Also, thanks for the regex. However it doesn't cover the leading 0's condition @Thomas Dickey – Helix 14/2, 2016 at 0:0

Recommended topics

Hot tags