Java expression interpretation rules of decrement/increment operators

Asked 17/3, 2016 at 12:55 Answered 17/3, 2016 at 13:17

Solved java syntax lexer decrement interpretation

This is a purely theoretical question; for clarity's sake, I wouldn't normally write code like this.

Why is this seemingly ambiguous statement legal

int a = 1, b = 2;
int c = a---b; // a=0, b=2, c=-1

(it is interpreted as a-- -b)

and this one isn't?

int c = a-----b;

The first statement could also be interpreted as a- --b, while the second statement clearly has only 1 logical interpretation which would be a-- - --b.

Also another curious one:

int c = a--- -b; // a=0, b=2, c=3

(and int c = a----b; isn't a legal statement)

How is expression interpretation defined in Java? I tried searching the JLS, but haven't found an answer for this.

Fullrigged answered 17/3, 2016 at 12:55 Comment(4)

The magic words to search for are java operator precedence. – Metchnikoff 17/3, 2016 at 13:0

I guess a-----b interprets as: ((a--)--)-b which is not legal. – Canaletto 17/3, 2016 at 13:5

@KlasLindbäck No it isn't. The magic word is "lexer" rather than parser. The lexer is the level at which tokens (numbers, identifiers, operators, etc.) are recognized. Operator precedence doesn't come into play until the building of the parse tree. – Humankind 17/3, 2016 at 13:24

@ErwinBolwidt I stand corrected! – Metchnikoff 17/3, 2016 at 14:32

Introduction

To understand this correctly, one needs to realize that all modern compilers have two levels of recognizing the source language, the lexical level and the syntactical level.

The lexical level (the "lexer") splits the source code into tokens: literals (string/numeric/char), operators, identifiers, and other elements of the lexical grammar. These are the "words" and "punctuation characters" of the programming language.

The syntactical level (the "parser") is concerned with interpreting these low-level lexical tokens as syntax, usually represented by syntax trees.

The lexer is the level that needs to know whether a token is a "minus" token (-) or a "decrement" token (--). (Whether the minus token is a unary or a binary minus, or whether the decrement token is a postfix or prefix decrement, is determined at the syntactical level.)

Things like precedence and left-to-right versus right-to-left only exist at the syntactical level. But whether a---b is a -- - b or a - -- b is determined at the lexical level.
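To make the difference concrete, here is a small sketch (the class and method names are made up for illustration) that evaluates the same characters with and without white space. White space only changes where the lexer ends a token; the parser then sees a different token sequence and the results differ.

```java
import java.util.Arrays;

public class DecrementSpacingDemo {
    // Each method returns {a, b, c} after evaluating the expression.

    // No white space: tokenized as a, --, -, b, i.e. (a--) - b
    static int[] noSpaces() {
        int a = 1, b = 2;
        int c = a---b;
        return new int[] {a, b, c};
    }

    // White space forces the tokens a, -, --, b, i.e. a - (--b)
    static int[] spacedAsPreDecrement() {
        int a = 1, b = 2;
        int c = a - --b;
        return new int[] {a, b, c};
    }

    // White space forces the tokens a, --, -, --, b, i.e. (a--) - (--b)
    static int[] bothDecrements() {
        int a = 1, b = 2;
        int c = a-- - --b;
        return new int[] {a, b, c};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(noSpaces()));             // [0, 2, -1]
        System.out.println(Arrays.toString(spacedAsPreDecrement())); // [1, 1, 0]
        System.out.println(Arrays.toString(bothDecrements()));       // [0, 1, 0]
    }
}
```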

Answer

Why a---b becomes a -- - b is described in the Java Language Specification section 3.2 "Lexical Translations":

The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would.

So the longest possible lexical token is formed.

In the case of a---b, it makes the tokens a, -- (longest) then the only possible next token -, then b.

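In the case of a-----b, it would be translated into a, --, --, -, b, which is not grammatically valid: the result of a-- is a value, not a variable, so the second -- cannot be applied to it. The lexer does not go back and try a, --, -, --, b instead.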
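The longest-match rule can be sketched with a toy tokenizer. This is only an illustration under simplifying assumptions, not how javac is implemented: the class name, the tokenize method, and the two-entry operator table are hypothetical, and only identifiers, white space, - and -- are handled.

```java
import java.util.ArrayList;
import java.util.List;

public class MaximalMunchDemo {
    // Hypothetical operator table for this sketch, listed longest first
    // so that the longest possible token is always tried before shorter ones.
    private static final String[] OPERATORS = {"--", "-"};

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char ch = source.charAt(i);
            if (Character.isWhitespace(ch)) {
                i++; // white space ends the current token and is then discarded
                continue;
            }
            if (Character.isJavaIdentifierStart(ch)) {
                int start = i;
                while (i < source.length()
                        && Character.isJavaIdentifierPart(source.charAt(i))) {
                    i++;
                }
                tokens.add(source.substring(start, i));
                continue;
            }
            boolean matched = false;
            for (String op : OPERATORS) {
                if (source.startsWith(op, i)) {
                    tokens.add(op); // take the longest operator that matches here
                    i += op.length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException(
                        "Unexpected character '" + ch + "' at index " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a---b"));     // [a, --, -, b]
        System.out.println(tokenize("a-----b"));   // [a, --, --, -, b]
        System.out.println(tokenize("a-- - --b")); // [a, --, -, --, b]
    }
}
```

Note that the tokenizer never looks at whether the resulting sequence can be parsed; rejecting a, --, --, -, b is left to the syntactical level.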

To quote a bit further:

There are 3 steps in the lexical translation process, and the rule above applies to step 3:

A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:

1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.

2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (§3.4).

3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements (§3.5) which, after white space (§3.6) and comments (§3.7) are discarded, comprise the tokens (§3.5) that are the terminal symbols of the syntactic grammar (§2.3).

(the "tokens" are the "input elements" that remain once white space and comments are discarded)

Humankind answered 17/3, 2016 at 13:17 Comment(1)

That makes sense, Erwin. Looks like I made the incorrect assumption that the Java compiler would try to make some sense of the given expression, rather than use some simple rules to parse the expression even if it doesn't create a legal statement. – Fullrigged 17/3, 2016 at 13:30

To answer this question we'll have to take a look at Java's operator precedence.

The important rules for this example are:

Expressions are evaluated from left to right
The expr-- postfix operator has a higher precedence than the a-b additive operator.

The expression a-----b will therefore be evaluated like this: ((a--)--)-b, which is illegal.

You could bypass those rules by using parentheses in the statement: (a--)-(--b) will be a legal statement.

Stifling answered 17/3, 2016 at 13:10 Comment(4)

a-----b won't compile – Insecure 17/3, 2016 at 13:11

@AndrewTobilko Yes, I know. In this post I'm trying to explain why it's not compiling... – Stifling 17/3, 2016 at 13:12

Operator precedence doesn't apply to the lexer. It ensures that a - b-- is interpreted as a - (b--) rather than (a - b)-- but it doesn't ensure that a---b is interpreted as a-- - b rather than a - --b. – Humankind 17/3, 2016 at 13:21

@ErwinBolwidt That's why I mentioned that expressions are evaluated from left to right – Stifling 17/3, 2016 at 13:23
