Introduction
To understand this correctly, one needs to realize that all modern compilers have two levels of recognizing the source language, the lexical level and the syntactical level.
The lexical level (the "lexer") splits the source code into tokens: literals (string/numeric/char), operators, identifiers, and other elements of the lexical grammar. These are the "words" and "punctuation characters" of the programming language.
The syntactical level (the "parser") is concerned with interpreting these low-level lexical tokens as syntax, usually represented by syntax trees.
The lexer is the level that needs to know if a token is a "minus" token (-) or a "decrement" token (--). (Whether the minus token is a unary or a binary minus, or whether the decrement token is a post- or pre-decrement token, is determined at the syntactical level.)
Things like precedence and left-to-right versus right-to-left associativity only exist at the syntactical level. But whether a---b is a -- - b or a - -- b is determined at the lexical level.
Answer
Why a---b becomes a -- - b is described in the Java Language Specification, section 3.2 "Lexical Translations":
The longest possible translation is used at each step, even if the
result does not ultimately make a correct program while another
lexical translation would.
So the longest possible lexical token is formed.
In the case of a---b, it makes the tokens a, -- (longest), then the only possible next token -, then b.
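To see the effect at run time, here is a minimal sketch (the class name, method names, and the values 5 and 2 are just illustrative assumptions). It compares the unspaced expression with a version where white space forces the other tokenization:

```java
import java.util.Arrays;

class TokenizationDemo {
    // Lexed as a -- - b: post-decrement a, then subtract b.
    static int[] unspaced() {
        int a = 5, b = 2;
        int result = a---b;
        return new int[] {result, a, b};
    }

    // White space forces the tokens a - -- b: pre-decrement b, then subtract.
    static int[] spaced() {
        int a = 5, b = 2;
        int result = a - --b;
        return new int[] {result, a, b};
    }

    public static void main(String[] args) {
        // Each array is {result, a, b} after evaluation.
        System.out.println(Arrays.toString(unspaced())); // [3, 4, 2]
        System.out.println(Arrays.toString(spaced()));   // [4, 5, 1]
    }
}
```

In the unspaced case a is decremented and b is untouched, which is what you would expect if the lexer produced a, --, -, b.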
In the case of a-----b, it would be translated into a, --, --, -, b, which is not grammatically valid.
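The "longest possible token" rule can be sketched with a toy tokenizer. This is not how javac is implemented; it is a simplified illustration that only knows identifiers and a hypothetical two-operator subset (-- and -), and the class and method names are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

class MaximalMunch {
    // Hypothetical operator subset, listed longest first so that
    // the first match found is also the longest match.
    private static final String[] OPERATORS = {"--", "-"};

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // white space separates tokens and is then discarded
                continue;
            }
            if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) {
                    i++;
                }
                tokens.add(src.substring(start, i));
                continue;
            }
            String matched = null;
            for (String op : OPERATORS) {
                if (src.startsWith(op, i)) {
                    matched = op;
                    break;
                }
            }
            if (matched == null) {
                throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' at index " + i);
            }
            tokens.add(matched);
            i += matched.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a---b"));   // [a, --, -, b]
        System.out.println(tokenize("a-----b")); // [a, --, --, -, b]
        System.out.println(tokenize("a - --b")); // [a, -, --, b]
    }
}
```

Note that the tokenizer never looks ahead to check whether the resulting token sequence can be parsed. It greedily takes -- whenever it can, which is why a-----b ends up as a sequence the parser then rejects.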
To quote a bit further: there are three steps in the lexical translation process, and the rule above applies to step 3:
A raw Unicode character stream is translated into a sequence of
tokens, using the following three lexical translation steps, which are
applied in turn:
1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode
characters to the corresponding Unicode character. A Unicode escape of
the form \uxxxx, where xxxx is a hexadecimal value, represents the
UTF-16 code unit whose encoding is xxxx. This translation step allows
any program to be expressed using only ASCII characters.
2. A translation of the Unicode stream resulting from step 1 into a
stream of input characters and line terminators (§3.4).
3. A translation of the stream of input characters and line terminators
resulting from step 2 into a sequence of input elements (§3.5) which,
after white space (§3.6) and comments (§3.7) are discarded, comprise
the tokens (§3.5) that are the terminal symbols of the syntactic
grammar (§2.3).
(Tokens are the "input elements" that remain once white space and comments are discarded.)
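Because Unicode escapes are translated in step 1, before tokens are formed in step 3, escaped minus signs should take part in the longest-match rule just like literal ones. A small sketch of that, assuming the illustrative values 5 and 2 (the class and method names are made up for the example):

```java
class UnicodeEscapeDemo {
    static int[] evaluate() {
        int a = 5, b = 2;
        // Each escape below stands for a minus sign. After step 1 the
        // expression reads a---b, so step 3 yields the tokens a, --, -, b.
        int result = a\u002d\u002d-b;
        return new int[] {result, a, b};
    }

    public static void main(String[] args) {
        int[] r = evaluate();
        // Prints the result, then a and b after evaluation.
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // 3 4 2
    }
}
```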
java operator precedence. – Metchnikoff
a-----b is interpreted as ((a--)--)-b, which is not legal. – Canaletto