I don't think there is any consensus to the question of whether 42foo
should be recognised as an invalid number or as two tokens. It's a question of style and both usages are common in well-known languages.
For example:
$ python -c 'print 42and False'
False
$ lua -e 'print(42and false)'
lua: (command line):1: malformed number near '42a'
$ perl -le 'print 42and 0'
42
# Not an idiosyncracy of tcc; it's defined by the standard
$ tcc -D"and=&&" -run - <<<"main(){return 42and 0;}"
stdin:1: error: invalid number
# gcc has better error messages
$ gcc -D"and=&&" -x c - <<<"main(){return 42and 0;}" && ./a.out
<stdin>: In function ‘main’:
<stdin>:1:15: error: invalid suffix "and" on integer constant
<stdin>:1:21: error: expected ‘;’ before numeric constant
$ ruby -le 'print 42and 1'
42
# And now for something completely different (explained below)
$ awk 'BEGIN{print 42foo + 3}'
423
So, both possibilities are in common use.
If you're going to reject it because you think a number and a word should be separated by whitespace, you should reject it in the lexer. The parser cannot (or should not) know whether whitespace separates two tokens. Independent of the validity of 42and
, the fragments 42 + 1
, 42+1
, and 42+ 1
) should all be parsed identically. (Except, perhaps, in Fortress. But that was an anomaly.) If you don't mind shoving numbers and words together, then let the parser reject it if (and only if) it is a syntax error.
As a side note, in C and C++, 42and
is initially lexed as a "preprocessor number". After preprocessing, it needs to be relexed and it is at that point that the error message is produced. The reason for this odd behaviour is that it is completely legitimate to paste together two fragments to produce a valid number:
$ gcc -D"c_(x,y)=x##y" -D"c(x,y)=c_(x,y)" -x c - <<<"int main(){return c(12E,1F);}"
$ ./a.out; echo $?
120
Both 12E
and 1F
would be invalid integers, but pasted together with the ##
operator, they form a perfectly legitimate float. The ##
operator only works on single tokens, so 12E
and 1F
both need to lexed as single tokens. c(12E+,1F)
wouldn't work, but c(12E0,1F)
is also fine.
This is also why you should always put spaces around the +
operator in C: classic trick C question: "What is the value of 0x1E+2
?"
Finally, the explanation for the awk line:
$ awk 'BEGIN{print 42foo + 3}'
423
That's lexed by awk as BEGIN{print 42 foo + 3}
which is then parsed as though it had been written BEGIN{print (42)(foo + 3);}
. In awk, string concatenation is written without an operator, but it binds less tightly than any arithmetic operator. Consequently, the usual advice is to use explicit parentheses in expressions which involve concatenation, unless they are really simple. (Also, undefined variables are assumed to have the value 0
if used arithmetically and ""
if used as strings.)