Attribute access on int literals

>>> 1 .__hash__()
1
>>> 1.__hash__()
  File "<stdin>", line 1
    1.__hash__()
             ^
SyntaxError: invalid syntax

It has been covered here before that the second example doesn't work because the int literal together with the following dot is actually tokenized as a float.

My question is: why doesn't Python parse this as attribute access on an int, when the interpretation as a float is a syntax error? The docs section on lexical analysis seems to suggest that whitespace is only required when other interpretations are ambiguous, but perhaps I'm reading that section wrong.

On a hunch, it seems like the lexer is greedy (it tries to take the biggest token possible), but I have no source for this claim.
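
For what it's worth, anything that ends the number token before the dot sidesteps the problem; besides the space above, wrapping the literal in parentheses works too:

>>> (1).__hash__()
1

But I'm curious about the why, not about workarounds.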

Coping answered 8/10, 2014 at 11:23

Read carefully; it says:

Whitespace is needed between two tokens only if their concatenation could otherwise be interpreted as a different token (e.g., ab is one token, but a b is two tokens).

1.__hash__() is tokenized as:

import io, tokenize
# tokenize() wants a readline callable, not the raw bytes
for token in tokenize.tokenize(io.BytesIO(b"1.__hash__()").readline):
    print(token.string)

#>>> utf-8
#>>> 1.
#>>> __hash__
#>>> (
#>>> )
#>>>

Python's lexer always chooses the longest possible string that forms a legal token, reading from left to right (the "maximal munch" rule); it never cuts a token short just because the shorter split would make the parser happy. The logic is very similar to that in your other question.
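
To see the contrast, run the same snippet on the version with the space; the longest legal token starting at 1 is just 1, so the dot survives as a separate operator token:

import io, tokenize
for token in tokenize.tokenize(io.BytesIO(b"1 .__hash__()").readline):
    print(token.string)

#>>> utf-8
#>>> 1
#>>> .
#>>> __hash__
#>>> (
#>>> )
#>>>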

The confusion seems to come from not recognizing tokenizing as a completely distinct step from parsing. If the grammar allowed splitting up tokens solely to make the parser happy, then surely you'd expect

_ or1.

to tokenize as

_
or
1.

but there is no such rule, so it tokenizes as

_
or1
. 
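
You can check this with the same tokenize snippet as above:

import io, tokenize
for token in tokenize.tokenize(io.BytesIO(b"_ or1.").readline):
    print(token.string)

#>>> utf-8
#>>> _
#>>> or1
#>>> .
#>>>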
Egret answered 8/10, 2014 at 13:47

The lexer is very simple and will not backtrack. Language processors are often divided into a lexing phase and a parsing phase: the lexer breaks the character stream into tokens, and the parser then determines a program structure from those tokens. Here the lexer sees four tokens: 1., __hash__, (, and ) (a float, an identifier, an open-paren, and a close-paren). The parser can't make sense of that token sequence, but that doesn't mean the lexer will go back and try to lex the characters differently.
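
A quick sketch of the two phases in action (this reflects the CPython versions current when this was written; newer interpreters catch the problem in the tokenizer itself and report it as an "invalid decimal literal"):

import ast, io, tokenize

source = b"1.__hash__()"

# Phase 1: lexing succeeds and yields float, identifier, parens.
# (On newer CPythons this loop itself raises the SyntaxError.)
for token in tokenize.tokenize(io.BytesIO(source).readline):
    print(token.string)

# Phase 2: parsing fails; no grammar rule matches a float
# immediately followed by a name.
try:
    ast.parse(source.decode())
except SyntaxError as err:
    print("SyntaxError:", err.msg)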

Varipapa answered 8/10, 2014 at 11:49

It’s simply a matter of definition; for programming languages, the grammar does that job.

Attribute references are defined at a much broader level of the grammar than floating point literals. So at the grammar level, the parser has to recognize 1. as a floating point literal and not as an attribute reference.

Of course, the parser itself could backtrack when it reaches the _ and figure out that it’s looking not at a floating point literal but at an attribute reference. However, since CPython’s parser is an LL(1) parser, backtracking is not an option. As such, the grammar would have to change a lot to allow the parser to recognize this (and I’m not sure it’s even possible with an LL(1) parser). We could also change Python’s parser to something else, maybe one that does backtrack, but doing so is not only a very difficult task (it would require changing the grammar too), it would also increase the complexity of the parsing process a lot (and with that, likely decrease the speed).

So maybe it would be possible, but it would require major changes to the language specification, and that alone would be problematic. It would also break existing code that makes use of this early float recognition, e.g. 1.if True else 0.
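
You can verify with the standard tokenize module that the float literal wins there as well (recent CPython versions warn about a numeric literal immediately followed by a keyword, but the tokenization is the same):

import io, tokenize
for token in tokenize.tokenize(io.BytesIO(b"1.if True else 0.").readline):
    print(token.string)

#>>> utf-8
#>>> 1.
#>>> if
#>>> True
#>>> else
#>>> 0.
#>>>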

Amortize answered 8/10, 2014 at 11:50
