It seems that the Python tokenizer isn't responsible for the explicit line joining. That is, if we write the following code in a file script.py:
"one \
two"
and then run python -m tokenize script.py in the command prompt, we get the following table:
0,0-0,0: ENCODING 'utf-8'
1,0-2,4: STRING '"one \\\ntwo"'
2,4-2,5: NEWLINE '\n'
3,0-3,0: ENDMARKER ''
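(For reference, the same output can be reproduced programmatically; the snippet below is just a sketch that feeds the source text through tokenize.generate_tokens, which gives the same tokens apart from the initial ENCODING one.)
import io
import tokenize

src = '"one \\\ntwo"\n'
# yields the STRING, NEWLINE and ENDMARKER tokens shown above
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tok)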
This means that the second token contains the string '"one \\\ntwo"', but I expected it to be the string '"one two"' instead. So what actually handles the explicit line joining in Python? Does the Python parser do it?
Or is there some separate "evaluation stage" between the tokenization and parsing stages, where each token string is transformed into a representation more convenient for the parser (and that stage is responsible for explicit line joining)?
Wikipedia says a few words about an evaluation stage in the general programming context (i.e. it doesn't talk specifically about Python), but that description isn't very clear to me, and I couldn't come up with any other examples (besides explicit line joining in string literals) where such an evaluation stage would be necessary in Python. Also, the existence of such a stage would mean that the parser doesn't get its input directly from the tokenizer...
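For example, if I take the token's text and evaluate it as a string literal (a quick check with ast.literal_eval; I'm assuming this roughly mimics whatever happens at that later stage), the backslash-newline does disappear:
import ast

token_text = '"one \\\ntwo"'   # the STRING token exactly as the tokenizer reported it
value = ast.literal_eval(token_text)
print(repr(value))             # prints 'one two'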
I also tested how the Python tokenizer treats explicit line joining outside string literals. I entered
2 + \
3
and got the following tokens:
0,0-0,0: ENCODING 'utf-8'
1,0-1,1: NUMBER '2'
1,2-1,3: OP '+'
2,0-2,1: NUMBER '3'
2,1-2,2: NEWLINE '\n'
3,0-3,0: ENDMARKER ''
In this case the tokenizer simply removed the \ symbol, as I expected.
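As a sanity check (just a sketch, assuming ast.parse reflects what the parser ultimately sees), both spellings of the expression produce the same AST, so the joining is invisible at that level:
import ast

joined = ast.parse("2 + \\\n3")
plain = ast.parse("2 + 3")
# ast.dump omits line/column info by default, so equal dumps mean equal structure
print(ast.dump(joined) == ast.dump(plain))   # True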
"one \\\ntwo"
– Polito