What is responsible for explicit line joining in Python?

It seems that the Python tokenizer isn't responsible for explicit line joining. I mean, if we write the following code in a file script.py:

"one \
two"

and then run python -m tokenize script.py at the command prompt, we get the following table:

0,0-0,0:            ENCODING       'utf-8'
1,0-2,4:            STRING         '"one \\\ntwo"'
2,4-2,5:            NEWLINE        '\n'
3,0-3,0:            ENDMARKER      ''
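
The same token stream can be reproduced programmatically (a small sketch using the stdlib tokenize module; generate_tokens works on text, so it skips the ENCODING token):

import io
import tokenize

src = '"one \\\ntwo"\n'  # the same two source lines as in script.py
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tok.start, tok.end, tokenize.tok_name[tok.type], repr(tok.string))
# The STRING token is '"one \\\ntwo"': the backslash-newline is kept verbatim.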

This means that the second token contains the string '"one \\\ntwo"'. But I expected it to be the string '"one two"' instead. So what actually handles the explicit line joining in Python? Maybe the Python parser does it?
Or maybe there is some separate "evaluation stage" between the tokenization and parsing stages, where each token string is transformed into a representation more convenient for parsing (and that stage is responsible for the explicit line joining)?
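
One way to check is to look at what the parser produces. Parsing the same source with the stdlib ast module (a quick sketch; the exact ast.dump output varies slightly across Python versions) shows that the constant in the tree already has the lines joined:

import ast

tree = ast.parse('"one \\\ntwo"')
print(ast.dump(tree.body[0].value))  # Constant(value='one two')
# The backslash-newline is gone by the time the syntax tree is built.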

Wikipedia says a few words about the evaluation stage in a general programming context (i.e. it doesn't talk specifically about Python), but the description is not very clear to me, and I couldn't come up with any other examples (besides explicit line joining in a string literal) where such an evaluation step would be necessary in Python. And the existence of such an evaluation stage would mean that the parser doesn't get its input directly from the tokenizer...

I also tested how the Python tokenizer treats explicit line joining outside of string literals. I entered

2 + \
3

And got the following tokens:

0,0-0,0:            ENCODING       'utf-8'
1,0-1,1:            NUMBER         '2'
1,2-1,3:            OP             '+'
2,0-2,1:            NUMBER         '3'
2,1-2,2:            NEWLINE        '\n'
3,0-3,0:            ENDMARKER      ''

In this case the tokenizer simply removed the \ symbol, as I expected.
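
The same experiment, scripted (a small sketch analogous to the one above):

import io
import tokenize

src = "2 + \\\n3\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# No token string contains the backslash here: outside of string literals
# the tokenizer consumes the line continuation itself.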

Elwood answered 20/5 at 12:12. Comments (9):
Here the \\\n is preserved in the token, indicating the presence of a line continuation; it is the parser that later turns this into the single string value 'one two'. There are multiple questions here, which is confusing: what do you wish to know? – Submarginal
It's the parser. Not sure what else you want to know. The existing answer here looks like ChatGPT-generated waffle. – Crandall
@Crandall - this is what I wanted to know. I have only one question left after reading the Wikipedia article quoted above (that article confuses me with its concept of a "token evaluation stage"): can we say that during the execution of any Python script, the output of the built-in Python interpreter's tokenizer (which is essentially the same as tokenize.py), i.e. a sequence of tokens (represented as 5-tuples), goes directly to the Python parser (without the intermediate "token evaluation stage" described in Wikipedia)? – Elwood
In other words, is the Python interpreter's tokenizer just a "scanner" in Wikipedia's terms, and not a "scanner + evaluator" (i.e. the Python parser does all the "token evaluation" work; not to be confused with "object evaluation" and "expression evaluation")? – Elwood
Hmm, I'm not sure about that: the tokenizer has already evaluated the types of the tokens (e.g. NUMBER, OP, STRING). For strings, they may still need some unescaping, but this is not something specific to newlines; there could be other kinds of escapes to handle too, e.g. \U0001F4A9 (see the sketch after these comments). – Crandall
@Crandall Wikipedia says that finding the type of a lexeme is done by the scanner, not by the evaluator. Now I think that tokenize is a pure scanner in Wiki terms, mainly for two reasons: 1) it has a function tokenize.untokenize which allows an exact reconstruction of the lexeme from its token ("The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured. The guarantee applies only to the token type and token string, as the spacing between tokens (column positions) may change."); – Elwood
2) at the beginning of the tokenize docs there is the sentence "The tokenize module provides a lexical scanner for Python source code" (of course, there is no guarantee that the docs use the term "scanner" exactly as the wiki does, but it seems so). – Elwood
It may be instructive to compare with the tokenization of "one \\\ntwo". – Polito
Thanks for this question; knowing about the tokenizer clears up a question I've always had: why can't a raw string end with a backslash? The reason appears to be that the tokenizer is responsible for determining the end of the string, and it doesn't know about the difference between raw strings and regular strings; that happens further down the pipeline. – Moneybag
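
Crandall's point about other escapes can be demonstrated in the same way (a small sketch; \U0001F4A9 is just an arbitrary escape): the escape survives tokenization untouched and is only resolved later:

import ast
import io
import tokenize

src = '"\\U0001F4A9"\n'
string_tok = next(tok for tok in tokenize.generate_tokens(io.StringIO(src).readline)
                  if tok.type == tokenize.STRING)
print(repr(string_tok.string))              # '"\\U0001F4A9"' - escape intact
print(ast.literal_eval(string_tok.string))  # resolved only now, to U+1F4A9

And Moneybag's observation about raw strings is easy to reproduce (a minimal sketch; the exact SyntaxError message depends on the Python version):

src = 'r"ends with \\"'  # the source text r"ends with \"
try:
    compile(src, "<test>", "eval")
except SyntaxError as exc:
    print(exc)  # e.g. "unterminated string literal": the tokenizer treats
                # the escaped quote as non-terminating even in a raw string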
