Why does Python return [15] for [0xfor x in (1, 2, 3)]? [duplicate]

Asked 13/4, 2021 at 22:12 Answered 14/4, 2021 at 16:14

Solved python python-3.x operator-precedence short-circuiting

When running the following line:

>>> [0xfor x in (1, 2, 3)]

I expected Python to return an error.

Instead, the REPL returns:

[15]

What can possibly be the reason?

Across answered 13/4, 2021 at 22:12 Comment(6)

Note that Python sees this as [0xf or x in (1, 2, 3)]. You've actually found a minor bug in Stack Overflow's syntax highlighter, as it renders the 0xfor without colouring the or ;) – Methodical 14/4, 2021 at 0:40

Quite unexpected... obviously this is useful for codegolfing, but it doesn't feel really consistent at all with the rest of the syntax. IMHO I'd have preferred if strings of consecutive alphanumeric characters were always considered single tokens. – Postmeridian 14/4, 2021 at 7:37

I feel like it's a bug in the parser. For the record, it gives the same result with 3or 4 or "hello"and 5. I suspect it's a consequence to accommodate cases for binary operators such as "3>4", but in the case of comparison ops, it's not a straight connection as you can't do 3and5. I posted in python-dev and see what they say – Stansbury 14/4, 2021 at 10:25

Also note that it's not due to the new parser. The behavior is found also in python 2.7 – Stansbury 14/4, 2021 at 10:29

Storchaka verbatim "it does not contradict specification, but looks pretty confusing, so we will likely change specification and implementation to prevent confusion.". It is also known since 2018. – Stansbury 14/4, 2021 at 10:42

@StefanoBorini "hello"and 5 and 3>5 are different. " and > are not valid in identifiers or other forms of expressions. What is unexpected is that a string of pure alphanumeric characters (i.e. [a-z0-9]) can be interpreted as 2 tokens instead of one "randomly" – Postmeridian 14/4, 2021 at 12:40

104

TL;DR

Python reads the expression as [0xf or (x in (1, 2, 3))], because of two things:

The Python tokenizer.
Operator precedence.

It never raises NameError thanks to short-circuit evaluation: if the expression to the left of the or operator is truthy, Python never evaluates the right side of it.

Parsing hexadecimal numbers

First, we have to understand how Python reads hexadecimal numbers.

In tokenizer.c's huge tok_get function, the tokenizer:

Finds the first 0x.
Keeps reading the following characters as long as they are hexadecimal digits (0-9, a-f).

The resulting token is 0xf, because "o" is not a hexadecimal digit. It eventually gets passed to the PEG parser, which converts it to the integer 15 (see Appendix A).

We still have to parse the rest of the code, or x in (1, 2, 3)], which leaves us with the following code:

[15 or x in (1, 2, 3)]
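To see the split for yourself, here is a minimal sketch using the standard tokenize module. It only shows where the token boundary falls; it is not the C tokenizer discussed above, and newer Python versions may emit a warning about a keyword glued to a numeric literal, which the sketch silences.

```python
import io
import tokenize
import warnings

source = "0xfor x in (1, 2, 3)"

with warnings.catch_warnings():
    # Newer Python versions may warn about a keyword following a number
    # without whitespace; that is irrelevant to this demonstration.
    warnings.simplefilter("ignore")
    tokens = [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    ]

# Print the first few tokens: the number and the keyword come out separately.
for name, text in tokens[:4]:
    print(name, repr(text))
```

The first token is the number 0xf and the second is or (tokenize reports keywords as NAME tokens).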

Operator precedence

Because in has higher operator precedence than or, x in (1, 2, 3) is grouped as the right operand of or, and we might expect it to be evaluated first.

That would be a troublesome situation, as x doesn't exist and evaluating it would raise a NameError.
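A quick way to check the grouping is to compare syntax trees. This is only a sketch: it parses the expression with and without explicit parentheses and compares the dumps, without evaluating anything (so the undefined x is harmless here).

```python
import ast

# The expression as Python sees it after tokenizing, plus two explicit groupings.
implicit = ast.parse("15 or x in (1, 2, 3)", mode="eval")
explicit = ast.parse("15 or (x in (1, 2, 3))", mode="eval")
other = ast.parse("(15 or x) in (1, 2, 3)", mode="eval")

# Parentheses leave no trace in the AST, so equal dumps mean equal grouping.
same_as_explicit = ast.dump(implicit) == ast.dump(explicit)
same_as_other = ast.dump(implicit) == ast.dump(other)

print(same_as_explicit)  # grouping matches 15 or (x in ...)
print(same_as_other)     # grouping does not match (15 or x) in ...
```

Note that precedence only decides how the operands are grouped. The order in which they are evaluated is still left to right.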

`or` is lazy

Fortunately, Python supports short-circuit evaluation, and or is a lazy operator: if the left operand is truthy, Python won't bother evaluating the right operand.

We can see how Python parses the expression using the ast module:

    import ast

    parsed = ast.parse('0xfor x in (1, 2, 3)', mode='eval')
    print(ast.dump(parsed, indent=4))  # indent= needs Python 3.9+

Output:


    Expression(
        body=BoolOp(
            op=Or(),
            values=[
                Constant(value=15),   # <-- Truthy value, so the next operand won't be evaluated.
                Compare(
                    left=Name(id='x', ctx=Load()),
                    ops=[In()],
                    comparators=[
                        Tuple(elts=[Constant(value=1), Constant(value=2), Constant(value=3)], ctx=Load())
                    ]
                )
            ]
        )
    )

So the final expression is equal to [15].
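The laziness itself is easy to observe at runtime. A minimal sketch, where right_operand and not_defined_anywhere are hypothetical names made up for the demonstration:

```python
calls = []

def right_operand():
    # Record that the right-hand side was actually evaluated.
    calls.append("evaluated")
    return "right"

# Truthy left operand: or returns it and never calls right_operand().
result_truthy = 15 or right_operand()
calls_after_truthy = list(calls)

# Falsy left operand: or has to evaluate the right-hand side.
result_falsy = 0 or right_operand()

# With a falsy left operand, an undefined name on the right is finally looked up.
try:
    0 or not_defined_anywhere in (1, 2, 3)
except NameError:
    raised = True
else:
    raised = False

print(result_truthy, calls_after_truthy, result_falsy, raised)
```

So the NameError is only avoided because 15 is truthy. Swap the left operand for 0 and the error shows up.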

Appendix A: The PEG parser

In pegen.c's parsenumber_raw function, we can find how Python treats a leading zero:

    if (s[0] == '0') {
        x = (long)PyOS_strtoul(s, (char **)&end, 0);
        if (x < 0 && errno == 0) {
            return PyLong_FromString(s, (char **)0, 0);
        }
    }

PyOS_strtoul is in Python/mystrtoul.c.

Inside mystrtoul.c, the parser looks at the character right after the 0x. If it's a hexadecimal digit, Python sets the base of the number to 16:

            if (*str == 'x' || *str == 'X') {
                /* there must be at least one digit after 0x */
                if (_PyLong_DigitValue[Py_CHARMASK(str[1])] >= 16) {
                    if (ptr)
                        *ptr = (char *)str;
                    return 0;
                }
                ++str;
                base = 16;
            } ...

Then it parses the rest of the number for as long as each character is a valid digit in that base (0-9 and a-f for base 16):

    while ((c = _PyLong_DigitValue[Py_CHARMASK(*str)]) < base) {
        if (ovlimit > 0) /* no overflow check required */
            result = result * base + c;
        ...
        ++str;
        --ovlimit;
    }

Eventually, it sets the pointer to the position where scanning stopped, which is one character past the last hexadecimal digit:

    if (ptr)
        *ptr = (char *)str;
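For readers who don't read C, the scanning loop above can be sketched roughly in Python. This is a simplified illustration only, not CPython's code: scan_hex_prefix is a hypothetical helper, and it skips details such as overflow checks.

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f, A-F

def scan_hex_prefix(text):
    """Hypothetical helper: return (value, rest) for text starting with 0x."""
    # There must be at least one hex digit after the 0x prefix.
    if text[:2].lower() != "0x" or len(text) < 3 or text[2] not in HEX_DIGITS:
        raise ValueError("expected 0x followed by at least one hex digit")
    pos = 2
    value = 0
    # Consume characters while they are valid base-16 digits.
    while pos < len(text) and text[pos] in HEX_DIGITS:
        value = value * 16 + int(text[pos], 16)
        pos += 1
    # pos now sits one character past the last hex digit, like *ptr in the C code.
    return value, text[pos:]

value, rest = scan_hex_prefix("0xfor x in (1, 2, 3)")
print(value, repr(rest))
```

The scan stops at the "o", so the number is 15 and everything from or onwards is left for the following tokens.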

Thanks

CSI_Tech_Dept from reddit for referring me to the correct section in the tokenizer.c file.
The original Tweet.

Across answered 13/4, 2021 at 22:12 Comment(3)

Sometimes I think Python was never intended to be a real product. 670 lines of tokenizing in a single method? Who wants to maintain that? – Kan 14/4, 2021 at 7:46

@defalt What space are you talking about? There is no space between 0x and f in the line being asked about. – Uniaxial 14/4, 2021 at 17:52

@ThomasWeller A) for a tokenizer, that's not bad. B) Python is not a "product", real or otherwise, and indeed wasn't intended as one. It started life as a teaching language. – Hypochondriasis 14/4, 2021 at 18:1

Other answers already tell what exactly happens. But for me, the interesting part was that the operator is recognized even without whitespace between the number and it. Actually, my first thought was "Wow, Python has a weird parser".

But before judging too harshly, maybe I should ask my other friends what they think:

Perl:

$ perl -le 'print(0xfor 3)'
15

Lua:

$ lua5.3 -e 'print(0xfor 4)'
15

Awk doesn't have or, but it has in:

$ awk 'BEGIN { a[15]=1; print(0x0fin a); }'
1

Ruby? (I don't really know it, but let's guess):

$ ruby -e 'puts 0x0for 5'
15

Yep, FWIW, Python is not alone: all of those other script-type languages also recognize the alphabetic operators even if stuck immediately to the back of a numeric constant.

Jackleg answered 14/4, 2021 at 16:14 Comment(1)

If you use bash or zsh, you can also try this: echo $(( 34#0xfor -15 )) ― This is different from the other cases, though, because there is no hidden or operator here. – Somewhat 15/4, 2021 at 12:16

As others have explained, it’s just the hexadecimal number 0xf followed by the operator or. Operators generally don’t need surrounding spaces, unless necessary to avoid ambiguity. In this case, the letter o cannot be part of a hexadecimal number, so there is no ambiguity. See the section on whitespace in the Python language reference.

The rest of the line is not evaluated because of short-circuit evaluation, although it is parsed and compiled, of course.

Using that same “trick” you can write similarly obfuscated Python code that doesn’t throw exceptions, for example:

>>> 0xbin b'in'
False
>>> 0xbis 1000
False
>>> 0b1and 0b1is 0b00
False
>>> 0o1if 0b1else Oy1then
1

Somewhat answered 14/4, 2021 at 14:31 Comment(0)
