Python Interpreter String Pooling Optimization [duplicate]
After seeing this question and its duplicate, one question still remained for me.

I understand what is and == do, and why, if I run

a = "ab"
b = "ab"

a == b

I get True. The question here would be WHY this happens:

a = "ab"
b = "ab"
a is b # Returns True

So I did my research and found this. The answer says the Python interpreter uses string pooling (interning): if it sees that a new string is equal to an existing one, it can reuse the existing object instead of creating another, so both names end up with the same id.

So far, everything makes sense. My real question is why this pooling only happens for some strings. Here is an example:

a = "ab"
b = "ab"
a is b # Returns True, as expected knowing Interpreter uses string pooling

a = "a_b"
b = "a_b"
a is b # Returns True, again, as expected knowing Interpreter uses string pooling

a = "a b"
b = "a b"
a is b # Returns False, why??

a = "a-b"
b = "a-b"
a is b # Returns False, WHY??

So it seems that string pooling doesn't apply to strings containing certain characters. I used Python 2.7.6 for these examples, so I thought this might have been fixed in Python 3. But after trying the same examples in Python 3, I get the same results.

Question: Why isn't string pooling applied in these examples? Wouldn't it be better for Python to optimize these cases as well?


Edit: If I run "a b" is "a b", it returns True. The question is why, when using variables, it returns False for some characters but True for others.
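One way to probe the difference described in the edit is to compare identities when both assignments are compiled together versus compiled separately (roughly what happens when each line is typed into the interactive prompt on its own). This is only a sketch: the helper names identity_same_unit and identity_separate_units are made up for illustration, and the results depend on the interpreter and version, so no particular output is claimed.

```python
def identity_same_unit(literal):
    # Both assignments are compiled together as a single code object.
    ns = {}
    exec("a = {0!r}\nb = {0!r}\nsame = a is b".format(literal), ns)
    return ns["same"]


def identity_separate_units(literal):
    # Each assignment is compiled on its own, similar to entering
    # the two lines one at a time at the interactive prompt.
    ns = {}
    exec("a = {0!r}".format(literal), ns)
    exec("b = {0!r}".format(literal), ns)
    return ns["a"] is ns["b"]


for literal in ("ab", "a_b", "a b", "a-b"):
    # Columns: the literal, identity within one unit, identity across units.
    print(repr(literal), identity_same_unit(literal), identity_separate_units(literal))
```

Whatever the identity checks print, a == b stays True in every case, which is why == is the comparison to rely on.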

Gaillardia answered 21/2, 2017 at 9:55 Comment(9)
python 3.4.4 windows: >>> a = "a-b";b = "a-b" >>> a is b True - Horrify
@Jean-FrançoisFabre python3.4.3 on Ubuntu returns False - Gaillardia
python 3.5 windows 'a b' is 'a b' evaluates to True - Gus
@JacquesdeHooge try it with variables. I get True with your example but False when assigning variables - Gaillardia
you have your answer: it's implementation dependent, you shouldn't rely on that. - Horrify
@CarlesMitjans Indeed, with variables I get False with Python 3.5 on Windows. - Gus
@Jean-FrançoisFabre from a normal user's point of view I'm happy knowing that. But for future python updates, wouldn't it be better to optimize it? - Gaillardia
The internals of Python string interning - Airy
Simple answer is that you shouldn't rely on is for comparing equality, use ==. This is a CPython implementation detail, so you shouldn't rely on it; it can even break in future CPython releases and might not work in other interpreters at all. Another example of this is that complex number literals are cached in PyPy but not in CPython. - That

Your question is a duplicate of a more general question, "When does python choose to intern a string". The correct answer to that question is that string interning is implementation-specific.

Interning of strings in CPython 2.7.7 is described very well in this article: The internals of Python string interning. The information there is enough to explain your examples.

The reason that the strings "ab" and "a_b" are interned, whereas "a b" and "a-b" aren't, is that the former look like Python identifiers and the latter don't.

Naturally, interning every single string would incur a runtime cost. Therefore the interpreter must decide whether a given string is worth interning. Since the names of identifiers used in a Python program are embedded in the program's bytecode as strings, identifier-like strings have a higher chance of benefiting from interning.
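When a program does need identical objects for equal strings, it can ask for interning explicitly with sys.intern (Python 3; in Python 2 the builtin is intern). The sketch below builds the strings at runtime so that the compiler cannot merge them as constants; the helper name build is made up for illustration.

```python
import sys


def build(parts):
    # Join at runtime so the result is a fresh string object,
    # not a constant that the compiler could share.
    return "-".join(parts)


a = build(["a", "b"])
b = build(["a", "b"])

equal = a == b           # value comparison: True
same_before = a is b     # identity: implementation-dependent, do not rely on it

ia = sys.intern(a)
ib = sys.intern(b)
same_after = ia is ib    # equal strings passed through sys.intern share one object

print(equal, same_before, same_after)
```

Only the last comparison is backed by documented behavior; the plain a is b result is an implementation detail.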

A short excerpt from the above article:

The function all_name_chars rules out strings that are not composed of ascii letters, digits or underscores, i.e. strings looking like identifiers:

#define NAME_CHARS \
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

/* all_name_chars(s): true iff all chars in s are valid NAME_CHARS */

static int
all_name_chars(unsigned char *s)
{
    static char ok_name_char[256];
    static unsigned char *name_chars = (unsigned char *)NAME_CHARS;

    if (ok_name_char[*name_chars] == 0) {
        unsigned char *p;
        for (p = name_chars; *p; p++)
            ok_name_char[*p] = 1;
    }
    while (*s) {
        if (ok_name_char[*s++] == 0)
            return 0;
    }
    return 1;
}

With all these explanations in mind, we now understand why 'foo!' is 'foo!' evaluates to False whereas 'foo' is 'foo' evaluates to True.
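The check in the excerpt can be mirrored in a few lines of Python to see which of the question's strings pass it. This is an illustrative port, not CPython's own code, and it reflects the 2.7-era rule described in the article; other versions and interpreters may decide differently.

```python
import string

# Same character set as NAME_CHARS in the C excerpt.
NAME_CHARS = frozenset(string.digits + string.ascii_letters + "_")


def all_name_chars(s):
    """Return True if every character of s is an ASCII letter, digit or underscore."""
    return all(ch in NAME_CHARS for ch in s)


for s in ("ab", "a_b", "a b", "a-b"):
    print(repr(s), all_name_chars(s))
# 'ab' True
# 'a_b' True
# 'a b' False
# 'a-b' False
```

The two strings that fail the check are exactly the ones for which the question observed a is b returning False.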

Airy answered 21/2, 2017 at 10:18 Comment(3)
The article you link to is great, but it's about Python 2.7, which is important since these are implementation details. Anyway I think this is a dupe: #10622972 - Biocellate
@Biocellate Agree. I updated the answer and added my vote to close the question as a dupe. - Airy
it's a dupe but your explanation is kind of new and makes sense. - Horrify
