About the changing id of an immutable string

Something about the id of objects of type str (in Python 2.7) puzzles me. The str type is immutable, so I would expect that once a string is created, it will always have the same id. I am probably not phrasing this well, so here is an example input and output sequence instead.

>>> id('so')
140614155123888
>>> id('so')
140614155123848
>>> id('so')
140614155123808

So in the meantime, it changes every time. However, once a variable points at that string, things change:

>>> so = 'so'
>>> id('so')
140614155123728
>>> so = 'so'
>>> id(so)
140614155123728
>>> not_so = 'so'
>>> id(not_so)
140614155123728

So it looks like the id is frozen once a variable holds that value. Indeed, after del so and del not_so, the output of id('so') starts changing again.

This is not the same behaviour as with (small) integers.

I know there is no real connection between immutability and having the same id; still, I am trying to figure out the source of this behaviour. I believe that someone who is familiar with Python's internals would be less surprised than I am, so I am trying to reach the same point...

Update

Trying the same with a different string gave different results...

>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384

Now it is equal...

Salley answered 16/6, 2014 at 13:50 Comment(3)
Python does not intern strings by default. A lot of Python internal code does explicitly intern string values (attribute names, identifiers, etc.) but that doesn't extend to arbitrary strings. - Hagiographa
Instead, Python is free to reuse memory slots. You need to create objects with a longer lifetime. - Hagiographa
@Salley "once a variable holds that value": is this statement correct in Python? Read this. - Prudential
H
85

CPython does not promise to intern all strings by default, but in practice, a lot of places in the Python codebase do reuse already-created string objects. A lot of Python internals use (the C-equivalent of) the sys.intern() function call to explicitly intern Python strings, but unless you hit one of those special cases, two identical Python string literals will produce different strings.
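As a rough illustration of explicit interning, here is a minimal sketch for Python 3 on CPython (on Python 2 the equivalent is the built-in intern()). The variable names are made up for this example, and the strings are built at run time on purpose so the compiler cannot share a single constant between them.

```python
import sys

# Build two equal strings at run time so the compiler cannot share one constant.
words = ["Look ma,", "spaces", "and", "punctuation!"]
first = " ".join(words)
second = " ".join(words)

equal_values = first == second          # same characters
distinct_objects = first is not second  # two separate objects on CPython

# sys.intern() returns the canonical object for that value from the intern table.
first_interned = sys.intern(first)
second_interned = sys.intern(second)
shared_after_intern = first_interned is second_interned

print(equal_values, distinct_objects, shared_after_intern)
```

The point of the sketch is only that equal values do not imply a shared object unless something interns them.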

Python is also free to reuse memory locations, and Python will also optimize immutable literals by storing them once, at compile time, with the bytecode in code objects. The Python REPL (interactive interpreter) also stores the most recent expression result in the _ name, which muddles up things some more.

As such, you will see the same id crop up from time to time.

Running just the line id(<string literal>) in the REPL goes through several steps:

  1. The line is compiled, which includes creating a constant for the string object:

    >>> compile("id('foo')", '<stdin>', 'single').co_consts
    ('foo', None)
    

    This shows the constants stored with the compiled bytecode; in this case a string 'foo' and the None singleton. Simple expressions consisting of constants that produce an immutable value may be optimised at this stage; see the note on optimizers below.

  2. On execution, the string is loaded from the code constants, and id() returns the memory location. The resulting int value is bound to _, as well as printed:

    >>> import dis
    >>> dis.dis(compile("id('foo')", '<stdin>', 'single'))
      1           0 LOAD_NAME                0 (id)
                  3 LOAD_CONST               0 ('foo')
                  6 CALL_FUNCTION            1
                  9 PRINT_EXPR          
                 10 LOAD_CONST               1 (None)
                 13 RETURN_VALUE        
    
  3. The code object is not referenced by anything, so its reference count drops to 0 and the code object is deleted. As a consequence, so is the string object.

Python can then perhaps reuse the same memory location for a new string object, if you re-run the same code. This usually leads to the same memory address being printed if you repeat this code. This does depend on what else you do with your Python memory.

ID reuse is not predictable; if in the meantime the garbage collector runs to clear circular references, other memory could be freed and you'll get new memory addresses.
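The following sketch (Python 3, CPython assumed) tries to make the lifetime point concrete: ids are only guaranteed to be distinct for objects that are alive at the same time. The helper make_string is a hypothetical name introduced for this example, and whether the freed address is actually reused depends on the allocator, so the code records the result rather than assuming it.

```python
def make_string(n):
    # Formatting at run time creates a fresh string object on each call.
    return "value-%d" % n

first = make_string(1)
second = make_string(2)

# Both objects are alive here, so their ids must differ.
live_ids_differ = id(first) != id(second)

first_id = id(first)
del first  # the last reference is gone; CPython frees the object right away

third = make_string(3)
# The new object may or may not land on the freed address; no guarantee either way.
maybe_reused = id(third) == first_id

print(live_ids_differ, maybe_reused)
```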

Next, the Python compiler will also intern any Python string stored as a constant, provided it looks enough like a valid identifier. The Python code object factory function PyCode_New will intern any string object that contains only ASCII letters, digits or underscores, by calling intern_string_constants(). This function recurses through the constants structures and for any string object v found there executes:

if (all_name_chars(v)) {
    PyObject *w = v;
    PyUnicode_InternInPlace(&v);
    if (w != v) {
        PyTuple_SET_ITEM(tuple, i, v);
        modified = 1;
    }
}

where all_name_chars() is documented as

/* all_name_chars(s): true iff s matches [a-zA-Z0-9_]* */

Since you created strings that fit that criterion, they are interned, which is why you see the same ID being used for the 'so' string in your second test: as long as a reference to the interned version survives, interning will cause future 'so' literals to reuse the interned string object, even in new code blocks and bound to different identifiers. In your first test, you don't save a reference to the string, so the interned strings are discarded before they can be reused.
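The effect described above can be checked with a small sketch, assuming Python 3 on CPython. The helper first_string_constant and the literal 'identifier_like_value' are made up for this example. Two separately compiled code objects end up holding the very same string object, as long as a reference to the interned string is kept alive.

```python
def first_string_constant(source):
    """Compile source and return the first str constant stored in its code object."""
    code = compile(source, "<example>", "exec")
    return next(c for c in code.co_consts if isinstance(c, str))

# Two independent compilations, each with its own code object.
one = first_string_constant("x = 'identifier_like_value'")
two = first_string_constant("y = 'identifier_like_value'")

# 'one' keeps the interned string alive, so the second compile reuses it.
shared = one is two
print(shared)
```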

Incidentally, your new name so = 'so' binds a string to a name that contains the same characters. In other words, you are creating a global whose name and value are equal. As Python interns both identifiers and qualifying constants, you end up using the same string object for both the identifier and its value:

>>> compile("so = 'so'", '<stdin>', 'single').co_names[0] is compile("so = 'so'", '<stdin>', 'single').co_consts[0]
True

If you create strings that are either not code object constants, or contain characters outside of the letters + numbers + underscore range, you'll see the id() value not being reused:

>>> some_var = 'Look ma, spaces and punctuation!'
>>> some_other_var = 'Look ma, spaces and punctuation!'
>>> id(some_var)
4493058384
>>> id(some_other_var)
4493058456
>>> foo = 'Concatenating_' + 'also_helps_if_long_enough'
>>> bar = 'Concatenating_' + 'also_helps_if_long_enough'
>>> foo is bar
False
>>> foo == bar
True

The Python compiler uses either the peephole optimizer (Python versions < 3.7) or the more capable AST optimizer (3.7 and newer) to pre-calculate (fold) the results of simple expressions involving constants. The peephole optimizer limits its output to a sequence of length 20 or less (to prevent bloating code objects and memory use), while the AST optimizer uses a separate limit of 4096 characters for strings. This means that concatenating shorter strings consisting only of name characters can still lead to interned strings if the resulting string fits within the optimizer limits of your current Python version.

E.g. on Python 3.7, 'foo' * 20 will result in a single interned string, because constant folding turns this into a single value, while on Python 3.6 or older only 'foo' * 6 would be folded:

>>> import dis, sys
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
>>> dis.dis("'foo' * 20")
  1           0 LOAD_CONST               0 ('foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo')
              2 RETURN_VALUE

and on Python 3.6:

>>> dis.dis("'foo' * 6")
  1           0 LOAD_CONST               2 ('foofoofoofoofoofoo')
              2 RETURN_VALUE
>>> dis.dis("'foo' * 7")
  1           0 LOAD_CONST               0 ('foo')
              2 LOAD_CONST               1 (7)
              4 BINARY_MULTIPLY
              6 RETURN_VALUE
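A short sketch of the difference between folded constants and run-time concatenation, assuming Python 3 on CPython. The names folded and prefix are illustrative only. A product of two literals is folded into one stored constant, while concatenation that involves a variable happens at run time and yields a new object each time.

```python
# A product of literals is folded at compile time into a single constant.
folded = compile("x = 'foo' * 3", "<example>", "exec")
has_folded_constant = "foofoofoo" in folded.co_consts

# Concatenation through a name cannot be folded; it runs each time.
prefix = "Concatenating_"
foo = prefix + "also_helps_if_long_enough"
bar = prefix + "also_helps_if_long_enough"

same_value = foo == bar
same_object = foo is bar  # separate objects on CPython

print(has_folded_constant, same_value, same_object)
```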
Hagiographa answered 16/6, 2014 at 14:0 Comment(7)
I am not sure I fully understand what interning is in this sense, but I guess I'll have to read a little bit about it; thanks. - Salley
@Bach: Interning is the act of re-using a string object if it was already created once before with the same value. - Hagiographa
The vast amount of knowledge that @MartijnPieters has of the Python world is baffling to me ^^ - Corrigible
@MartijnPieters does Python have some sort of pool where it holds the interned strings during an active process? And is there an algorithm by which it decides when to read and write in that pool (strings of a particular length maybe, etc.)? Sort of like Java, where AFAIK at each string creation it checks the pool to see whether it exists or not; if it does, it returns the reference, if not, it creates it and adds it there. I know that it might not be efficient in Python to do that because you'd waste runtime here, not compile time, so it wouldn't be that attractive to do that for all strs. - Corrigible
@MariusMucenicu yes, my answer describes the algorithm in general terms. You can also grep for calls to the PyUnicode_InternInPlace and PyUnicode_InternFromString functions in the Python source code to see where Python is interning strings (e.g. search on GitHub for either function). - Hagiographa
@MariusMucenicu the pool itself is a dict object defined in the unicodeobject.c source code, which also hosts the API functions for interning (except for one macro). - Hagiographa
Amazing! Thanks @MartijnPieters! Upvoted everything. - Corrigible
S
4

This behavior is specific to the Python interactive shell. If I put the following in a .py file:

print id('so')
print id('so')
print id('so')

and execute it, I receive the following output:

2888960
2888960
2888960

In CPython, a string literal is treated as a constant, which we can see in the bytecode of the snippet above:

  2           0 LOAD_GLOBAL              0 (id)
              3 LOAD_CONST               1 ('so')
              6 CALL_FUNCTION            1
              9 PRINT_ITEM          
             10 PRINT_NEWLINE       

  3          11 LOAD_GLOBAL              0 (id)
             14 LOAD_CONST               1 ('so')
             17 CALL_FUNCTION            1
             20 PRINT_ITEM          
             21 PRINT_NEWLINE       

  4          22 LOAD_GLOBAL              0 (id)
             25 LOAD_CONST               1 ('so')
             28 CALL_FUNCTION            1
             31 PRINT_ITEM          
             32 PRINT_NEWLINE       
             33 LOAD_CONST               0 (None)
             36 RETURN_VALUE  

The same constant (i.e. the same string object) is loaded 3 times, so the IDs are the same.
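The same check can be sketched without reading the disassembly, here in Python 3 syntax on CPython. The source string and the names namespace and ids are assumptions of this example. The code object stores the literal once, so every id() call in that code object sees the same object.

```python
# Three uses of the same literal inside one code object.
source = "ids = [id('so'), id('so'), id('so')]"
code = compile(source, "<example>", "exec")

# The literal appears only once among the stored constants.
so_constants = [c for c in code.co_consts if c == "so"]

# Run the code and collect the ids it computed.
namespace = {}
exec(code, namespace)
unique_ids = set(namespace["ids"])

print(len(so_constants), len(unique_ids))
```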

Stroy answered 16/6, 2014 at 13:56 Comment(3)
@Salley I mean the Python interactive shell. - Stroy
Same here; perhaps the "compiler" of Python does some magic to avoid allocating memory for more than one instance of the same string here? - Salley
@Salley Yes, the literal string 'so' is stored as a single constant, so every time you use it that same constant is loaded, which avoids having to create a new string each time. - Stroy
H
1

In your first example a new instance of the string 'so' is created each time, hence the different id.

In the second example you are binding the string to a variable and Python can then maintain a shared copy of the string.

Honoria answered 16/6, 2014 at 13:53 Comment(2)
The OP is rebinding the string object. - Hagiographa
Your explanation is flawed; the second example binds new string literals to the same name, as well as to a different name. so is rebound, then not_so is rebound. This is not the same string object. - Hagiographa
C
1

A simpler way to understand the behaviour is to read the following: Data Types and Variables.

Section "A String Pecularity" illustrates your question using special characters as example.

Cannady answered 5/2, 2015 at 5:29 Comment(0)
M
0

So while Python is not guaranteed to intern strings, it will frequently reuse the same string, and this may mislead. It's important to know that you shouldn't use id or is to test strings for equality.

To demonstrate this, here is one way I've discovered to force a new string, in Python 2.6 at least:

>>> so = 'so'
>>> new_so = '{0}'.format(so)
>>> so is new_so 
False

and here's a bit more Python exploration:

>>> id(so)
102596064
>>> id(new_so)
259679968
>>> so == new_so
True
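For Python 3, a comparable sketch on CPython follows. It swaps in str.join to build the second string at run time, which is an assumption of this example rather than the method used above, since whether format() hands back a new object is an implementation detail.

```python
so = "so"

# Assemble an equal string at run time; CPython creates a new object here.
new_so = "".join(["s", "o"])

same_value = so == new_so   # compares contents: the right test for strings
same_object = so is new_so  # compares identity: an implementation detail

print(same_value, same_object)
```

Comparing with == stays correct no matter how the interpreter happens to share or duplicate string objects.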
Micaelamicah answered 16/6, 2014 at 13:58 Comment(1)
@Salley Would you say it answers the question now? - Micaelamicah
