If Python strings are immutable, why does it keep the same id if I use += to append to it?
Asked Answered
O

4

102

Strings in Python are immutable, which means the value cannot be changed. However, when appending to the string in the following example, it looks like the original string memory is modified since the id remains the same:

>>> s = 'String'
>>> for i in range(5, 0, -1):
...     s += str(i)
...     print(f"{s:<11} stored at {id(s)}")
... 
String5     stored at 139841228476848
String54    stored at 139841228476848
String543   stored at 139841228476848
String5432  stored at 139841228476848
String54321 stored at 139841228476848

Conversely, in the following example, the id changes:

>>> a = "hello"
>>> id(a)
139841228475760
>>> a = "b" + a[1:]
>>> print(a)
bello
>>> id(a)
139841228475312
Octopod answered 10/6, 2022 at 3:8 Comment(28)
Who said that str is immutable?Chanterelle
@MohamadGhaithAlzin: The docs, for one: "Strings are immutable sequences of Unicode code points."Insubordinate
The optimization you're seeing here breaks string immutability. The devs either didn't think about it or decided the minor violation of immutability was worth it to stop people from constantly complaining about the performance of this kind of loop.Insubordinate
The standard wisdom is that Python strings are immutable. You can't change a string's value, only the reference to the string. continue reading hereChanterelle
@user2357112: They thought about it, and they only do it when the mutability is undetectable by anything other than silly id checks of this sort. If you add s2 = s after the s += str(i) line, you'll see the id change all the time, because now that the str is visible through multiple aliases, they can't use the optimization.Dorfman
A string variable is actually an object. the string value of the object is stored in a different memory address. What you have printed is the id of the object, not the string itself. The string is actually being copied.Loathe
@chouyangv3: You're mistaken. CPython stores the core data of the string in what amounts to a flexible array member at the end of the struct (it can also store other copies of the data in separate arrays, but the canonical representation is always allocated inline, in the same allocation as the struct itself); if the string is actually copied to a new object, the id changes. The optimization in CPython sometimes lets it avoid that copy by reallocing in place when it can, if the mutation is not otherwise detectable.Dorfman
@Dorfman So it acts like golang slices? I'm not familiar with python, sorry.Loathe
@chouyangv3: You need to understand C to know what the CPython reference interpreter is doing here, specifically flexible array members (which were standardized in C99, but you could simulate them in any version of C either by putting a length 1 array at the end of a struct and choosing to allocate more than just sizeof(thestruct), or just by allocating extra and casting a pointer to the byte after the struct to the correct type; old str did the former, new str [with variable width characters] does the latter).Dorfman
@chouyangv3: You can see the description of the struct layout in the internal header defining the multiple structs that can implement str in modern Python.Dorfman
@Insubordinate Why do you say it breaks immutability? All we see is that the object afterwards has the same address that the object beforehand had. That doesn't mean that they're the same object.Marrakech
@KellyBundy: It actually does mean that. An object's id is temporally unique (while an object possesses a specific id, no other object can have the same id). s += str(i) for immutable types is defined to produce the concatenated value of s and str(i) first, then replace s after. Since s still exists when the concatenation occurs, the concatenated value would not be allowed to have the same id (because s already has it). CPython is cheating when it does this, but it's a harmless sort of cheat (improves performance, only observable effect otherwise is the id quirk).Dorfman
@Dorfman How do you know Python doesn't create a new object, then deletes the old one, then moves the new one to the old address?Marrakech
@KellyBundy: You can hypothesize all sorts of nuts things, but when there is zero reason to do it (it would be the opposite of the optimization CPython actually does, doubling the work instead of reducing it), you can rule insanity like that out. Plus, I've read the source code, for multiple releases of CPython; it's definitely doing that (the exact mechanics have changed over time, but it's all fundamentally doing the same optimization). :-)Dorfman
@Dorfman Yes, I also know it's doing that. But you're rather proving my point: You know because you've read the implementation. In my view, immutability is a Python concept, and it doesn't matter what CPython does internally. If you need to refer to the CPython implementation to argue that it's still the same object, but Python doesn't say it's the same object, that doesn't count.Marrakech
@KellyBundy: See my comments on Mark's answer below; I went back to the data model to verify. The language spec itself imposes an operation ordering guarantee and an id consistency guarantee, that, in combination, are definitely broken by this optimization (in the sense that it proves a violation of str immutability). Your "copy it back to the old strs memory" suggestion (which is itself based on a CPython optimization detail) isn't allowed, because by the language spec, both objects must briefly coexist, and the id of both must be constant throughout their lifetime.Dorfman
@Dorfman Looks convincing, yes. I don't think x = x.__add__(y) is allowed to drop the old reference before calling the function. Fun fact: if you do use s = s.__add__(t) instead of s += t, you get different ids, so they're not really equivalent.Marrakech
I guess this proves the value of keeping it short... (Same question but zero votes, maybe due to its off-putting length.)Marrakech
Does this answer your question? How is the s=s+c string concat optimization decided?Responsibility
@Responsibility just because something is related and older does not mean this should be closed as duplicate and readers should be diverted away from these excellent answers and to that answer instead. Duplicates can point forward in time as well, so if these are really duplicates then perhaps that should be closed as duplicate of this instead?Postprandial
@Postprandial Actually it is not just "related" , it is exactly the same question. If this is not the case where we should use the duplicate flag, then when is the proper time? These answers are of course fantastic I upvoted too, but if the question is closed as duplicate, still users can see these answers. They won't be deleted. I saw a lot of great questions/answers which received more than thousands of upvotes but closed as duplicate.Responsibility
@Responsibility so close that one not this one because these answers are much better. Nobody knows how answer visibility is impacted when questions are closed. Have there been analyses to see if views of well-received answers are reduced by closure as duplicate which try to control for other factors? But older does not equal better.Postprandial
@Postprandial meta.https://mcmap.net/q/212186/-delphi-quot-invalid-field-type-quot-error-with-master-detail-datasetsResponsibility
@Responsibility so instead of oldest answers, lets steer future readers to the best answers; after all, good answers to on-topic questions is what it's all about, right? (you could also propose merging them if you think the old answers help here)Postprandial
@Postprandial The linked question is not really old(9 months ago) and has more details in the question itself. It shows the performance comparison of this behavior as well so that makes the question even more exciting... Both questions received great answers from two knowledgeable guys I know here(That one is from Tim Peters). With all due respect, I have one vote for myself and to me duplicate is duplicate. It still requires two more votes if others think like me.Responsibility
@Postprandial This is the first time I hear about merging. Could you tell me how can I request for that? by flagging for the moderator?Responsibility
@Responsibility What is a "merged" question? (surprisingly while it links back to FAQ it doesn't appear there) It requires some moderator effort and decisions so they usually don't like to do it a lot. In SO: meta.stackoverflow.com/search?q=merging+questionsPostprandial
These two questions do not seem to be duplicates. This question here is about whether optimisation happens at all and its semantics – re-using ids is not in itself an indication that an immutable type is mutated, and commonly happens because the memory is simply re-used. The other question already starts at taking the optimisation for granted and asks how CPython specifically implements it. Neither does this question's answers solve the other question, nor do the other question's answers solve this question. They are interesting trivia wrt each other but not more.Buonomo
D
125

It's a CPython-specific optimization for the case when the str being appended to happens to have no other living references. The interpreter "cheats" in this case, allowing it to modify the existing string by reallocating (which can be in place, depending on heap layout) and appending the data directly, and often reducing the work significantly in loops that repeatedly concatenate (making it behave more like the amortized O(1) appends of a list rather than O(n) copy operations each time). It has no visible effect besides the unchanged id, so it's legal to do this (no one with an existing reference to a str ever sees it change unless the str was logically being replaced).

You're not actually supposed to rely on it (non-reference counted interpreters can't use this trick, since they can't know if the str has other references), per PEP8's very first programming recommendation:

Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).

For example, do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn’t present at all in implementations that don’t use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.

If you want to break the optimization, there are all sorts of ways to do so, e.g. changing your code to:

>>> while i!=0:
...     s_alias = s  # Gonna save off an alias here
...     s += str(i)
...     print(s + " stored at " + str(id(s)))
...     i -= 1
... 

breaks it by creating an alias, increasing the reference count and telling Python that the change would be visible somewhere other than s, so it can't apply it. Similarly, code like:

s = s + a + b

can't use it, because s + a occurs first, and produces a temporary that b must then be added to, rather than immediately replacing s, and the optimization is too brittle to try to handle that. Almost identical code like:

s += a + b

or:

s = s + (a + b)

restores the optimization by ensuring the final concatenation is always one where s is the left operand and the result is used to immediately replace s.

Dorfman answered 10/6, 2022 at 3:16 Comment(6)
Subtle way to avoid the optimization (at least sometimes :-)Marrakech
@KellyBundy: Yep, this is why PEP8 says not to rely on it. It's brittle. It's nice when it helps you, but if you have code that needs it for adequate performance, your code needs to be rewritten to use list + ''.join, or io.StringIO or whatever. (I deleted my old comments, you're welcome to do so)Dorfman
@KellyBundy why did it avoid optimization?According
@According Note that unlike the question's code, I print s, not a new string built from s. And the output is buffered. Apparently print or the "file" it writes to keeps a reference to s then. If I print with flush=True, I get the optimization again.Marrakech
@KellyBundy: It looks like it's due to how io.TextIOWrapper.write is implemented; the buffer at that layer is maintained as either the single thing written, or (when there's more than one buffered thing) a list of things that have been written. As an optimization, when the encoding of the file is ASCII-compatible and the str is ASCII, it takes a reference to the str directly rather than encoding it to bytes before putting it in the list.Dorfman
Change s += str(i) to s += chr(130 - i) and the write optimization disappears, enabling the str += optimization, at the moment the first non-ASCII character is appended, because TextIOWrapper.write is now proactively encoding the non-ASCII str to bytes, and dropping the reference to the original str before it returns.Dorfman
B
44

Regardless of implementation details, the docs say:

… Two objects with non-overlapping lifetimes may have the same id() value.

The previous object referenced by s no longer exists after the += so the new object breaks no rules by having the same id.

Bathhouse answered 10/6, 2022 at 4:16 Comment(8)
The lifetimes are defined to overlap though, per the language data model spec, where += is defined as one of x = x.__iadd__(y), x = x.__add__(y) or x = y.__radd__(x) (and immutable types are defined by not implementing the former). The delayed reassignment to x after the work completes is mandatory, officially. In all cases, the new object returned from the method must exist prior to the reassignment of x, so the actual spec requires both to exist at once, and therefore the ids must differ, at least briefly.Dorfman
Hmm... Just remembered "at least briefly" is being too lenient. ids are required to stay different, because the id of all objects is required to stay constant for its lifetime, so you can't perform shenanigans like giving the id of the original str to the new str.Dorfman
@Dorfman No user code can run during that brief overlap, so there's no way to detect the shared ID during that period. That's why this optimization is OK.Harveyharvie
@Harveyharvie Are you saying there are two separate objects with briefly shared ID?Marrakech
@Barmar: You may be right (excluding debuggers) in the design as of 3.11 (which limits the optimization to cases followed by STORE_FAST). It's weirder through 3.10 though (that is, the current release), as it applied to cases followed by STORE_DEREF or STORE_NAME, which could, in very specific circumstances involving threads sharing a closure variable, observe the clearing of the target, as the GIL could be taken by another thread between opcodes. and see the pre-cleared variable (they wouldn't see a shared id, because the spec was violated and it changed the target early).Dorfman
@KellyBundy I'm saying that even if they conceptually overlap, you can't detect it so the optimizer can ignore that requirement.Harveyharvie
The term "lifetime" doesn't have a formal definition in the Python docs, so although the comment about __iadd__ is interesting, it isn't quite right to say that the objects' lifetimes are defined to overlap. If the lifetime of the old string object is allowed to end before x no longer references that object, which would be perverse but not contradictory, then the lifetimes of the string objects don't overlap and they are allowed to have the same id. That said, I think it's simpler to describe the behaviour by saying that string objects are mutable but can never be observed to mutate.Grasmere
A more clear-cut example of Python strings being mutable is the fact that when you compute the hash of a string, the result is cached so that the interpreter won't have to compute the hash of the same string object again. The cache for a string's hash is a mutable member of the struct for that string, it's initialised as -1 and then overwritten when the hash is computed. So that struct is unquestionably mutable; it just can't be observed to mutate (unless you count time.perf_counter as a way of observing it).Grasmere
Z
4

If objects have non-overlapping lifetimes, their id values may be the same, but if variables have overlapping lifetimes so they must have different id values.

Zestful answered 15/6, 2022 at 6:57 Comment(0)
F
1

In C, Java, or some other programming languages, a variable is an identifier or a name, connected to a memory location.

e.g.

enter image description here

but in Python, a variable is considered a tag that is tied to some value. Python considers value as an object. and it can save memory and assign memory location with respect to value instead of variable.

enter image description here

Variables are the same but if values are different then it can assign new memory, in a closer look no variable can be referenced a=10 then it is removed by the garbage collector and a new value hold the variable in a new memory location.

enter image description here

In C, Java memory location is for variable but in Python, the memory location is for value which can be treated as an object.

Fanniefannin answered 30/6, 2022 at 10:3 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.