Let's say you have this code to build up a string from three strings:
x = 'foo'
x += 'bar' # 'foobar'
x += 'baz' # 'foobarbaz'
In this case, Python first needs to allocate and create 'foobar' before it can allocate and create 'foobarbaz'.
So, for each += that gets called, the entire contents of the string and whatever is getting added to it need to be copied into an entirely new memory buffer. In other words, if you have N strings to be joined, you need to allocate approximately N temporary strings; the first substring gets copied ~N times, the last substring only gets copied once, and on average each substring gets copied ~N/2 times.
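To see the difference in practice, a rough benchmark along these lines can be used (the function names are just illustrative, and on CPython the measured gap may be smaller than the copying argument alone suggests, because of the corner-case optimization mentioned below):

import timeit

def concat_plus(parts):
    # Repeated +=: the accumulated string may be copied again and again.
    s = ''
    for p in parts:
        s += p
    return s

def concat_join(parts):
    # Single pass: the result is sized up front and each piece is copied once.
    return ''.join(parts)

parts = ['abc'] * 10_000
print(timeit.timeit(lambda: concat_plus(parts), number=100))
print(timeit.timeit(lambda: concat_join(parts), number=100))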
With .join, Python can play a number of tricks, since the intermediate strings do not need to be created: CPython figures out how much memory it needs up front, allocates a correctly-sized buffer, and then copies each piece into that buffer, which means each piece is copied only once.
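As an illustration only (not CPython's actual C code), the "size it up front, copy each piece once" strategy looks roughly like this hypothetical join_like helper, assuming ASCII text for simplicity:

def join_like(parts):
    # Illustrative model only; the real str.join does this at the C level.
    # First pass: compute the final size so a single buffer can be allocated.
    total = sum(len(p) for p in parts)
    buf = bytearray(total)
    pos = 0
    # Second pass: copy each piece into the buffer exactly once.
    for p in parts:
        data = p.encode('ascii')    # ASCII assumed so char count == byte count
        buf[pos:pos + len(data)] = data
        pos += len(data)
    return buf.decode('ascii')

print(join_like(['foo', 'bar', 'baz']))    # foobarbaz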
There are other viable approaches which could lead to better performance for += in some cases, e.g. if the internal string representation is actually a rope, or if the runtime is actually smart enough to somehow figure out that the temporary strings are of no use to the program and optimize them away.
However, CPython certainly does not do these optimizations reliably (though it may for a few corner cases), and since it is the most common implementation in use, many best practices are based on what works well for CPython. Having a standardized set of norms also makes it easier for other implementations to focus their optimization efforts.
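To make the rope idea above concrete, here is a toy sketch (purely illustrative; CPython's str is not a rope): concatenation only builds a tree node, and the characters are copied once when the rope is finally flattened.

class Rope:
    # Toy rope: + builds a tree node in O(1); copying is deferred to flatten().
    def __init__(self, left, right=None):
        self.left, self.right = left, right

    def __add__(self, other):
        return Rope(self, other)        # no character copying here

    def flatten(self):
        out = []
        def walk(node):
            if isinstance(node, Rope):
                walk(node.left)
                if node.right is not None:
                    walk(node.right)
            else:
                out.append(node)        # plain str leaf
        walk(self)
        return ''.join(out)             # each leaf is copied exactly once, here

r = Rope('foo') + 'bar' + 'baz'
print(r.flatten())                      # foobarbaz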
+= behaves differently for strings and integers, so it is possible that Python takes more time figuring out which type of data += is operating on, i.e. addition if they are integers and concatenation if they are strings. A ' '.join() operation, on the other hand, expects string elements only, which means Python does not have to worry about the type of data it is dealing with. – Analyzer
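A small example of the typing difference being described (added here for illustration, not part of the original comment):

total = 0
total += 1                      # += on ints means numeric addition
s = 'a'
s += 'b'                        # += on strings means concatenation

print(' '.join(['1', '2']))     # fine: every element is already a str
try:
    ' '.join([1, 2])            # join rejects non-str elements outright
except TypeError as err:
    print(err)                  # e.g. "sequence item 0: expected str instance, int found"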
The question is about += concatenation with strings. The exact reason += may be O(N) in Python is not exactly the same reason strcat is O(N) in C, but it's similar. – Yarrow

It's more like a = asprintf("%s%s", a, b), except also with garbage collection (since something else may or may not hold a reference to the original a). The optimizations that are mentioned, I suspect, have to do with the fact that the interpreter knows when nothing else holds a reference to the original and can do some trickery in that case: allocate a bit more space than needed and reuse the same string, since nothing else is using it and it's about to be freed. – Brodsky
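One way to peek at that refcount trickery on CPython is a check like the hypothetical inplace_demo below; the outcome is an implementation detail that depends on the version and allocator, so treat it as a curiosity rather than a guarantee:

def inplace_demo():
    s = ''.join(['x'] * 16)     # a fresh str that only the local name s refers to
    before = id(s)
    s += 'y'                    # with no other references, CPython may resize in place
    return id(s) == before      # often True on CPython, never guaranteed by the language

print(inplace_demo())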
Note (%timeit confirms this) that concatenating two strings with + is much faster than using join. – Purl

If you have texts of unknown size, possibly tens/hundreds/thousands of them, and you want to concatenate them, in such cases ''.join(texts) will be faster than a loop over texts using +=. – Zigzagger

For two strings, + is indeed faster. – Purl

For many strings, join is the better choice. – Zigzagger

Consider [x+y for x,y in zip(lst1, lst2)]. And here, again, + is faster. – Purl
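A quick check of the two-strings-versus-many-strings point from this thread (absolute numbers depend on machine and Python version; the variable names are only illustrative):

import timeit

a, b = 'foo' * 100, 'bar' * 100

# Exactly two strings: + skips building a list and a method call,
# so it usually comes out ahead here.
print(timeit.timeit(lambda: a + b))
print(timeit.timeit(lambda: ''.join([a, b])))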