Python sum, why not strings? [closed]
Asked Answered
M

8

61

Python has a built in function sum, which is effectively equivalent to:

def sum2(iterable, start=0):
    return start + reduce(operator.add, iterable)

for all types of parameters except strings. It works for numbers and lists, for example:

 sum([1,2,3], 0) = sum2([1,2,3],0) = 6    #Note: 0 is the default value for start, but I include it for clarity
 sum({888:1}, 0) = sum2({888:1},0) = 888

Why were strings specially left out?

 sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
 sum2(['foo','bar'], '') = 'foobar'

I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.

Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, and no banning was there for, say, lists.

Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?

Mcgaw answered 19/8, 2010 at 19:13 Comment(16)
Because it does not make sense to "sum" strings.Fullmer
@NullUserException: it makes as much sense to "sum" strings as it is to "sum" lists.Mcgaw
@Muhammad Alkarouri: Summing a sequence sums the elements of the sequence. What are the elements of a string? Individual characters, which are -- still -- not numbers. You can't sum them. You could "concatenate" them, but -- interestingly -- they're already concatenated into the string.Madelenemadelin
@NullUserException: It would be great if you were right, but sadly the + operation on strings is already overloaded to mean concatentation. So with + we already construct string "sums".Chlortetracycline
@kaizer I know that; I think this owes to the fact that most languages uses + as the concatenation operator. Still doesn't make sense to sum strings.Fullmer
@S.Lott: I meant summing a sequence of lists compared to summing a sequence of strings. As it happens, "sum" of a list of lists concatenates the lists. You can sum two lists using + to concatenate them. You can sum two strings using + to concatenate them. So it makes as much sense to define sum as concatenation for strings as it is for lists. That is what I meant. Whether this is good or is bad is beside the question.Mcgaw
@Muhammad Alkarouri: "I meant summing a sequence of lists compared to summing a sequence of strings." That's doesn't seem to be clear in your question.Madelenemadelin
@S.Lott: read my question again. It is quite clear there. I said: "for all types of parameters except strings. It works for numbers and lists, for example." Which means that numbers and lists are parameters in much the same way strings are. How did you understand the comparison between sum and "".join?Mcgaw
@Muhammad Alkarouri: In your opinion it's quite clear. You wrote it. Of course you think it's clear. However, you have at least one person to whom it was not clear. You can argue with me about things that confused me, but it won't help. I was confused. I did ask. You can fix it or you can ignore me. But it doesn't make sense to argue too much. sum("abc") and sum([1,2,3]) is what I understood you to be asking about. You can claim it's clear all you want, but it actually confused me and I actually asked.Madelenemadelin
@S.Lott: Of course I cannot objectively say what I write is clear, if there hadn't been multiple answers from people understanding the question correctly. I assumed that people aware with sum know about the string special case. Anyway, I have explained in that comment you replied to, and I have edited the question. Thanks for your input.Mcgaw
@Muhammad Alkarouri: " I assumed that people aware with sum know about the string special case." Evidence that you're smarter than some people reading your questions. Don't argue with people who are confused. Explain to people who are confused.Madelenemadelin
@S.Lott: I appreciate your advice. On the other hand, I didn't mean it as evidence that I am smarter than some people. I meant that people who care about this question either know in advance or are willing to put in the effort to lookup Python sum which mentions this immediately. So it's not about being smart, it is about due diligence. I will, honestly, have your advice in mind for future situations. Thanks again.Mcgaw
@Muhammad Alkarouri: Here's what you're refusing to consider. Your question was -- clearly -- about sum("abc") and sum([1,2,3]) and nothing more. You can claim that I should have done some mysterious due diligence to determine that your question wasn't trivially bad. I read a lot of bad SO questions. There is no magical due diligence for separating bad questions from good. I can only ask. You can revise your question to make it clear. When one person bothers to ask for clarification you have to take that as evidence that something's wrong with the question.Madelenemadelin
@Madelenemadelin Not to beat a dead horse, but I read the question and understood it instantly. And on a more technical level, characters in a Python string are just strings themselves, you can technically /can/ sum the characters, resulting in ordinary concatenation. (','.join('foo'), for example, returns 'f,o,o'.)Commutate
I think dan04's answer, efficiency is the key here.Vogele
Actually, sum2 should be more like return reduce(operator.add, iterable, start). reduce can also take an optional start parameter, and if omitted, unlike sum, would raise an Exception when given an empty sequence.Biegel
E
49

Python tries to discourage you from "summing" strings. You're supposed to join them:

"".join(list_of_strings)

It's a lot faster, and uses much less memory.

A quick benchmark:

$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop

Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.

BTW, this restriction has been in place "forever", i.e., since the sum was added as a built-in function (rev. 32347)

Eternize answered 19/8, 2010 at 19:15 Comment(9)
The benchmark is useful. Your answer would be more complete if it compares the reduce vs sum for lists, which gives me around the same result. So the argument for strings would work only for strings.Mcgaw
Actually, I think it's the other way around: reduce and sum are similar for lists, because they do more or less the same thing. That's why I used this benchmark: reduce, representing sum, versus join, which is an optimised way of "adding" strings.Eternize
I am feeling old now. I understand "forever" in Python as before 2.0. Things done after the introduction of new style classes are not that "forever".Mcgaw
Hehe True :) I meant in terms of sum's lifetime, r32347's commit message is "Adding new built-in function sum, with docs and tests."Eternize
It does seem "unpythonic" to discourage this usage. I agree with Muhammad that it's a strange exception. But it seems this is the best reason available.Bicephalous
I think it's there because of the chronological sequence of implementations: when they implemented the sum builtin, they already had join for strings. So, to avoid people unknowingly (or knowingly, but lazily/mischievously) using sum for strings, they specifically disallowed it. Since there wasn't a specific "sum" for lists (or other types), they made the exception only for strings. Nowadays, I think they'd keep it this way, even if someone came up with a specific "sum", for backwards compatibility.Eternize
It's not optimizing for a common case. Optimization would be using that same if statement blocking it to instead invoke the (already written) str.join code. Since that would be such an easy optimization to make, I'd assume the intent is to deliberate fail noisily to teach people about the existence of str.join.Victualler
@Victualler you're right, that's not an optimisation at all. I'm not sure why I wrote that :)Eternize
Makes you wonder why sum does not just delegate to ''.join when start is a string, though, particularly since it already seems to have a special treatment for strings to produce that particular error message.Biegel
C
27

You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway..

>>> class ZeroObject(object):
...  def __add__(self, other):
...   return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'
Chlortetracycline answered 19/8, 2010 at 20:1 Comment(3)
I find this very interesting. Of course it is too clever by half, but it adds to the understanding of this feature.Mcgaw
And it’s a little faster than the reduce version.Famine
Still it's weird that Python checks for strings, but not for lists or tuples.Misbehavior
Y
17

Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup

In the builtin_sum function we have this bit of code:

     /* reject string values for 'start' parameter */
        if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
            PyErr_SetString(PyExc_TypeError,
                "sum() can't sum strings [use ''.join(seq) instead]");
            Py_DECREF(iter);
            return NULL;
        }
        Py_INCREF(result);
    }

So.. that's your answer.

It's explicitly checked in the code and rejected.

Yorick answered 20/8, 2010 at 5:15 Comment(4)
It's interesting to see the code, but the question was "Why aren't strings summed", not "How did they exclude strings from summing?" ...Bicephalous
Well, I meant, the reason why its not working is because its explicitly banned in the code. Others seemed to answer by explaining why you shouldn't do it.Yorick
Thanks for sharing the source code. But I get a different traceback when I try to sum two strings. sum(['foo', 'bar']) -> TypeError: unsupported operand type(s) for +: 'int' and 'str'. But sum(['foo', 'bar'], '') -> TypeError: sum() can't sum strings [use ''.join(seq) instead]. Any ideas why the first one returns a different traceback? I think it's more likely to occur that way.Oculo
I just noticed the answer to my above comment is explained by @Chlortetracycline below. I now realise the start=0 argument is the initial value for the summation process.Oculo
D
14

From the docs:

The preferred, fast way to concatenate a sequence of strings is by calling ''.join(sequence).

By making sum refuse to operate on strings, Python has encouraged you to use the correct method.

Delitescence answered 19/8, 2010 at 19:16 Comment(3)
So it's part of the 'one way to do it' philosophy. (As they could just have made sum treat strings specially)Hoebart
@ThomasAhle: There is a computational reason why sum is the wrong tool for the job. sum builds the result incrementally. Doing so with strings requires (potentially many) temporary immutable strings. If done naively, this process requires lots of temporary memory and lots of copying. In contrast, if done with ''.join(), the appropriate amount of memory is allocated from the beginning, and result is built right there at once. See Martelli's exhortation.Delitescence
Sure, but you can do lots of inefficient things in Python, and often it doesn't matter. I can't think of any other cases than 'sum of strings' where there are tests against doing something, hard coded in the Python C code!Hoebart
G
11

Short answer: Efficiency.

Long answer: The sum function has to create an object for each partial sum.

Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.

doubles are always the same size, which makes sum's running time O(1)×N = O(N).

int (formerly known as long) is arbitary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).

For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).

Thus, summing strings would be much slower than summing numbers.

str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.

Goldsmith answered 20/8, 2010 at 4:46 Comment(3)
That argument is wrong because the same would hold for lists and tuples, which can be summed. As stated by HS, Python explicitly check for strings, and only for strings, which just doesn't make sense.Misbehavior
@Philipp: The missing piece is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.Scrutineer
Excellent answer. Except for "would be much slower" has to be "is much slower" as has been proven.Suannesuarez
S
10

The Reason Why

@dan04 has an excellent explanation for the costs of using sum on large lists of strings.

The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples and other O(n**2) data structures. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.

Scrutineer answered 28/4, 2014 at 20:5 Comment(2)
Is there any reason why the sum() implementation doesn't just call ''.join() when a string is encountered in sum() instead of returning an error instructing you to call ''.join()?Fancier
@Slater: Yes. But I don't recall what it was. :(Scrutineer
F
4

Edit: Moved the parts about immutability to history.

Basically, its a question of preallocation. When you use a statement such as

sum(["a", "b", "c", ..., ])

and expect it to work similar to a reduce statement, the code generated looks something like

v1 = "" + "a" # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b" # must allocate v2 and set its size to len("a") + len("b")
...
res = v10000 + "$" # must allocate res and set its size to len(v9999) + len("$")

In each of these steps a new string is created, which for one might give some copying overhead as the strings are getting longer and longer. But that’s maybe not the point here. What’s more important, is that every new string on each line must be allocated to it’s specific size (which. I don’t know it it must allocate in every iteration of the reduce statement, there might be some obvious heuristics to use and Python might allocate a bit more here and there for reuse – but at several points the new string will be large enough that this won’t help anymore and Python must allocate again, which is rather expensive.

A dedicated method like join, however has the job to figure out the real size of the string before it starts and would therefore in theory only allocate once, at the beginning and then just fill that new string, which is much cheaper than the other solution.

Famine answered 19/8, 2010 at 19:26 Comment(5)
It's true, but that's not the whole story. Integers are immutable as well. But the join operation on strings is specialised, takes the whole list into account, and, therefore, is much faster.Eternize
Yeah, ok maybe immutability is not the real problem here. (Though I can imagine that register-sized integers suffer less from immutability on the Python-side than strings do.) But then I think that preallocation actually is the whole story.Famine
@Debiliski: The preallocation was probably the missing link for me; it tells why "",join behaves so much better than a generic sum. I had to look at the code to understand. I think the immutability is a red herring.Mcgaw
@Debiliski: Unfortunately I have accepted the other answer, else I would have suggested that you edit yours to make the preallocation and explanation more prominent.Mcgaw
Does it actually pre-allocate? Join's argument can be an arbitrary iterator, which you might only be able to traverse once. Without being familiar with the code, I'd guess the key point is at the C level you've got an over-allocated array which you can mutate in place; since it hasn't been released to python code yet this doesn't break the "python strings are immutable" law. The repeated allocations/copying can be amortized to O(n) if you always increase the size by at least double when you need more (rather than doing it for every string).Victualler
F
3

I dont know why, but this works!

import operator
def sum_of_strings(list_of_strings):
    return reduce(operator.add, list_of_strings)
Fishhook answered 21/10, 2014 at 12:45 Comment(1)
Because reduce is not sum -- but it will still have horrible performance for large lists.Scrutineer

© 2022 - 2024 — McMap. All rights reserved.