Perform cumulative sum over a column but reset to 0 if the sum becomes negative in Pandas

Asked 15/8, 2019 at 13:44 Answered 16/8, 2019 at 8:46

I have a pandas dataframe with two columns like this,

  Item  Value
0    A      7
1    A      2
2    A     -6
3    A    -70
4    A      8
5    A      0

I want to compute a cumulative sum over the Value column. But if the running sum becomes negative at any point, I want to reset it to 0.

I am currently using the loop shown below to do this:

sum_ = 0
cumsum = []

for val in sample['Value'].values:
    sum_ += val
    if sum_ < 0:
        sum_ = 0
    cumsum.append(sum_)

print(cumsum) # [7, 9, 3, 0, 8, 8]

I am looking for a more efficient way to perform this in pure pandas.

Peptone answered 15/8, 2019 at 13:44 Comment(10)

I don't think pandas has a method that can achieve this – Boner 15/8, 2019 at 13:49

I was thinking the same and finally settled on the loop solution I posted in the question. I was wondering whether I am missing some pandas trick that could do the magic – Peptone 15/8, 2019 at 13:56

What you did is close to what I could offer; the only difference is that I might use numba – Boner 15/8, 2019 at 13:57

I am not familiar with numba. How much of a performance improvement (in terms of time) can I expect? If you can post the code as an answer, I will check for myself and let you know whether it works for me. – Peptone 15/8, 2019 at 13:58

In terms of performance, pure Python is not bad :-) – Boner 15/8, 2019 at 14:0

Can you post the code if possible? – Peptone 15/8, 2019 at 14:4

You can find some solutions at #56904890, even though that question is not 100% the same – Boner 15/8, 2019 at 14:10

Great question, posted this as an improvement suggestion to pandas on GitHub @WeNYoBen – Nobie 15/8, 2019 at 14:39

@Nobie maybe it would be better to add upper and lower bounds :-) like clip – Boner 15/8, 2019 at 14:44

Good suggestion, made an edit @WeNYoBen – Nobie 15/8, 2019 at 14:50

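This can be done using numpy, but it is slower than the numba solution.

import numpy as np

# Binary ufunc: add two values, resetting to 0 when the running sum goes negative
sumlm = np.frompyfunc(lambda a, b: 0 if a + b < 0 else a + b, 2, 1)
# np.object was removed in NumPy 1.24; the builtin object is the equivalent dtype
newx = sumlm.accumulate(df.Value.values, dtype=object)
newx
Out[147]: array([7, 9, 3, 0, 8, 8], dtype=object)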
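For readers unfamiliar with accumulate: it folds a two-argument function over the values from left to right, carrying the previous result along. A minimal standard-library sketch of the same idea follows. The name reset_cumsum is made up for illustration, and the initial=0 seed (Python 3.8+) is an addition not in the answer above: a plain accumulate passes the first element through unchanged, so as far as I can tell a negative first value would otherwise not be reset.

```python
from itertools import accumulate


def reset_cumsum(values):
    """Running sum that is reset to 0 whenever it would become negative.

    Hypothetical helper for illustration; not part of pandas or NumPy.
    """
    def step(total, value):
        new_total = total + value
        return 0 if new_total < 0 else new_total

    # Seed the fold with 0 so the first element also goes through `step`,
    # then drop the seed from the output.
    return list(accumulate(values, step, initial=0))[1:]


print(reset_cumsum([7, 2, -6, -70, 8, 0]))  # [7, 9, 3, 0, 8, 8]
print(reset_cumsum([-5, 3]))                # [0, 3]
```

This is still a Python-level loop under the hood, so it should not be expected to beat the compiled version.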

Here is the numba solution

from numba import njit
@njit
def cumli(x, lim):
    total = 0
    result = []
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0
        result.append(total)
    return result
cumli(df.Value.values,0)
Out[166]: [7, 9, 3, 0, 8, 8]
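If the result should end up as a DataFrame column, possibly restarted for every Item, something along these lines might work. This is only a sketch: the B rows and the helper name reset_cumsum_group are invented for illustration, and a plain Python loop stands in for the compiled function so the snippet runs without numba.

```python
import numpy as np
import pandas as pd


def reset_cumsum_group(series):
    """Running sum per group, reset to 0 when it becomes negative.

    Hypothetical helper; a plain loop stands in for the numba function.
    """
    values = series.to_numpy()
    out = np.empty_like(values)
    total = 0
    for i, v in enumerate(values):
        total += v
        if total < 0:
            total = 0
        out[i] = total
    return out


# The A rows come from the question; the B rows are made up for the example.
df = pd.DataFrame({
    "Item": ["A"] * 6 + ["B"] * 3,
    "Value": [7, 2, -6, -70, 8, 0, 5, -10, 3],
})

# transform calls the helper once per group and aligns the result to the index
df["CumSum"] = df.groupby("Item")["Value"].transform(reset_cumsum_group)
print(df["CumSum"].tolist())  # [7, 9, 3, 0, 8, 8, 5, 0, 3]
```

The running sum starts again from 0 at the first B row because each group is processed separately.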

Boner answered 15/8, 2019 at 14:25 Comment(11)

Awesome, works as expected. I will get back to you after testing the time. – Peptone 15/8, 2019 at 14:28

Was this method ever tested against the solutions of divakar and pirsquared? Just wondering in terms of speed – Nobie 15/8, 2019 at 14:28

@Nobie Not sure about the speed, but since I listed the link above, the OP can pick whichever one he wants :-) – Boner 15/8, 2019 at 14:29

@Nobie It tested slower than numba; I will attach the numba solution – Boner 15/8, 2019 at 14:32

I am comparing your solution with the numba solution in the link you posted (in terms of time) – Peptone 15/8, 2019 at 14:35

@SreeramTP Divakar's method is the best in terms of timing – Boner 15/8, 2019 at 14:39

Yeah, I could see that. Numba method is a bit faster than the pure python loop and the solution posted by you. If you could add the solution using numba I can go ahead and accept the answer – Peptone 15/8, 2019 at 14:45

@SreeramTP Here you go. I would say my original method is more readable, but it is the slowest. – Boner 15/8, 2019 at 14:51

Yeah true. But it serves the purpose. – Peptone 15/8, 2019 at 14:53

You can get approximately a factor of 10 speedup if you use arrays instead of lists, e.g. result = np.empty_like(x); idx = 0, then write each result to the array with result[idx] = total; idx += 1, and shrink the array at the end with return result[:idx] – Decolorize 15/8, 2019 at 17:36

@Decolorize can you elaborate.? Maybe consider adding an answer. – Peptone 16/8, 2019 at 5:34

This is only a comment on WeNYoBen's answer.

If you can avoid lists in Numba-compiled code, it is usually advisable to do so.

Example

from numba import njit
import numpy as np

#with lists
@njit()
def cumli(x, lim):
    total = 0
    result = []
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0
        result.append(total)
    return result

#without lists
@njit()
def cumli_2(x, lim):
    total = 0.
    result = np.empty_like(x)
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0.
        result[i]=total
    return result

Timings

Without Numba (comment out @njit()):

x=(np.random.rand(1_000)-0.5)*5

  %timeit a=cumli(x, 0.)
  220 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  %timeit a=cumli_2(x, 0.)
  227 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Without compilation, it makes no difference whether you use a list or an array. That changes once you JIT-compile the function.

With Numba:

  %timeit a=cumli(x, 0.)
  27.4 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  %timeit a=cumli_2(x, 0.)
  2.96 µs ± 32.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Even in somewhat more complicated cases (final array size unknown, or only the maximum array size known), it often makes sense to allocate an array and shrink it at the end, or in simple cases even to run the algorithm once to determine the final array size and then do the real calculation.
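As a rough illustration of the allocate-and-shrink pattern, here is a sketch for a case where the output length is not known in advance: collecting the positions at which the running sum was reset. The function reset_positions is invented for this example, and the try/except fallback is only there so the snippet also runs (slowly) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for portability: run as plain Python if Numba is missing.
    def njit(func):
        return func


@njit
def reset_positions(x, lim):
    """Return the indices at which the running sum dropped below lim.

    Hypothetical example function; the output length is unknown up front.
    """
    # There can be at most len(x) resets, so allocate for the worst case.
    out = np.empty(x.shape[0], dtype=np.int64)
    count = 0
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i]
        if total < lim:
            total = 0.0
            out[count] = i
            count += 1
    # Shrink to the part that was actually filled.
    return out[:count]


x = np.array([7.0, 2.0, -6.0, -70.0, 8.0, 0.0])
print(reset_positions(x, 0.0))  # [3]
```

The returned slice is a view on the full buffer; if the buffer is large and only a few entries are kept, returning out[:count].copy() may be preferable so the rest can be freed.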

Decolorize answered 16/8, 2019 at 8:46 Comment(0)
