Perform cumulative sum over a column but reset to 0 if sum becomes negative in Pandas

I have a pandas dataframe with two columns like this,

Item    Value
0   A   7
1   A   2
2   A   -6
3   A   -70
4   A   8
5   A   0

I want to compute the cumulative sum of the Value column, but whenever the running total becomes negative I want to reset it to 0.

I am currently using the loop shown below to do this:

sum_ = 0
cumsum = []

for val in sample['Value'].values:
    sum_ += val
    if sum_ < 0:
        sum_ = 0
    cumsum.append(sum_)

print(cumsum) # [7, 9, 3, 0, 8, 8]

I am looking for a more efficient way to perform this in pure pandas.

Peptone answered 15/8, 2019 at 13:44 Comment(10)
I don't think pandas has a method that can do this. - Boner
I was thinking the same and finally settled on the loop I posted in the question. I was wondering whether I was missing some pandas trick that could do the magic. - Peptone
What you did is close to what I can offer; the only small difference is that I might use numba. - Boner
I am not familiar with numba. How much of a performance improvement (in terms of time) can I expect? If you can post the code as an answer, I will check for myself and let you know whether it works for me. - Peptone
In terms of performance, pure Python is not bad :-) - Boner
Can you post the code if possible? - Peptone
#56904890 you can get some solutions from there, even though it is not 100% the same. - Boner
Great question, posted this as an improvement suggestion to pandas on GitHub @WeNYoBen - Nobie
@Nobie maybe it would be better to add upper and lower bounds :-) like clip - Boner
Good suggestion, made an edit @WeNYoBen - Nobie

This can be done using numpy but is slower than the numba solution.

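import numpy as np

# Binary ufunc: add the next value, resetting the running total to 0 if it goes negative
sumlm = np.frompyfunc(lambda a, b: 0 if a + b < 0 else a + b, 2, 1)
# np.object was removed in NumPy 1.24; the builtin object is equivalent
newx = sumlm.accumulate(df.Value.values, dtype=object)
newx
Out[147]: array([7, 9, 3, 0, 8, 8], dtype=object)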

Here is the numba solution

from numba import njit
@njit
def cumli(x, lim):
    total = 0
    result = []
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0
        result.append(total)
    return result
cumli(df.Value.values,0)
Out[166]: [7, 9, 3, 0, 8, 8]
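
For readers who want to stay in pandas: when the reset value and the limit are both 0, as in the question, the loop is equivalent to subtracting the lowest running total seen so far (capped at 0) from the plain cumulative sum. The sketch below assumes exactly that case; the function name cumsum_floor_zero and the output column CumSum are made up for illustration, and I have not timed it against the numba version. With float data the result can differ from the loop by rounding error.

```python
import pandas as pd


def cumsum_floor_zero(values: pd.Series) -> pd.Series:
    """Cumulative sum that is reset to 0 whenever it would become negative."""
    running = values.cumsum()
    # Lowest running total so far, capped at 0 so that nothing is
    # subtracted before the total first dips below zero.
    lowest = running.cummin().clip(upper=0)
    return running - lowest


df = pd.DataFrame({"Item": ["A"] * 6, "Value": [7, 2, -6, -70, 8, 0]})
df["CumSum"] = cumsum_floor_zero(df["Value"])
print(df["CumSum"].tolist())  # [7, 9, 3, 0, 8, 8]
```

This only covers the reset-to-0 case; a different lower limit or an upper bound would need the loop or the numba function.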
Boner answered 15/8, 2019 at 14:25 Comment(11)
Awesome, works as expected. I will get back to you after testing the time. - Peptone
Was this method ever tested against the solutions of divakar and pirsquared? Just wondering in terms of speed. - Nobie
@Nobie Not sure about the speed, but since I listed the link above, the OP can pick the one he wants :-) - Boner
@Nobie I tested it and it is slower than numba; I will attach the numba solution. - Boner
I am comparing your solution with the numba solution in the link you posted (in terms of time). - Peptone
@SreeramTP Divakar's method is the best in terms of timing. - Boner
Yeah, I could see that. The numba method is a bit faster than the pure Python loop and the solution you posted. If you could add the numba solution, I can go ahead and accept the answer. - Peptone
@SreeramTP here you go. I would say my original method is more readable, but it is the slowest. - Boner
Yeah, true. But it serves the purpose. - Peptone
You can get approximately a factor of 10 speedup if you use arrays instead of lists, e.g. result = np.empty_like(x); idx = 0, then write each result to the array with result[idx] = total; idx += 1, and shrink the array at the end with return result[:idx]. - Decolorize
@Decolorize can you elaborate? Maybe consider adding an answer. - Peptone

This is only a comment on WeNYoBen's answer.

If you can avoid lists in Numba-compiled code, it is usually advisable to do so.

Example

from numba import njit
import numpy as np

#with lists
@njit()
def cumli(x, lim):
    total = 0
    result = []
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0
        result.append(total)
    return result

#without lists
@njit()
def cumli_2(x, lim):
    total = 0.
    result = np.empty_like(x)
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0.
        result[i]=total
    return result

Timings

Without Numba (comment out @njit()):

x=(np.random.rand(1_000)-0.5)*5

  %timeit a=cumli(x, 0.)
  220 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  %timeit a=cumli_2(x, 0.)
  227 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Without compilation there is no meaningful difference between using lists and arrays (220 µs vs. 227 µs). That is not the case once you JIT-compile the function.

With Numba:

  %timeit a=cumli(x, 0.)
  27.4 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  %timeit a=cumli_2(x, 0.)
  2.96 µs ± 32.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Even in slightly more complicated cases (final array size unknown, or only the maximum array size known), it often makes sense to allocate an array and shrink it at the end, or in simple cases even to run the algorithm once to find the final array size and then do the real calculation.
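
A minimal sketch of the allocate-and-shrink idea, using a made-up function reset_positions that collects the indices at which the running total was reset. The number of resets is not known in advance, but it can never exceed len(x), so the output array is allocated at that maximum size and sliced at the end. The fallback decorator is only there so the example also runs without Numba installed; I have not timed this variant.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to plain Python so the sketch still runs.
    def njit(func):
        return func


@njit
def reset_positions(x):
    # Upper bound on the output size: at most one reset per element.
    out = np.empty(x.shape[0], dtype=np.int64)
    n = 0  # number of slots actually filled
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i]
        if total < 0:
            total = 0.0
            out[n] = i
            n += 1
    # Shrink to the part that was written.
    return out[:n]


x = np.array([7.0, 2.0, -6.0, -70.0, 8.0, 0.0])
print(reset_positions(x))  # [3]
```

The returned slice is a view on the larger buffer; call .copy() on it if the unused tail should be released.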

Decolorize answered 16/8, 2019 at 8:46 Comment(0)
