Pandas groupby cumulative sum
Asked Answered
D

8

154

I would like to add a cumulative sum column to my Pandas dataframe so that:

name day no
Jack Monday 10
Jack Tuesday 20
Jack Tuesday 10
Jack Wednesday 50
Jill Monday 40
Jill Wednesday 110

becomes:

Jack | Monday     | 10  | 10
Jack | Tuesday    | 30  | 40
Jack | Wednesday  | 50  | 90
Jill | Monday     | 40  | 40
Jill | Wednesday  | 110 | 150

I tried various combos of df.groupby and df.agg(lambda x: cumsum(x)) to no avail.

Disapprobation answered 26/3, 2014 at 3:17 Comment(1)
To create both columns using a one-liner, use this answer.Daciadacie
D
148

This should do it, need groupby() twice:

df.groupby(['name', 'day']).sum() \
  .groupby(level=0).cumsum().reset_index()

Explanation:

print(df)
   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   20
2  Jack    Tuesday   10
3  Jack  Wednesday   50
4  Jill     Monday   40
5  Jill  Wednesday  110

# sum per name/day
print( df.groupby(['name', 'day']).sum() )
                 no
name day           
Jack Monday      10
     Tuesday     30
     Wednesday   50
Jill Monday      40
      Wednesday  110

# cumulative sum per name/day
print( df.groupby(['name', 'day']).sum() \
         .groupby(level=0).cumsum() )
                 no
name day           
Jack Monday      10
     Tuesday     40
     Wednesday   90
Jill Monday      40
     Wednesday  150

The dataframe resulting from the first sum is indexed by 'name' and by 'day'. You can see it by printing

df.groupby(['name', 'day']).sum().index 

When computing the cumulative sum, you want to do so by 'name', corresponding to the first index (level 0).

Finally, use reset_index to have the names repeated.

df.groupby(['name', 'day']).sum().groupby(level=0).cumsum().reset_index()

   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   40
2  Jack  Wednesday   90
3  Jill     Monday   40
4  Jill  Wednesday  150
Dews answered 26/3, 2014 at 3:56 Comment(1)
What a brute method to achieve the result, wished this was simple in pandasTanberg
R
83

Modification to @Dmitry's answer. This is simpler and works in pandas 0.19.0:

print(df) 

 name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   20
2  Jack    Tuesday   10
3  Jack  Wednesday   50
4  Jill     Monday   40
5  Jill  Wednesday  110

df['no_csum'] = df.groupby(['name'])['no'].cumsum()

print(df)
   name        day   no  no_csum
0  Jack     Monday   10       10
1  Jack    Tuesday   20       30
2  Jack    Tuesday   10       40
3  Jack  Wednesday   50       90
4  Jill     Monday   40       40
5  Jill  Wednesday  110      150
Ringmaster answered 30/3, 2018 at 16:49 Comment(1)
This works but you need to be careful with the order of the 'day' column. For example, if 'day' was in alphabetical order, 'no_csum' probably wouldn't reflect the information you actually need.Lunitidal
P
60

This works in pandas 0.16.2

In[23]: print df
        name          day   no
0      Jack       Monday    10
1      Jack      Tuesday    20
2      Jack      Tuesday    10
3      Jack    Wednesday    50
4      Jill       Monday    40
5      Jill    Wednesday   110
In[24]: df['no_cumulative'] = df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
In[25]: print df
        name          day   no  no_cumulative
0      Jack       Monday    10             10
1      Jack      Tuesday    20             30
2      Jack      Tuesday    10             40
3      Jack    Wednesday    50             90
4      Jill       Monday    40             40
5      Jill    Wednesday   110            150
Percyperdido answered 7/12, 2015 at 10:3 Comment(1)
df.groupby(['name'])['no'].cumsum() also works fine.Getter
P
13

you should use

df['cum_no'] = df.no.cumsum()

http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.cumsum.html

Another way of doing it

import pandas as pd
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
           'C2' : [1,2,3,4,5]})
df['cumsum'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.cumsum())
df

enter image description here

Plexiform answered 26/4, 2017 at 4:33 Comment(0)
R
9

Instead of df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum() (see above) you could also do a df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()

  • df.groupby(by=['name','day']).sum() is actually just moving both columns to a MultiIndex
  • as_index=False means you do not need to call reset_index afterwards
Redaredact answered 19/7, 2017 at 10:40 Comment(0)
A
2

data.csv:

name,day,no
Jack,Monday,10
Jack,Tuesday,20
Jack,Tuesday,10
Jack,Wednesday,50
Jill,Monday,40
Jill,Wednesday,110

Code:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')
print(df)
df = df.groupby(['name', 'day'])['no'].sum().reset_index()
print(df)
df['cumsum'] = df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
print(df)

Output:

   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   20
2  Jack    Tuesday   10
3  Jack  Wednesday   50
4  Jill     Monday   40
5  Jill  Wednesday  110
   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   30
2  Jack  Wednesday   50
3  Jill     Monday   40
4  Jill  Wednesday  110
   name        day   no  cumsum
0  Jack     Monday   10      10
1  Jack    Tuesday   30      40
2  Jack  Wednesday   50      90
3  Jill     Monday   40      40
4  Jill  Wednesday  110     150
Arlon answered 4/11, 2020 at 10:56 Comment(0)
S
1

as of version 1.0 pandas got a new api for window functions.

specifically, what was achieved earlier with

df.groupby(['name'])['no'].apply(lambda x: x.cumsum())  

or

df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()

now becomes

df.groupby(['name'])['no'].expanding().sum()

I find it more intuitive for all window-related functions than groupby+level operations

although learning to use groupby is useful for general purpose.
see docs: https://pandas.pydata.org/docs/user_guide/window.html

Streetlight answered 14/9, 2022 at 19:48 Comment(0)
D
0

If you want to write a one-liner (perhaps you want to pass the methods into a pipeline), you can do so by first setting as_index parameter of groupby method to False to return a dataframe from the aggregation step and use assign() to assign a new column to it (the cumulative sum for each person).

These chained methods return a new dataframe, so you'll need to assign it to a variable (e.g. agg_df) to be able to use it later on.

agg_df = (
    # aggregate df by name and day
    df.groupby(['name','day'], as_index=False)['no'].sum()
    .assign(
        # assign the cumulative sum of each name as a new column
        cumulative_sum=lambda x: x.groupby('name')['no'].cumsum()
    )
)

res

Daciadacie answered 15/9, 2022 at 10:25 Comment(2)
How can we be sure that "cumsum" is executed in the "day" order?Pirog
@JigidiSarnath you’ll have to sort the groupby result by day (before the call to cumsum) if you want the cumsum to be executed in day order. See this post for ways to sort the frame.Daciadacie

© 2022 - 2024 — McMap. All rights reserved.