Pandas group by cumsum keep columns
Asked Answered
H

2

14

I have spent a few hours now trying to do a "cumulative group by sum" on a pandas dataframe. I have looked at all the stackoverflow answers and surprisingly none of them can solve my (very elementary) problem:

I have a dataframe:

df1 Out[8]: Name Date Amount 0 Jack 2016-01-31 10 1 Jack 2016-02-29 5 2 Jack 2016-02-29 8 3 Jill 2016-01-31 10 4 Jill 2016-02-29 5

I am trying to

  1. group by ['Name','Date'] and
  2. cumsum 'Amount'.
  3. That is it.

So the desired output is:

df1 Out[10]: Name Date Cumsum 0 Jack 2016-01-31 10 1 Jack 2016-02-29 23 2 Jill 2016-01-31 10 3 Jill 2016-02-29 15

EDIT: I am simplifying the question. With the current answers I still can't get the correct "running" cumsum. Look closely, I want to see the cumulative sum "10, 23, 10, 15". In words, I want to see, at every consecutive date, the total cumulative sum for a person. NB: If there are two entries on one date for the same person, I want to sum those and then add them to the running cumsum and only then print the sum.

Henig answered 23/1, 2017 at 14:42 Comment(0)
R
16

You need assign output to new column and then remove Amount column by drop:

df1['Cumsum'] = df1.groupby(by=['Name','Date'])['Amount'].cumsum()
df1 = df1.drop('Amount', axis=1)
print (df1)
   Name        Date  Cumsum
0  Jack  2016-01-31      10
1  Jack  2016-02-29       5
2  Jack  2016-02-29      13
3  Jill  2016-01-31      10
4  Jill  2016-02-29       5

Another solution with assign:

df1 = df1.assign(Cumsum=df1.groupby(by=['Name','Date'])['Amount'].cumsum())
         .drop('Amount', axis=1)
print (df1)
   Name        Date  Cumsum
0  Jack  2016-01-31      10
1  Jack  2016-02-29       5
2  Jack  2016-02-29      13
3  Jill  2016-01-31      10
4  Jill  2016-02-29       5

EDIT by comment:

First groupby columns Name and Date and aggregate sum, then groupby by level Name and aggregate cumsum.

df = df1.groupby(by=['Name','Date'])['Amount'].sum()
        .groupby(level='Name').cumsum().reset_index(name='Cumsum')
print (df)
   Name        Date  Cumsum
0  Jack  2016-01-31      10
1  Jack  2016-02-29      23
2  Jill  2016-01-31      10
3  Jill  2016-02-29      15
Roberts answered 23/1, 2017 at 14:43 Comment(1)
Thanks for the response, however the second group should lump together Jack's multiple 2016-02-29 Amounts. So the Cumsum must only have four rows reading "10, 23, 10, 15". I will try to work with what you have provided nevertheless, thanks.Henig
P
11

Set the index first, then groupby.

df.set_index(['Name', 'Date']).groupby(level=[0, 1]).Amount.cumsum().reset_index()

enter image description here


After the OP changed their question, this is now the correct answer.

df1.groupby(
    ['Name','Date']
)Amount.sum().groupby(
    level='Name'
).cumsum()

This is the same answer provided by jezrael

Papillary answered 23/1, 2017 at 14:49 Comment(5)
Thanks for the response. As per my edit, your solution does not give the desired output, however I believe you have provided good direction. Thank youHenig
Whoever gave me a down vote, please reconsider as the OP changed their question after the answer was given. I didn't want to appear to copy jezrael's answer so I just referred to it.Papillary
Hi piRSquared, I downvoted because the original desired output (even before the EDIT) was not achieved with your code - so at least not an upvote imho. However in hindsight you did provide set_index which solved my secondary problem so I will upvote. Thanks againHenig
Thank you for the prompt response and the helpHenig
Just a note: This solved an issue for me with cumsum() not including the groupby column. So thanks!Torsk

© 2022 - 2024 — McMap. All rights reserved.