Group by two columns to get sum of another column
Asked Answered
E

2

22

I look most of the previously asked questions but was not able to find answer for my question:

I have following dataframe

           id   year month score num_attempts
0      483625  2010    01   50      1
1      967799  2009    03   50      1
2      213473  2005    09  100      1
3      498110  2010    12   60      1
5      187243  2010    01  100      1
6      508311  2005    10   15      1
7      486688  2005    10   50      1
8      212550  2005    10  500      1
10     136701  2005    09   25      1
11     471651  2010    01   50      1

I want to get the following dataframe

year month sum_score sum_num_attempts
2009    03   50           1
2005    09  125           2
2010    12   60           1
2010    01  200           2
2005    10  565           3

Here is what I tried:

sum_df = df.groupby(by=['year','month'])['score'].sum()

But this doesn't look efficient and correct. If I have more than one column need to be aggregate this seems like a very expensive call. for example if I have another column num_attempts and just want to sum by year month as score.

Exodontics answered 11/11, 2016 at 17:14 Comment(7)
Can you elaborate why you think it's an expansive call, and what you would do with num_attempts if you want to keep it in the group operation?Weak
@MaxU not sure if i understand the use of `as_index=False, what does that do?Exodontics
sorry for my first comment - i've misunderstood you. What is wrong with: df.groupby(['year','month'])[['score','num_attempts']].sum() or df.groupby(['year','month']).agg({'score':'sum', 'num_attempts':'max'})?Spratt
@MaxU so will this create new columns in the resulting data frame if i want in a different column to reflect the results can I do that. Also my data frame is really large. is this the best way to carry out this task.Exodontics
@Boud I have close to 9million records in this data frame and seems like its taking forever to get the results and I still haven't seen the final frame.Exodontics
can you provide an example of your actual dataframe?Court
@DennisGolomazov just updatedExodontics
C
25

This should be an efficient way:

sum_df = df.groupby(['year','month']).agg({'score': 'sum', 'num_attempts': 'sum'})
Court answered 11/11, 2016 at 18:28 Comment(4)
Does the score column has to be explicitly integer?Exodontics
@Null-Hypothesis I don't think so. If it doesn't work, will you elaborate?Court
The resulted sum_df only have the summed columns.Onitaonlooker
if anyone also wants to rename the column that gets created, so that the new(score) column doesn't take the same name as input column. append .rename(columns={'score': 'score_sum_by_year_month'}) at the end.Tinkle
C
0

An alternative way is to call groupby.sum on multiple columns and add prefix afterwards. This produces the desired column labels in the OP.

sum_df = df.groupby(['year', 'month'])[['score', 'num_attempts']].sum().add_prefix('sum_').reset_index()

or use named aggregation from the beginning

sum_df = df.groupby(['year', 'month'], as_index=False).agg(sum_score=('score', 'sum'), sum_num_attempts=('num_attempts', 'sum'))

Both produce the following frame:

res

Complicated answered 19/3, 2023 at 2:14 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.