Group by two columns to get sum of another column

Asked 11/11, 2016 at 17:14 Answered 19/3, 2023 at 2:14

Solved python-3.x pandas dataframe group-by sum

I look most of the previously asked questions but was not able to find answer for my question:

I have following dataframe

           id   year month score num_attempts
0      483625  2010    01   50      1
1      967799  2009    03   50      1
2      213473  2005    09  100      1
3      498110  2010    12   60      1
5      187243  2010    01  100      1
6      508311  2005    10   15      1
7      486688  2005    10   50      1
8      212550  2005    10  500      1
10     136701  2005    09   25      1
11     471651  2010    01   50      1

I want to get the following dataframe

year month sum_score sum_num_attempts
2009    03   50           1
2005    09  125           2
2010    12   60           1
2010    01  200           2
2005    10  565           3

Here is what I tried:

sum_df = df.groupby(by=['year','month'])['score'].sum()

But this doesn't look efficient and correct. If I have more than one column need to be aggregate this seems like a very expensive call. for example if I have another column num_attempts and just want to sum by year month as score.

Exodontics answered 11/11, 2016 at 17:14 Comment(7)

Can you elaborate why you think it's an expansive call, and what you would do with num_attempts if you want to keep it in the group operation? – Weak 11/11, 2016 at 17:27

@MaxU not sure if i understand the use of `as_index=False, what does that do? – Exodontics 11/11, 2016 at 17:31

sorry for my first comment - i've misunderstood you. What is wrong with: df.groupby(['year','month'])[['score','num_attempts']].sum() or df.groupby(['year','month']).agg({'score':'sum', 'num_attempts':'max'})? – Spratt 11/11, 2016 at 17:37

@MaxU so will this create new columns in the resulting data frame if i want in a different column to reflect the results can I do that. Also my data frame is really large. is this the best way to carry out this task. – Exodontics 11/11, 2016 at 17:40

@Boud I have close to 9million records in this data frame and seems like its taking forever to get the results and I still haven't seen the final frame. – Exodontics 11/11, 2016 at 17:51

can you provide an example of your actual dataframe? – Court 11/11, 2016 at 18:18

@DennisGolomazov just updated – Exodontics 11/11, 2016 at 18:26

This should be an efficient way:

sum_df = df.groupby(['year','month']).agg({'score': 'sum', 'num_attempts': 'sum'})

Court answered 11/11, 2016 at 18:28 Comment(4)

Does the score column has to be explicitly integer? – Exodontics 12/11, 2016 at 16:21

@Null-Hypothesis I don't think so. If it doesn't work, will you elaborate? – Court 16/11, 2016 at 4:0

The resulted sum_df only have the summed columns. – Onitaonlooker 17/11, 2020 at 13:20

if anyone also wants to rename the column that gets created, so that the new(score) column doesn't take the same name as input column. append .rename(columns={'score': 'score_sum_by_year_month'}) at the end. – Tinkle 28/9, 2021 at 9:34

An alternative way is to call groupby.sum on multiple columns and add prefix afterwards. This produces the desired column labels in the OP.

sum_df = df.groupby(['year', 'month'])[['score', 'num_attempts']].sum().add_prefix('sum_').reset_index()

or use named aggregation from the beginning

sum_df = df.groupby(['year', 'month'], as_index=False).agg(sum_score=('score', 'sum'), sum_num_attempts=('num_attempts', 'sum'))

Both produce the following frame:

Complicated answered 19/3, 2023 at 2:14 Comment(0)

Recommended topics

Hot tags