How do I use Pandas group-by to get the sum?
P

11

391

I am using this dataframe:

Fruit   Date      Name  Number
Apples  10/6/2016 Bob    7
Apples  10/6/2016 Bob    8
Apples  10/6/2016 Mike   9
Apples  10/7/2016 Steve 10
Apples  10/7/2016 Bob    1
Oranges 10/7/2016 Bob    2
Oranges 10/6/2016 Tom   15
Oranges 10/6/2016 Mike  57
Oranges 10/6/2016 Bob   65
Oranges 10/7/2016 Tony   1
Grapes  10/7/2016 Bob    1
Grapes  10/7/2016 Tom   87
Grapes  10/7/2016 Bob   22
Grapes  10/7/2016 Bob   12
Grapes  10/7/2016 Tony  15

I would like to aggregate this by Name and then by Fruit to get a total number of Fruit per Name. For example:

Bob,Apples,16

I tried grouping by Name and Fruit but how do I get the total number of Fruit?

Perforce answered 7/10, 2016 at 17:36 Comment(1)
you can use dfsql df.sql('SELECT fruit, sum(number) GROUP BY fruit') github.com/mindsdb/dfsql medium.com/riselab/…Caligula
S
444

Use GroupBy.sum:

df.groupby(['Fruit','Name']).sum()

Out[31]: 
               Number
Fruit   Name         
Apples  Bob        16
        Mike        9
        Steve      10
Grapes  Bob        35
        Tom        87
        Tony       15
Oranges Bob        67
        Mike       57
        Tom        15
        Tony        1

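To sum only a specific column, select it before calling sum(): df.groupby(['Name', 'Fruit'])['Number'].sum(). In recent pandas versions this is the safer form, because a bare .sum() also tries to aggregate the non-numeric Date column.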
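A self-contained sketch of the grouped sum, rebuilding the question's table by hand. The variable names rows and totals are illustrative and do not come from the original posts.

```python
import pandas as pd

# The question's data, typed in as (Fruit, Date, Name, Number) tuples.
rows = [
    ("Apples", "10/6/2016", "Bob", 7), ("Apples", "10/6/2016", "Bob", 8),
    ("Apples", "10/6/2016", "Mike", 9), ("Apples", "10/7/2016", "Steve", 10),
    ("Apples", "10/7/2016", "Bob", 1), ("Oranges", "10/7/2016", "Bob", 2),
    ("Oranges", "10/6/2016", "Tom", 15), ("Oranges", "10/6/2016", "Mike", 57),
    ("Oranges", "10/6/2016", "Bob", 65), ("Oranges", "10/7/2016", "Tony", 1),
    ("Grapes", "10/7/2016", "Bob", 1), ("Grapes", "10/7/2016", "Tom", 87),
    ("Grapes", "10/7/2016", "Bob", 22), ("Grapes", "10/7/2016", "Bob", 12),
    ("Grapes", "10/7/2016", "Tony", 15),
]
df = pd.DataFrame(rows, columns=["Fruit", "Date", "Name", "Number"])

# Select the column first, then sum within each (Name, Fruit) group.
totals = df.groupby(["Name", "Fruit"])["Number"].sum()

print(totals.loc[("Bob", "Apples")])  # 16
```

The result is a Series indexed by (Name, Fruit), so a single total can be looked up with a tuple key as shown.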

Scheelite answered 7/10, 2016 at 17:37 Comment(2)
The question is: if the data is read from Excel and "Number" comes in as a string column, how can sum() be used?Mer
## There are five columns of data in 'overview.csv': temp = pd.read_csv("overview.csv"); temp.groupby([temp.columns[0], temp.columns[1]])[temp.columns[4]].sum(); print(temp). This does not show the sum of temp.columns[4].Mer
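On the two comments above, a minimal sketch under two assumptions: the Number column arrived as strings, and the grouped result was never assigned. groupby(...).sum() returns a new object and leaves the original frame unchanged, so printing the original frame cannot show the totals. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame where "Number" was loaded as text.
df = pd.DataFrame({
    "Fruit": ["Apples", "Apples", "Apples"],
    "Name": ["Bob", "Bob", "Mike"],
    "Number": ["7", "8", "9"],
})

# Convert text to numbers; errors="coerce" would turn unparseable cells into NaN.
df["Number"] = pd.to_numeric(df["Number"], errors="coerce")

# Keep the result: the groupby does not modify df in place.
totals = df.groupby(["Fruit", "Name"])["Number"].sum()
print(totals)
```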
M
265

You can also use the agg function:

df.groupby(['Name', 'Fruit'])['Number'].agg('sum')
Mindamindanao answered 8/10, 2016 at 11:40 Comment(0)
B
209

If you want to keep the original columns Fruit and Name, use reset_index(). Otherwise Fruit and Name will become part of the index.

df.groupby(['Fruit','Name'])['Number'].sum().reset_index()

Fruit   Name       Number
Apples  Bob        16
Apples  Mike        9
Apples  Steve      10
Grapes  Bob        35
Grapes  Tom        87
Grapes  Tony       15
Oranges Bob        67
Oranges Mike       57
Oranges Tom        15
Oranges Tony        1

Without reset_index(), as seen in the other answers, the result is a Series indexed by Fruit and Name:

df.groupby(['Fruit','Name'])['Number'].sum()

Fruit    Name 
Apples   Bob      16
         Mike      9
         Steve    10
Grapes   Bob      35
         Tom      87
         Tony     15
Oranges  Bob      67
         Mike     57
         Tom      15
         Tony      1
Name: Number, dtype: int64
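To make the difference concrete, here is a small sketch on a shortened version of the data. The names as_series, as_frame and alt are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apples", "Apples", "Apples", "Oranges"],
    "Name": ["Bob", "Bob", "Mike", "Bob"],
    "Number": [7, 8, 9, 2],
})

# Grouping keys end up in the index: the result is a Series.
as_series = df.groupby(["Fruit", "Name"])["Number"].sum()

# reset_index() moves the keys back into ordinary columns.
as_frame = as_series.reset_index()

# as_index=False asks groupby to keep the keys as columns from the start.
alt = df.groupby(["Fruit", "Name"], as_index=False)["Number"].sum()

print(type(as_series).__name__)  # Series
print(list(as_frame.columns))    # ['Fruit', 'Name', 'Number']
```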
Blaisdell answered 2/7, 2018 at 10:1 Comment(0)
B
68

Both the other answers accomplish what you want.

You can use the pivot functionality to arrange the data in a nice table

df.groupby(['Fruit','Name'], as_index=False)['Number'].sum().pivot(index='Fruit', columns='Name', values='Number').fillna(0)



Name    Bob     Mike    Steve   Tom    Tony
Fruit                   
Apples  16.0    9.0     10.0    0.0     0.0
Grapes  35.0    0.0     0.0     87.0    15.0
Oranges 67.0    57.0    0.0     15.0    1.0
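As a possible alternative, pivot_table can do the grouping, summing and reshaping in one call. This is a sketch on a shortened version of the data; the name wide is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apples", "Apples", "Apples", "Oranges", "Oranges"],
    "Name": ["Bob", "Bob", "Mike", "Tom", "Bob"],
    "Number": [7, 8, 9, 15, 2],
})

# One row per Fruit, one column per Name, summed Number in each cell.
# fill_value=0 replaces the empty (Fruit, Name) combinations.
wide = df.pivot_table(
    index="Fruit", columns="Name", values="Number",
    aggfunc="sum", fill_value=0,
)
print(wide)
```

Using fill_value here instead of a later fillna(0) should keep the cells as integers rather than floats.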
Bouton answered 7/10, 2016 at 18:35 Comment(0)
I
38
df.groupby(['Fruit','Name'])['Number'].sum()

Replace 'Number' with whichever column you want to sum, or pass a list of columns to sum several at once.

Irrawaddy answered 11/3, 2018 at 0:29 Comment(0)
H
25

A variation on the .agg() function. It lets you (1) keep the result as a DataFrame, (2) apply averages, counts, sums, etc., and (3) group by multiple columns while staying legible.

df.groupby(['att1', 'att2']).agg({'att1': "count", 'att3': "sum",'att4': 'mean'})

Using your values:

df.groupby(['Name', 'Fruit']).agg({'Number': "sum"})
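A short sketch of the dictionary form applied to two columns at once, on a shortened version of the question's data. Counting distinct dates is only an illustration of a second aggregation and was not part of the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apples", "Apples", "Apples", "Apples"],
    "Date": ["10/6/2016", "10/6/2016", "10/7/2016", "10/6/2016"],
    "Name": ["Bob", "Bob", "Bob", "Mike"],
    "Number": [7, 8, 1, 9],
})

# Each dictionary key is a column, each value the aggregation applied to it.
summary = df.groupby(["Name", "Fruit"]).agg({"Number": "sum", "Date": "nunique"})
print(summary)
```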
Hako answered 2/2, 2020 at 8:25 Comment(0)
D
13

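You can set the groupby columns as the index and then sum by index level. Older pandas versions allowed sum(level=[0,1]) directly; that argument was deprecated in pandas 1.3 and removed in 2.0, so group on the levels instead:

df.set_index(['Fruit','Name'])[['Number']].groupby(level=[0,1], sort=False).sum()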
Out[175]: 
               Number
Fruit   Name         
Apples  Bob        16
        Mike        9
        Steve      10
Oranges Bob        67
        Tom        15
        Mike       57
        Tony        1
Grapes  Bob        35
        Tom        87
        Tony       15
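A runnable sketch of grouping on index levels by name rather than by position, on a few made-up rows. Passing sort=False is intended to keep groups in order of first appearance, which would explain why the output above is not alphabetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Oranges", "Apples", "Apples", "Oranges"],
    "Name": ["Bob", "Bob", "Bob", "Tom"],
    "Number": [2, 7, 8, 15],
})

# Move the grouping columns into a MultiIndex.
indexed = df.set_index(["Fruit", "Name"])

# Group on the named index levels and sum the remaining column.
totals = indexed.groupby(level=["Fruit", "Name"], sort=False)["Number"].sum()
print(totals)
```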
Deferential answered 21/11, 2018 at 3:1 Comment(0)
C
11

You could also use transform() on the Number column after the group by. It computes the group total with sum and returns a Series with the same index as the original dataframe, so every row of a group carries that group's total.

df['Number'] = df.groupby(['Fruit', 'Name'])['Number'].transform('sum')
df = df.drop_duplicates(subset=['Fruit', 'Name']).drop(columns='Date')

Because the totals are repeated on every row, you then drop the duplicate rows on columns Fruit and Name. The Date column is dropped with drop(columns='Date'); the older positional form drop('Date', 1) is no longer accepted in pandas 2.0 and later.

# print(df)

      Fruit   Name  Number
0    Apples    Bob      16
2    Apples   Mike       9
3    Apples  Steve      10
5   Oranges    Bob      67
6   Oranges    Tom      15
7   Oranges   Mike      57
9   Oranges   Tony       1
10   Grapes    Bob      35
11   Grapes    Tom      87
14   Grapes   Tony      15

# You could achieve the same result with functions discussed by others: 
# print(df.groupby(['Fruit', 'Name'], as_index=False)['Number'].sum())
# print(df.groupby(['Fruit', 'Name'], as_index=False)['Number'].agg('sum'))
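The key point about transform is that it does not shrink the frame. A small sketch on four of the question's rows, writing the totals to a separate column (the name Total is illustrative) so the original values stay visible:

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apples", "Apples", "Apples", "Apples"],
    "Name": ["Bob", "Bob", "Mike", "Bob"],
    "Number": [7, 8, 9, 1],
})

# transform('sum') returns one value per original row, not one per group.
df["Total"] = df.groupby(["Fruit", "Name"])["Number"].transform("sum")
print(df["Total"].tolist())  # [16, 16, 9, 16]

# Collapsing to one row per group is a separate, explicit step.
per_group = df.drop_duplicates(subset=["Fruit", "Name"])[["Fruit", "Name", "Total"]]
print(len(df), len(per_group))  # 4 2
```

This is why transform alone still shows one row per original record: the deduplication step is what reduces it to one row per group.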

The official pandas user guide has a section, Group by: split-apply-combine, covering what you can do after a group by.

Clodhopping answered 18/3, 2021 at 11:52 Comment(3)
Hi Guys, your solution indeed work! my python version is 3.8, it seems that it did work if we only use sum().Mer
@Mer Don't understand, you only say it works so when does it not work?Clodhopping
Ynjxsjmh, I mean that if I only use df['Number'] = df.groupby(['Fruit', 'Name'])['Number'].transform('sum'), I do not get one summed row per ('Fruit', 'Name') pair. But if I add the line your answer suggests, df = df.drop_duplicates(subset=['Fruit', 'Name']), then I get the expected sums.Mer
R
3

If you want the aggregated column to have a custom name such as Total Number or Total (all the other solutions here produce a dataframe where the aggregated column is named Number), use named aggregation:

df.groupby(['Fruit', 'Name'], as_index=False).agg(**{'Total Number': ('Number', 'sum')})

or (if the custom name doesn't need to have a white space in it):

df.groupby(['Fruit', 'Name'], as_index=False).agg(Total=('Number', 'sum'))
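Named aggregation also accepts several outputs in one call. A sketch on a few of the question's rows; the output names Entries and Largest are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apples", "Apples", "Apples", "Apples"],
    "Name": ["Bob", "Bob", "Bob", "Mike"],
    "Number": [7, 8, 1, 9],
})

# Each keyword becomes an output column: name=(source column, aggregation).
out = df.groupby(["Fruit", "Name"], as_index=False).agg(
    Total=("Number", "sum"),
    Entries=("Number", "count"),
    Largest=("Number", "max"),
)
print(out)
```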

This is equivalent to the SQL query:

SELECT Fruit, Name, sum(Number) AS Total
FROM df 
GROUP BY Fruit, Name

Speaking of SQL, there is a pandasql module that lets you query pandas DataFrames in the local environment using SQL syntax. It is not part of pandas, so it has to be installed separately.

#! pip install pandasql
from pandasql import sqldf
sqldf("""
SELECT Fruit, Name, sum(Number) AS Total
FROM df 
GROUP BY Fruit, Name
""")
Recalcitrate answered 7/7, 2022 at 23:52 Comment(0)
C
1

You can use dfsql. For your problem, it will look something like:

df.sql('SELECT fruit, sum(number) GROUP BY fruit')

https://github.com/mindsdb/dfsql

Here is an article about it:

https://medium.com/riselab/why-every-data-scientist-using-pandas-needs-modin-bringing-sql-to-dataframes-3b216b29a7c0

Caligula answered 20/4, 2021 at 6:36 Comment(0)
N
1

You can use reset_index() to reset the index after the sum:

df.groupby(['Fruit','Name'])['Number'].sum().reset_index()

or

df.groupby(['Fruit','Name'], as_index=False)['Number'].sum()
Neurophysiology answered 16/12, 2022 at 8:20 Comment(0)
