Pandas: Aggregate of DataFrameGroupby

Asked 19/7, 2020 at 12:8 Answered 19/7, 2020 at 12:53

Solved python pandas group-by aggregate pandas-groupby

I am working with this data set census.csv

Code:

df = pd.read_csv('Data/census.csv')
df = df[df['SUMLEV']==50]

print(df.head())

This is how my data looks like:

   SUMLEV  REGION  DIVISION  ...  RNETMIG2013  RNETMIG2014 RNETMIG2015
1      50       3         6  ...    -2.722002     2.592270   -2.187333
2      50       3         6  ...    22.727626    20.317142   18.293499
3      50       3         6  ...    -7.167664    -3.978583  -10.543299
4      50       3         6  ...    -5.403729     0.754533    1.107861
5      50       3         6  ...    -1.402476    -1.577232   -0.884411

I wanted to take the aggregate of two columns after grouping by 'STNAME':

(df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'].agg(Avg= np.average, Sum = np.sum))

Error:

----> 3 (df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'].agg(Avg= np.average, Sum = np.sum))

f:\software_installations\anaconda3\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, *args, **kwargs)
    922         elif func is None:
    923             # nicer error message
--> 924             raise TypeError("Must provide 'func' or tuples of '(column, aggfunc).")
    925 
    926         func = _maybe_mangle_lambdas(func)

TypeError: Must provide 'func' or tuples of '(column, aggfunc).

Ademption answered 19/7, 2020 at 12:8 Comment(0)

While others were fast to give the one-liner code snippet, I try to explain a bit more about kind of options you have, and what is the syntax understood by the pandas agg() function.

What are we dealing with?

The type of the object you are dealing with is

type(df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']])
# pandas.core.groupby.generic.DataFrameGroupBy

Therefore, it might be a good start to time look at the documentation of the pandas.DataFrameGroupby.agg

What do the docs say?

The argument given to agg can be

(1) string (function name)
(2) function
(3) list of functions
(4) dict of column  names -> functions (or list of functions).

(1) string (function name)

If you give function name string as the argument, it must be a "function name that pandas understands". Understood function names are at least: 'sum','mean','std'. Example:

In [24]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg('mean')
Out[24]:
                      POPESTIMATE2010  POPESTIMATE2011
STNAME
Alabama                  71420.313433     71658.328358
Alaska                   24621.413793     24921.379310
Arizona                 427213.866667    431248.800000
...

Since you want two things to be calculated: mean and sum, you need two function calls. One with 'mean' and one with 'sum'.

(2) function

You can also give any function as an argument. The function should take array-like data (pd.Series) as input, and produce scalar value out of it. Example:

In [25]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg(np.mean)
Out[25]:
                      POPESTIMATE2010  POPESTIMATE2011
STNAME
Alabama                  71420.313433     71658.328358
Alaska                   24621.413793     24921.379310
...

Since you want two things to be calculated: mean and sum, you need two function calls. One with np.mean and one with np.sum.

(3) list of functions

You may also give a list of functions for the argument for agg(). Example:

In [27]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg([np.mean, np.sum])
Out[27]:
                     POPESTIMATE2010           POPESTIMATE2011
                                mean       sum            mean       sum
STNAME
Alabama                 71420.313433   4785161    71658.328358   4801108
Alaska                  24621.413793    714021    24921.379310    722720
Arizona                427213.866667   6408208   431248.800000   6468732
...

This has the upside that you need only one function call. Use this, if you want the same operations to be calculated with multiple columns

(4) Dict to column names -> functions

If you give a dictionary as the argument to agg(), then the keys must represent the column names in the dataframe, and the values should be either functions or lists of functions. Example:

In [30]: In [27]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg({"POPESTIMATE2010": [np.mean, np.sum], "POPESTIMATE2011": [np.mean, np.sum]})
Out[30]:
                     POPESTIMATE2010           POPESTIMATE2011
                             average       sum            mean       sum
STNAME
Alabama                 71420.313433   4785161    71658.328358   4801108
Alaska                  24621.413793    714021    24921.379310    722720
Arizona                427213.866667   6408208   431248.800000   6468732
...

This has the upside that you need only one function call. Use this, if you want different operations to be calculated with different columns

Dust answered 19/7, 2020 at 12:45 Comment(3)

Impressive, appreciate the effort. – Dodi 19/7, 2020 at 12:48

You explained every option with a use case which is exactly what i needed to understand. Thanks! – Ademption 19/7, 2020 at 13:2

Would be nice to wrap the example to 70 or 88 characters using additional outer parenthesis if needed. – Candelabra 17/3, 2022 at 22:52

How about:

df.groupby('STNAME')[['POPESTIMATE2010','POPESTIMATE2011']].agg(['mean', 'sum'])

Note that you need double square bracket after groupby in this case.

Crevice answered 19/7, 2020 at 12:16 Comment(3)

No, you need one pair of brackets [] to access the columns. Inside the bracket, you need to pass either a column or a list of columns. Here, you want a list, so you need two pairs. – Crevice 19/7, 2020 at 12:44

This works! But its also working with single bracket as well. – Ademption 19/7, 2020 at 12:44

I was also surprised it worked with a single pair of brackets (using pandas 0.24.2). I don't know if it is a feature of groupby() which makes this possible, since accessing df['POPESTIMATE2010','POPESTIMATE2011'] will raise a KeyError (as expected) . – Dust 19/7, 2020 at 12:58

try this,

import numpy as np

df.set_index('STNAME').groupby(level=0).agg(
    {"POPESTIMATE2010": [np.average, np.sum], "POPESTIMATE2011": [np.average, np.sum]})

Dodi answered 19/7, 2020 at 12:18 Comment(0)

As you see the Error it clearly saying that we have to specify a function or tuples of (column,aggfunc).

Following is the correct way to call agg function

df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg(['mean','sum'])

if you notice the double bracket after groupby ( [[ ]]) that because panda suggest that otherwise, you will get the warning like below

FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.

also, you are using np.avarage that will give warming for empty slices like below :

RuntimeWarning: Mean of empty slice.

Lip answered 19/7, 2020 at 12:53 Comment(0)