pandas groupby aggregate customised function with multiple columns
I am trying to use a customised function with groupby in pandas. I find that using apply allows me to do that in the following way:

(An example which calculates a new mean from two groups)

import pandas as pd

def newAvg(x):
    x['cm'] = x['count']*x['mean']
    sCount = x['count'].sum()
    sMean = x['cm'].sum()
    return sMean/sCount

data = [['A', 4, 2.5], ['A', 3, 6], ['B', 4, 9.5], ['B', 3, 13]]
df = pd.DataFrame(data, columns=['pool', 'count', 'mean'])

df_gb = df.groupby(['pool']).apply(newAvg)

Is it possible to integrate this into an agg function? Along these lines:

df.groupby(['pool']).agg({'count': sum, ['count', 'mean']: apply(newAvg)})
Oleviaolfaction answered 14/1, 2019 at 14:37

If I understand correctly (IIUC):

df.groupby(['pool']).apply(lambda x : pd.Series({'count':sum(x['count']),'newavg':newAvg(x)}))
Out[58]: 
      count  newavg
pool               
A       7.0     4.0
B       7.0    11.0
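For reference, the snippet above depends on the question's df and newAvg; a minimal self-contained sketch (with newAvg rewritten inline so it does not mutate the group, an adjustment of mine rather than the answer's code) would be:

```python
import pandas as pd

data = [['A', 4, 2.5], ['A', 3, 6], ['B', 4, 9.5], ['B', 3, 13]]
df = pd.DataFrame(data, columns=['pool', 'count', 'mean'])

def newAvg(x):
    # weighted mean of 'mean', weighted by 'count', without modifying x
    return (x['count'] * x['mean']).sum() / x['count'].sum()

out = df.groupby('pool').apply(
    lambda x: pd.Series({'count': x['count'].sum(), 'newavg': newAvg(x)})
)
```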
Guss answered 14/1, 2019 at 14:48
I like this a lot. Thanks to everyone though. (Oleviaolfaction)
@Oleviaolfaction happy coding (Guss)

The agg function works with each column separately, so one possible solution is to first create the column cm with assign, then aggregate with sum, and finally divide the two columns:

df_gb = df.assign(cm=df['count']*df['mean']).groupby('pool')[['cm','count']].sum()
print(df_gb)
        cm  count
pool             
A     28.0      7
B     77.0      7

out = df_gb.pop('cm') / df_gb.pop('count')
print (out)
pool
A     4.0
B    11.0
dtype: float64
Tails answered 14/1, 2019 at 14:38

Use assign with eval:

df.assign(cm=df['count']*df['mean'])\
  .groupby('pool', as_index=False)[['cm','count']].sum()\
  .eval('AggCol = cm / count')

Output:

  pool    cm  count  AggCol
0    A  28.0      7     4.0
1    B  77.0      7    11.0
Vaporescence answered 14/1, 2019 at 14:52

A dictionary with agg is used to perform separate calculations for each series. For your problem, I suggest pd.concat:

g = df.groupby('pool')
res = pd.concat([g['count'].sum(), g.apply(newAvg).rename('newAvg')], axis=1)

print(res)

#       count  newAvg
# pool               
# A         7     4.0
# B         7    11.0

This isn't the most efficient solution, as your function newAvg performs calculations which could be done on the entire dataframe up front, but it does support arbitrary pre-defined calculations.
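The more efficient, vectorised route hinted at above might look like the following sketch (precomputing count*mean on the whole frame, then a single grouped sum; this is my reading of the suggestion, not code from the answer):

```python
import pandas as pd

data = [['A', 4, 2.5], ['A', 3, 6], ['B', 4, 9.5], ['B', 3, 13]]
df = pd.DataFrame(data, columns=['pool', 'count', 'mean'])

# compute count*mean once on the full frame, then one grouped sum
tmp = df.assign(cm=df['count'] * df['mean']).groupby('pool')[['count', 'cm']].sum()
res = tmp.assign(newAvg=tmp['cm'] / tmp['count'])[['count', 'newAvg']]
```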

Shelve answered 14/1, 2019 at 14:44

If you are calculating a weighted average, you can do it easily using agg and NumPy's np.average function, then read off the Series for the 'mean' column:

import numpy as np

df_gb = df.groupby(['pool']).agg(lambda x: np.average(x['mean'], weights=x['count']))['mean']

You could also do it using your newAvg function, although this will produce warnings:

df_gb2 = df.groupby(['pool']).agg(newAvg)['mean']

If you are willing to use newAvg function, you can redefine it to avoid working on copies:

def newAvg(x):
    cm = x['count']*x['mean']
    sCount = x['count'].sum()
    sMean = cm.sum()
    return sMean/sCount

With this modification, you get your expected output:

df_gb2 = df.groupby(['pool']).agg(newAvg)['mean']
print(df_gb2)

# pool
# A     4.0
# B    11.0
# Name: mean, dtype: float64
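One caveat: newer pandas versions removed agg's silent fallback to applying the function to the whole group, so the agg-based lambdas above may raise a KeyError there (agg hands each column's Series to the function, where x['mean'] is undefined). A version-robust sketch uses apply instead, which always receives the full group DataFrame:

```python
import numpy as np
import pandas as pd

data = [['A', 4, 2.5], ['A', 3, 6], ['B', 4, 9.5], ['B', 3, 13]]
df = pd.DataFrame(data, columns=['pool', 'count', 'mean'])

# apply passes each group as a DataFrame, so both columns are available
df_gb = df.groupby('pool').apply(lambda g: np.average(g['mean'], weights=g['count']))
```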
Alemanni answered 14/1, 2019 at 17:17
