Apply multiple functions to multiple groupby columns

The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys:

In [563]: grouped['D'].agg({'result1' : np.sum,
   .....:                   'result2' : np.mean})
   .....:
Out[563]: 
      result2   result1
A                      
bar -0.579846 -1.739537
foo -0.280588 -1.402938

However, this only works on a Series groupby object. And when a dict is similarly passed to a groupby DataFrame, it expects the keys to be the column names that the function will be applied to.

What I want to do is apply multiple functions to several columns (some columns will be operated on multiple times). Some functions will also depend on other columns in the groupby object (like sumif-style functions). My current solution is to go column by column, doing something like the code above and using lambdas for functions that depend on other rows. But this takes a long time (I think it is slow to iterate through a groupby object). I'd have to change it so that I iterate through the whole groupby object in a single run, but I'm wondering if there's a built-in way in pandas to do this somewhat cleanly.
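To make the description above concrete, here is a runnable sketch of that column-by-column approach on made-up data (the data, and the use of apply for the "sumif"-style column, are my own illustration, not from the original post):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1, 2, 3, 4],
                   'D': [10, 20, 30, 40]})
grouped = df.groupby('A')

# One pass per output column; the "sumif"-style column needs to see
# both C and D, so it goes through apply on the whole group.
result = pd.DataFrame({
    'C_sum': grouped['C'].sum(),
    'C_std': grouped['C'].std(),
    'D_sum': grouped['D'].sum(),
    'D_sumifC3': grouped.apply(lambda g: g['D'][g['C'] == 3].sum()),
})
```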

For example, I've tried something like

grouped.agg({'C_sum' : lambda x: x['C'].sum(),
             'C_std' : lambda x: x['C'].std(),
             'D_sum' : lambda x: x['D'].sum(),
             'D_sumifC3' : lambda x: x['D'][x['C'] == 3].sum(), ...})

but as expected I get a KeyError (since the keys have to be a column if agg is called from a DataFrame).

Is there any built in way to do what I'd like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?

Tambourine answered 25/1, 2013 at 20:26 Comment(0)

The second half of the currently accepted answer is outdated and relies on two deprecated features. First and most importantly, you can no longer pass a dictionary of dictionaries to the groupby agg method. Second, never use .ix.

If you want to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a DataFrame similar to the one above:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.

df.groupby('group').agg({'a':['sum', 'max'], 
                         'b':'mean', 
                         'c':'sum', 
                         'd': lambda x: x.max() - x.min()})

              a                   b         c         d
            sum       max      mean       sum  <lambda>
group                                                  
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401
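As a side note (my addition, not part of the original answer): if you would rather have flat column names than the MultiIndex columns shown above, a common follow-up is to join the two levels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]

res = df.groupby('group').agg({'a': ['sum', 'max'], 'b': 'mean'})

# Each column label is a tuple like ('a', 'sum'); join it into 'a_sum'.
res.columns = ['_'.join(col) for col in res.columns]
```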

If you don't like that ugly lambda column name, you can instead use a normal function and assign a custom name to its special __name__ attribute, like this:

def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a':['sum', 'max'], 
                         'b':'mean', 
                         'c':'sum', 
                         'd': max_min})

              a                   b         c             d
            sum       max      mean       sum Max minus Min
group                                                      
0      0.864569  0.446069  0.466054  0.969921      0.341399
1      1.478872  0.843026  0.687672  1.754877      0.672401

Using apply and returning a Series

Now, if you have multiple columns that need to interact together, then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group is passed to the function as a DataFrame.

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)

         a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494

If you are in love with MultiIndexes, you can still return a Series with one like this:

def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)

              a                   b       c_d
            sum       max      mean   prodsum
group                                        
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494
Gerardogeratology answered 3/11, 2017 at 19:44 Comment(15)
this is the only way I've found to aggregate a dataframe via multiple column inputs simultaneously (the c_d example above) – Caretaker
On large dataframes, this is very slow. What are ideas for more efficient solutions? – Proponent
indeed the neatest way in Python; in R, the data.table package easily beats it in both syntax brevity and speed: all you do is df[,.(sum(a),min(a),max(a),sum(c*d)),keyby=.(group)] – Kenji
@JejeBelfort "second half of the currently accepted answer is outdated" – this is not referring to this answer, right? – Morgun
@Morgun Sorry, I don't understand your question. Are you sure it is addressed to me? – Misplace
@Misplace I thought you might know because you were the last person to edit this post... but maybe TedPetrou is the best person to clarify this. The text I quoted is from the first line of this post. – Morgun
@Morgun Oh ok, I get it. I think what happened is that before Ted posted his answer, the one from Zelazny7 below was the accepted one. The second part of Zelazny7's answer was later deprecated, and Ted's answer was accepted instead. Zelazny7's answer has since been updated by RK1, so I guess both answers work now ;) – Misplace
Since you can normally pass a name in the .agg function, like df.groupby("group").agg(max_func="max"), how can you change the name when passing a dictionary to .agg? – Helicograph
If the (new) columns have different types, dtype=object can be specified in pd.Series to prevent type conversions, and the columns in the resulting DataFrame are properly typed (not object). – Sennacherib
This will also force int64 to become float within the same pd.Series. It is also very slow. I think it's better to write your own aggregator. – Nirvana
I can't get this to work for MultiIndex tables – Tullus
The last line of function f can be made more robust with return pd.Series(d, index=list(d.keys())) – Impression
I think this is the only way to apply multiple lambdas that each use multiple columns of the dataframe within one aggregation. Perfect solution! – Mane
@Impression In this case it could even be simplified to pd.Series(d), as the documentation for Series states: "If data is dict-like and index is None, then the keys in the data are used as the index." – Sasnett
@Ted, how can we use first or last or any other function with apply? Is it something like x['str_column'].str.get(0)? – Cyclostyle

For the first part, you can pass a dict with the column names as keys and lists of functions as the values:

In [28]: df
Out[28]:
          A         B         C         D         E  GRP
0  0.395670  0.219560  0.600644  0.613445  0.242893    0
1  0.323911  0.464584  0.107215  0.204072  0.927325    0
2  0.321358  0.076037  0.166946  0.439661  0.914612    1
3  0.133466  0.447946  0.014815  0.130781  0.268290    1

In [26]: f = {'A':['sum','mean'], 'B':['prod']}

In [27]: df.groupby('GRP').agg(f)
Out[27]:
            A                   B
          sum      mean      prod
GRP
0    0.719580  0.359790  0.102004
1    0.454824  0.227412  0.034060

UPDATE 1:

Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function.

Here's a hacky workaround:

In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}

In [69]: df.groupby('GRP').agg(f)
Out[69]:
            A                   B         D
          sum      mean      prod  <lambda>
GRP
0    0.719580  0.359790  0.102004  1.170219
1    0.454824  0.227412  0.034060  1.182901

Here, the resultant 'D' column is made up of the summed 'E' values.

UPDATE 2:

Here's a method that I think will do everything you ask. First make a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing g.index to df.loc[] selects the current group from df. I then test if column C is less than 0.5. The returned boolean Series is passed to g[], which selects only those rows meeting the criteria.

In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()

In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}

In [97]: df.groupby('GRP').agg(f)
Out[97]:
            A                   B         D
          sum      mean      prod   my name
GRP
0    0.719580  0.359790  0.102004  0.204072
1    0.454824  0.227412  0.034060  0.570441
Eriha answered 25/1, 2013 at 20:40 Comment(6)
Interesting, I can also pass a dict of {funcname: func} as values instead of lists to keep my custom names. But in either case I can't pass a lambda that uses other columns (like lambda x: x['D'][x['C'] < 3].sum() above: "KeyError: 'D'"). Any idea if that's possible? – Tambourine
I've been trying to do exactly that, and I get the error KeyError: 'D' – Eriha
Cool, I got it to work with df['A'].ix[g.index][df['C'] < 0].sum(). This is starting to get pretty messy, though; I think for readability manual looping may be preferable, plus I'm not sure there's a way to give it my preferred name in the agg argument (instead of <lambda>). I'll hold out hope that someone may know a more straightforward way... – Tambourine
You can pass a dict for the column value {'D': {'my name': lambda function}} and it will make the inner dict key the column name. – Eriha
I believe that pandas now supports multiple functions applied to a grouped-by dataframe: pandas.pydata.org/pandas-docs/stable/… – Gaily
What if one of the functions is a list? – Subinfeudate

Pandas >= 0.25.0: named aggregations

Since pandas version 0.25.0, we are moving away from dictionary-based aggregation and renaming, and towards named aggregation, which accepts tuples. Now we can simultaneously aggregate and rename to more informative column names:

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]

          a         b         c         d  group
0  0.521279  0.914988  0.054057  0.125668      0
1  0.426058  0.828890  0.784093  0.446211      0
2  0.363136  0.843751  0.184967  0.467351      1
3  0.241012  0.470053  0.358018  0.525032      1

Apply GroupBy.agg with named aggregation:

df.groupby('group').agg(
             a_sum=('a', 'sum'),
             a_mean=('a', 'mean'),
             b_mean=('b', 'mean'),
             c_sum=('c', 'sum'),
             d_range=('d', lambda x: x.max() - x.min())
)

          a_sum    a_mean    b_mean     c_sum   d_range
group                                                  
0      0.947337  0.473668  0.871939  0.838150  0.320543
1      0.604149  0.302074  0.656902  0.542985  0.057681
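Named aggregation only sees one column at a time, so expressions over several columns do not fit directly (see the comments). One workaround, which is my own addition and not part of the official syntax: precompute a helper column with assign, then aggregate it by name:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'c': [1.0, 2.0, 3.0, 4.0],
                   'd': [10.0, 10.0, 10.0, 10.0],
                   'group': [0, 0, 1, 1]})

out = (df.assign(cd=df['c'] * df['d'])   # helper column built from two inputs
         .groupby('group')
         .agg(a_sum=('a', 'sum'),
              c_d_prodsum=('cd', 'sum')))
```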
Entoil answered 3/8, 2019 at 13:50 Comment(6)
I like these named aggregations, but I could not see how we are supposed to use them with multiple columns? – Ugo
Good question; I could not figure this out either and doubt it is possible (yet). I opened a ticket for this. Will keep my question and you updated. Thanks for pointing it out, @SimonWoodhead – Entoil
any progress on doing this with multiple columns? i.e. (['a', 'b'], 'sum') – Kore
@DerekEden have you found out? – Hypocaust
@Hypocaust no, sorry – Kore
I have suggested a workaround for aggregations based on multiple columns below. – Warr

As an alternative (mostly on aesthetics) to Ted Petrou's answer, I found I preferred a slightly more compact listing. Please don't consider accepting it; it's just a much-more-detailed comment on Ted's answer, plus code/data. Python/pandas is not my first/best language, but I found this reads well:

df.groupby('group') \
  .apply(lambda x: pd.Series({
      'a_sum'       : x['a'].sum(),
      'a_max'       : x['a'].max(),
      'b_mean'      : x['b'].mean(),
      'c_d_prodsum' : (x['c'] * x['d']).sum()
  })
)

          a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      0.530559  0.374540  0.553354     0.488525
1      1.433558  0.832443  0.460206     0.053313

I find it more reminiscent of dplyr pipes and data.table chained commands. Not to say they're better, just more familiar to me. (I certainly recognize the power of, and for many the preference for, more formalized def functions for these types of operations. This is just an alternative, not necessarily better.)


I generated data in the same manner as Ted, but I'll add a seed for reproducibility.

import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.374540  0.950714  0.731994  0.598658      0
1  0.156019  0.155995  0.058084  0.866176      0
2  0.601115  0.708073  0.020584  0.969910      1
3  0.832443  0.212339  0.181825  0.183405      1
Cates answered 1/11, 2018 at 6:42 Comment(2)
I like this answer the most. This is similar to dplyr pipes in R. – Baucis
To make this complete like Ted Petrou's answer: if you want multi-indexes, you can specify tuples as the keys of the dictionary you feed into pd.Series. For example, ('a', 'sum'): x['a'].sum() instead of 'a_sum': x['a'].sum() – Grumpy

New in version 0.25.0.

To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

  • The keywords are the output column names
  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
>>> animals = pd.DataFrame({
...     'kind': ['cat', 'dog', 'cat', 'dog'],
...     'height': [9.1, 6.0, 9.5, 34.0],
...     'weight': [7.9, 7.5, 9.9, 198.0]
... })

>>> print(animals)
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

>>> print(
...     animals
...     .groupby('kind')
...     .agg(
...         min_height=pd.NamedAgg(column='height', aggfunc='min'),
...         max_height=pd.NamedAgg(column='height', aggfunc='max'),
...         average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
...     )
... )
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

pandas.NamedAgg is just a namedtuple. Plain tuples are allowed as well.

>>> print(
...     animals
...     .groupby('kind')
...     .agg(
...         min_height=('height', 'min'),
...         max_height=('height', 'max'),
...         average_weight=('weight', np.mean),
...     )
... )
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column, aggfunc) should be passed as **kwargs. If your aggregation functions require additional arguments, partially apply them with functools.partial().
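For example, a sketch of the functools.partial() point (the quantile helper and the h90 name are my own illustration, reusing the animals frame from above):

```python
import functools

import pandas as pd

animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0]})

def quantile(s, q):
    return s.quantile(q)

# agg won't forward extra kwargs, so bind q=0.9 with functools.partial.
res = animals.groupby('kind').agg(
    h90=('height', functools.partial(quantile, q=0.9)),
)
```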

Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.

>>> print(
...     animals
...     .groupby('kind')
...     .height
...     .agg(
...         min_height='min',
...         max_height='max',
...     )
... )
      min_height  max_height
kind                        
cat          9.1         9.5
dog          6.0        34.0
Bighorn answered 21/11, 2019 at 2:49 Comment(2)
My next comment is a tip showing how to use a dictionary of named aggs. I can't seem to format the code nicely in a comment, though, so I've also created an answer down below. – Euroclydon
agg_dict = {"min_height": pd.NamedAgg(column='height', aggfunc='min'), "max_height": pd.NamedAgg(column='height', aggfunc='max'), "average_weight": pd.NamedAgg(column='weight', aggfunc=np.mean)}; animals.groupby("kind").agg(**agg_dict) – Euroclydon

This is a twist on 'exans' answer that uses named aggregations. It's the same, but with argument unpacking, which allows you to still pass a dictionary to the agg function.

Named aggs are a nice feature, but at first glance they might seem hard to write programmatically since they use keywords; argument/keyword unpacking actually makes this simple.

import numpy as np
import pandas as pd

animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0],
                        'weight': [7.9, 7.5, 9.9, 198.0]})

agg_dict = {
    "min_height": pd.NamedAgg(column='height', aggfunc='min'),
    "max_height": pd.NamedAgg(column='height', aggfunc='max'),
    "average_weight": pd.NamedAgg(column='weight', aggfunc=np.mean)
}

animals.groupby("kind").agg(**agg_dict)

The Result

      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
Euroclydon answered 22/7, 2020 at 18:4 Comment(0)

Ted's answer is amazing. I ended up using a smaller version of it, in case anyone is interested. It's useful when you are looking for one aggregation that depends on values from multiple columns:

create a dataframe

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6], 
    'b': [1, 1, 0, 1, 1, 0], 
    'c': ['x', 'x', 'y', 'y', 'z', 'z']
})

print(df)
   a  b  c
0  1  1  x
1  2  1  x
2  3  0  y
3  4  1  y
4  5  1  z
5  6  0  z

grouping and aggregating with apply (using multiple columns)

print(
    df
    .groupby('c')
    .apply(lambda x: x['a'][(x['a'] > 1) & (x['b'] == 1)].mean())
)
c
x    2.0
y    4.0
z    5.0

grouping and aggregating with aggregate (using multiple columns)

I like this approach since I can still use aggregate. Perhaps someone can explain why apply is needed for getting at multiple columns when doing aggregations on groups.

It seems obvious now, but as long as you don't select the column of interest directly after the groupby, you will have access to all the columns of the dataframe from within your aggregation function.

only access to the selected column

df.groupby('c')['a'].aggregate(lambda x: x[x > 1].mean())

access to all columns since selection is after all the magic

df.groupby('c').aggregate(lambda x: x[(x['a'] > 1) & (x['b'] == 1)].mean())['a']

or similarly

df.groupby('c').aggregate(lambda x: x['a'][(x['a'] > 1) & (x['b'] == 1)].mean())

I hope this helps.

Eterne answered 12/7, 2018 at 12:29 Comment(0)
df.groupby('c').aggregate(lambda x: x[(x['a'] > 1) & (x['b'] == 1)].mean())['a']

and

df.groupby('c').aggregate(lambda x: x['a'][(x['a'] > 1) & (x['b'] == 1)].mean())

do not work; both throw KeyError: 'a'. (In current pandas, agg passes each column to the function as a separate Series, so the other columns are not accessible.)

This one works (@r2evans's answer):

df.groupby('group') \
  .apply(lambda x: pd.Series({
      'a_sum'       : x['a'].sum(),
      'a_max'       : x['a'].max(),
      'b_mean'      : x['b'].mean(),
      'c_d_prodsum' : (x['c'] * x['d']).sum()
  })
)
Solace answered 28/9, 2023 at 21:42 Comment(0)

I found @Erfan's answer very helpful. I would like to suggest additional techniques for processing data from multiple columns. Consider the following aggregation as an example:

def series_range(series):
    return series.max() - series.min()

grouper = _df.reset_index().groupby('period_num')

_df_periods = grouper.agg(
    num_ashps=('count', 'min'),
    start=('timestamp', 'min'),
    length=('timestamp', series_range),
    elec_used=('elec_meter', series_range),
    heat_made=('heat_made', series_range),
    outdoor_temp=('outdoor_temp', 'mean')
)

One option is of course to add columns afterwards, if they can be computed from data that's already been summarised in the aggregated DataFrame, for example:

_df_periods['cop'] = _df_periods.heat_made / _df_periods.elec_used
_df_periods['length_mins'] = _df_periods.length.view(int) / (60 * 1e9)
_df_periods['avg_power'] = _df_periods.elec_used / (_df_periods.length_mins / 60)

Another option is to add columns afterwards, but based on data computed by subscripting the DataFrameGroupBy object (grouper in this case), to extract certain columns from the source DataFrame. For example, if heat_made or elec_used had not already been aggregated into _df_periods, we could have computed cop thus instead:

_df_periods['cop'] = ((grouper['heat_made'].max() -
                       grouper['heat_made'].min()) / 
                      (grouper['elec_meter'].max() - 
                       grouper['elec_meter'].min()))

This reuse of the same grouper object may help to keep the code short and readable. I believe the GroupBy object computes and caches its group indices on first use, so reusing grouper avoids recomputing the grouping itself, although each aggregation is still a separate pass over the data.
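A minimal, self-contained sketch of this grouper-reuse pattern (toy data invented for illustration, not the `_df` above):

```python
import pandas as pd

df = pd.DataFrame({'g': [0, 0, 1, 1], 'x': [1.0, 3.0, 5.0, 5.0]})
grouper = df.groupby('g')

# Both aggregations go through the same GroupBy object.
rng = grouper['x'].max() - grouper['x'].min()
```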

Warr answered 21/12, 2023 at 19:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.