How to apply *multiple* functions to pandas groupby apply?

I have a dataframe that I want to group, and then apply several functions to each group. Normally, I would do this with groupby().agg() (cf. Apply multiple functions to multiple groupby columns), but the functions I'm interested in take multiple columns as input rather than a single one.

I learned that when I have one function that takes multiple columns as input, I need apply (cf. Pandas DataFrame aggregate function using multiple columns). But what do I need when I have multiple functions that each take multiple columns as input?

import pandas as pd
df = pd.DataFrame({'x':[2, 3, -10, -10], 'y':[10, 13, 20, 30], 'id':['a', 'a', 'b', 'b']})

def mindist(data):  # of course these functions are more complicated in reality
    return min(data['y'] - data['x'])

def maxdist(data):
    return max(data['y'] - data['x'])

I would expect something like df.groupby('id').apply([mindist, maxdist])

    min  max
id
a     8   10
b    30   40

(achieved with pd.DataFrame({'mindist': df.groupby('id').apply(mindist), 'maxdist': df.groupby('id').apply(maxdist)}), which obviously isn't very handy if I have a dozen functions to apply to the grouped dataframe). Initially I thought this OP had the same question, but he seems to be fine with aggregate, meaning his functions take only one column as input.

Fivefinger answered 12/8, 2019 at 15:44 Comment(0)

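For this specific issue, how about computing the difference first and grouping afterwards? (Note the order y - x, to match mindist and maxdist in the question.)

(df['y'] - df['x']).groupby(df['id']).agg(['min', 'max'])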
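
If the result columns should carry the names from the question, named aggregation is one option. The following is only a sketch for this specific case: it assumes the distance can be computed row by row before grouping, and the column name `dist` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [2, 3, -10, -10],
                   'y': [10, 13, 20, 30],
                   'id': ['a', 'a', 'b', 'b']})

# Compute the row-wise distance once ('dist' is an illustrative name),
# then aggregate that single column per group with named aggregation.
result = (df.assign(dist=df['y'] - df['x'])
            .groupby('id')['dist']
            .agg(mindist='min', maxdist='max'))
print(result)
```

This only helps when every function reduces the same precomputed column; functions that genuinely need several columns at once still need apply.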

More generically, you could probably do something like

df.groupby('id').apply(lambda x:pd.Series({'min':mindist(x),'max':maxdist(x)}))
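
With a dozen functions, writing the dict by hand gets tedious. A possible sketch is a small wrapper that builds the Series from a list of functions, using each function's `__name__` as the column label. The helper name `apply_all` is hypothetical, not a pandas API, and selecting `[['x', 'y']]` before apply is an assumption made here so that the functions only see the data columns.

```python
import pandas as pd

df = pd.DataFrame({'x': [2, 3, -10, -10],
                   'y': [10, 13, 20, 30],
                   'id': ['a', 'a', 'b', 'b']})

def mindist(data):
    return (data['y'] - data['x']).min()

def maxdist(data):
    return (data['y'] - data['x']).max()

def apply_all(funcs):
    """Hypothetical helper: return a function that runs every func on a
    group and labels each result with the func's name."""
    def wrapper(group):
        return pd.Series({f.__name__: f(group) for f in funcs})
    return wrapper

# Each group is passed to the wrapper as a DataFrame with columns x and y.
result = df.groupby('id')[['x', 'y']].apply(apply_all([mindist, maxdist]))
print(result)
```

Adding another function then only means appending it to the list.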
Voelker answered 12/8, 2019 at 16:27 Comment(1)
The general solution is what I was looking for. (As I mention in the comments of my code, the functions are more complicated in reality - so imo you can delete the first part of your answer ;)) - Fivefinger

IIUC you want to use several functions within the same group. In this case you should return a pd.Series. In the following toy example I want to

  1. sum columns A and B then calculate the mean
  2. sum columns C and D then calculate the std
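
import numpy as np
import pandas as pd

# pd.util.testing.makeDataFrame() is not available in recent pandas versions,
# so build a comparable random frame with columns A-D directly.
df = pd.DataFrame(np.random.randn(10, 4), columns=list("ABCD"))
df["key"] = ["key1"] * 5 + ["key2"] * 5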

def fun(x):
    m = (x["A"]+x["B"]).mean()
    s = (x["C"]+x["D"]).std()
    return pd.Series({"meanAB":m, "stdCD":s})

df.groupby("key").apply(fun)

Update: in your case this becomes

import pandas as pd

df = pd.DataFrame({'x':[2, 3, -10, -10],
                   'y':[10, 13, 20, 30],
                   'id':['a', 'a', 'b', 'b']})

def mindist(data):  # of course these functions are more complicated in reality
    return min(data['y'] - data['x'])

def maxdist(data):
    return max(data['y'] - data['x'])

def fun(data):
    return pd.Series({"maxdist":maxdist(data),
                      "mindist":mindist(data)})

df.groupby('id').apply(fun)
Edam answered 12/8, 2019 at 16:47 Comment(0)
