pandas groupby apply on multiple columns to generate a new column

I

2

6

I like to generate a new column in pandas dataframe using groupby-apply.

For example, I have a dataframe:

df = pd.DataFrame({'A':[1,2,3,4],'B':['A','B','A','B'],'C':[0,0,1,1]})

and try to generate a new column 'D' by groupby-apply.

This works:

df = df.assign(D=df.groupby('B').C.apply(lambda x: x - x.mean()))

as (I think) it returns a series with the same index with the dataframe:

In [4]: df.groupby('B').C.apply(lambda x: x - x.mean())
Out[4]:
0   -0.5
1   -0.5
2    0.5
3    0.5
Name: C, dtype: float64

But if I try to generate a new column using multiple columns, I cannot assign it directly to a new column. So this doesn't work:

 df.assign(D=df.groupby('B').apply(lambda x: x.A - x.C.mean()))

returning

TypeError: incompatible index of inserted column with frame index

and in fact, the groupby-apply returns:

In [8]: df.groupby('B').apply(lambda x: x.A - x.C.mean())
Out[8]:
B
A  0    0.5
   2    2.5
B  1    1.5
   3    3.5
Name: A, dtype: float64

I could do

df.groupby('B').apply(lambda x: x.A - x.C.mean()).reset_index(level=0,drop=True))

but it seems verbose and I am not sure if this will work as expected always.

So my question is: (i) when does pandas groupby-apply return a like-indexed series vs a multi-index series? (ii) is there a better way to assign a new column by groupby-apply to multiple columns?

Italy answered 10/11, 2017 at 16:31 Comment(0)

O

5

Let's use group_keys=False in the groupby

df.assign(D=df.groupby('B', group_keys=False).apply(lambda x: x.A - x.C.mean()))

Output:

   A  B  C    D
0  1  A  0  0.5
1  2  B  0  1.5
2  3  A  1  2.5
3  4  B  1  3.5

Osman answered 10/11, 2017 at 18:8 Comment(3)

This directly answers my question. – Italy 10/11, 2017 at 22:58

Be aware that this solution only works if the .apply() method returns a series with the same number of rows as the data frame your are assigning to. Otherwise the index of the returned series and the data frame will not match. – Haldeman 5/1, 2020 at 18:59

@Haldeman Thanks for pointing that out, I was wondering how something like that would possibly work – Outlandish 29/1, 2021 at 16:47

S

5

For this case I do not think include the column A in apply is necessary, we can use transform

df.A-df.groupby('B').C.transform('mean')
Out[272]: 
0    0.5
1    1.5
2    2.5
3    3.5
dtype: float64

And you can assign it back

df['diff']= df.A-df.groupby('B').C.transform('mean')
df
Out[274]: 
   A  B  C  diff
0  1  A  0   0.5
1  2  B  0   1.5
2  3  A  1   2.5
3  4  B  1   3.5

Sarisarid answered 10/11, 2017 at 16:36 Comment(0)

O

5