Pandas GroupBy.apply method duplicates first group

Asked 27/1, 2014 at 19:37 Answered 20/5, 2019 at 6:32

Solved python pandas group-by pandas-groupby

My first SO question: I am confused by the behavior of the apply method of groupby in pandas (0.12.0-4). It appears to apply the function TWICE to the first row of a data frame. For example:

>>> from pandas import Series, DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})
>>> print(df)
  class  count
0     A      1
1     B      0
2     C      2

I first check that the groupby function works ok, and it seems to be fine:

>>> for group in df.groupby('class', group_keys = True):
...     print(group)
('A',   class  count
0     A      1)
('B',   class  count
1     B      0)
('C',   class  count
2     C      2)

Then I try to do something similar using apply on the groupby object and I get the first row output twice:

>>> def checkit(group):
...     print(group)
...
>>> df.groupby('class', group_keys = True).apply(checkit)
  class  count
0     A      1
  class  count
0     A      1
  class  count
1     B      0
  class  count
2     C      2

Any help would be appreciated! Thanks.

Edit: @Jeff provides the answer below. I am dense and did not understand it immediately, so here is a simple example to show that despite the double printout of the first group in the example above, the apply method operates only once on the first group and does not mutate the original data frame:

>>> def addone(group):
...     group['count'] += 1
...     return group
...
>>> df.groupby('class', group_keys = True).apply(addone)
>>> print(df)

  class  count
0     A      1
1     B      0
2     C      2

But by assigning the return of the method to a new object, we see that it works as expected:

>>> df2 = df.groupby('class', group_keys = True).apply(addone)
>>> print(df2)

  class  count
0     A      2
1     B      1
2     C      3

Dorfman answered 27/1, 2014 at 19:37 Comment(2)

From v0.25, the behaviour will change so the first group is only evaluated once. Please see here. – Lipstick 20/5, 2019 at 6:37

Please update the accepted answer to this answer, as the old answer is no longer valid. – Penal 30/8, 2020 at 6:26

This "issue" has now been fixed: Upgrade to 0.25+

Starting from v0.25, GroupBy.apply() will only evaluate the first group once. See GH24748.

What’s new in 0.25.0 (July 18, 2019): Groupby.apply on DataFrame evaluates first group only once

Relevant example from documentation:

import pandas as pd

pd.__version__
# '0.25.0.dev0+590.g44d5498d8'

df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

def func(group):
    print(group.name)
    return group

New behaviour (>=v0.25):

df.groupby('a').apply(func)                                                                                            
x
y

   a  b
0  x  1
1  y  2

Old behaviour (<=v0.24.x):

df.groupby('a').apply(func)
x
x
y

   a  b
0  x  1
1  y  2

Pandas still uses the first group to determine whether apply can take a fast path or not. But at least it no longer has to evaluate the first group twice. Nice work, devs!

Lipstick answered 20/5, 2019 at 6:32 Comment(3)

Oh so basically Pandas will still run apply twice on the first row. This fix only applies to the group in groupby.apply. Damn. – Rutger 13/9, 2019 at 7:59

@Rutger This is also now the case for .apply. – Penal 30/8, 2020 at 6:23

Which version of pandas? – Rutger 31/8, 2020 at 2:7

This is by design, as described here and here

The apply function needs to know the shape of the returned data to figure out how the results will be combined. To do this, it calls the function (checkit in your case) twice on the first group.

Depending on your actual use case, you can replace the call to apply with aggregate, transform or filter, as described in detail here. These functions require the return value to be a particular shape, and so don't call the function twice.
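As a rough illustration of those three alternatives, here is a minimal sketch. The data frame is a made-up variant of the one in the question (an extra 'A' row is added so that one group has more than one member), and the lambdas are only placeholders for whatever your real function does:

```python
import pandas as pd

# Hypothetical data: the question's frame plus one extra 'A' row
df = pd.DataFrame({"class": ["A", "B", "C", "A"], "count": [1, 0, 2, 5]})

# aggregate: reduces each group to a single value
totals = df.groupby("class")["count"].agg("sum")

# transform: returns a result aligned with the original rows
plus_one = df.groupby("class")["count"].transform(lambda s: s + 1)

# filter: keeps or drops whole groups based on a boolean test
nonzero = df.groupby("class").filter(lambda g: g["count"].sum() > 0)

print(totals)
print(plus_one)
print(nonzero)
```

Which of the three fits depends on whether you want one value per group, one value per row, or a subset of the original rows.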

However - if the function you are calling does not have side-effects, it most likely does not matter that the function is being called twice on the first value.
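If you are unsure whether this affects you, one way to check is to give the function a deliberate side effect and count the calls. The sketch below is only an illustration: the names calls and record are made up for this example, and the number of calls you see depends on your pandas version.

```python
import pandas as pd

df = pd.DataFrame({"class": ["A", "B", "C"], "count": [1, 0, 2]})

calls = []  # hypothetical log; every invocation appends one entry

def record(group):
    # Side effect: remember how many rows this invocation received
    calls.append(len(group))
    return group["count"].sum()

# Selecting [["count"]] keeps the grouping column out of each group
result = df.groupby("class")[["count"]].apply(record)

print(result)
print("function was called", len(calls), "times for", result.size, "groups")
```

If the call count is larger than the number of groups, any side effect in the function (logging, writing to a file, appending to a list) will be repeated for the first group.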

Berke answered 8/9, 2014 at 1:39 Comment(0)

You can use a for loop instead of groupby.apply, so that the first group is not processed twice:

log_sample.csv

guestid,keyword
1,null
2,null
2,null
3,null
3,null
3,null
4,null
4,null
4,null
4,null

My code snippet:

import pandas as pd

df = pd.read_csv("log_sample.csv")
grouped = df.groupby("guestid")

# Iterating yields each (key, sub-frame) pair exactly once
for guestid, df_group in grouped:
    print(list(df_group['guestid']))

df.head(100)

output

[1]
[2, 2]
[3, 3, 3]
[4, 4, 4, 4]
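If you need to keep a result per group rather than just print it, the same loop can collect values into a dictionary. This is only a sketch: the CSV text is a shortened inline stand-in for log_sample.csv, and sizes is a name made up for this example.

```python
import io
import pandas as pd

# Shortened inline stand-in for log_sample.csv (assumption for this sketch)
csv_text = "guestid,keyword\n1,null\n2,null\n2,null\n3,null\n"
df = pd.read_csv(io.StringIO(csv_text))

sizes = {}
for guestid, df_group in df.groupby("guestid"):
    # Each group is visited once, so side effects here run once per group
    sizes[int(guestid)] = len(df_group)

print(sizes)
```

The loop is usually slower than the built-in groupby methods on large frames, so it is mainly worth it when the per-group work has side effects.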

Voltmeter answered 4/4, 2018 at 3:17 Comment(0)
