Pandas groupby mean() not ignoring NaNs
If I calculate the mean of a groupby object and one of the groups contains a NaN, the NaN is ignored. Even when applying np.mean, it still returns the mean of the valid numbers only. I would expect it to return NaN as soon as a single NaN is in the group. Here is a simplified example of the behaviour:

import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
     a
b     
1  1.5
2  3.0
c.groupby('b').agg(np.mean)
     a
b     
1  1.5
2  3.0

I want to receive following result:

     a
b     
1  1.5
2  NaN

I am aware that I can replace NaNs beforehand, and that I could probably write my own aggregation function that returns NaN as soon as a NaN is in the group. Such a function wouldn't be optimized, though.

Do you know of an argument to achieve the desired behaviour with the optimized functions?

Btw, I think the desired behaviour was implemented in a previous version of pandas.

Kempis answered 9/1, 2019 at 8:42 Comment(1)
This behavior of Pandas is dreadful... None of the solutions proposed make standard functions like np.mean() work as they should (returning NaN). – Kesler
C
16

By default, pandas skips NaN values. You can make it include them by specifying skipna=False in the lambda passed to agg:

In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]: 
     a
b     
1  1.5
2  NaN
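If you need this for every column rather than just 'a', the same lambda can be passed to agg directly instead of via a dict (a small extension of the answer above, not part of the original):

```python
import pandas as pd
import numpy as np

c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})

# Passing the lambda directly applies skipna=False to every non-grouping column.
result = c.groupby('b').agg(lambda x: x.mean(skipna=False))
print(result)
```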
Celestine answered 9/1, 2019 at 9:8 Comment(0)
R
15

There is mean(skipna=False), but it doesn't work

GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for exactly this task, but it seems that currently (May 2020) there is a bug (issue opened March 2020) that prevents it from working correctly.

Quick workaround

Complete working example, based on these comments: @Serge Ballesta, @RoelAdriaans

>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)

     a
b     
1  1.5
2  NaN

For additional information and updates follow the link above.
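One caveat worth noting (my addition, not part of the original answer): the round trip works because the inf placeholder propagates through mean(), but it would also turn any genuine inf already present in the data into NaN.

```python
import pandas as pd
import numpy as np

c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})

# Any group that contained a NaN now contains inf, its mean becomes inf,
# and the final replace() maps it back to NaN. Caveat: genuine inf values
# in the data would be converted to NaN as well.
result = c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
print(result)
```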

Ropedancer answered 28/5, 2020 at 23:0 Comment(0)
C
4

Use the skipna option with apply:

c.groupby('b').apply(lambda g: g.mean(skipna=False))
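A side note on this approach (my addition): applied to the whole frame, the lambda also averages the grouping column b itself. Selecting the column first avoids that:

```python
import pandas as pd
import numpy as np

c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})

# Selecting 'a' first keeps the grouping column out of the result.
result = c.groupby('b')['a'].apply(lambda s: s.mean(skipna=False))
print(result)
```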
Chaney answered 9/1, 2019 at 8:49 Comment(0)
N
2

Another approach would be to use a value that is not ignored by default, for example np.inf:

>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
          a
b          
1  1.500000
2       inf
Nobel answered 9/1, 2019 at 9:27 Comment(1)
Before the mean calculation you can use fillna(np.inf), and after the mean you can use .replace([np.inf, -np.inf], np.nan) to restore the NaN values. – Anse
P
2

There are three different methods for it:

  1. Slowest:
    c.groupby('b').apply(lambda g: g.mean(skipna=False))
  2. Faster than apply, but slower than the default aggregation:
    c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
  3. Fastest, but needs more code (find the groups whose 'a' column contains a NaN, then blank them out):
    method3 = c.groupby('b').mean()
    nan_groups = c.loc[c['a'].isna(), 'b'].unique()
    method3.loc[method3.index.isin(nan_groups)] = np.nan
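As a quick sanity check (my addition, not part of the original answer), the first two methods can be verified to agree on the example frame:

```python
import pandas as pd
import numpy as np

c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})

m1 = c.groupby('b')['a'].apply(lambda s: s.mean(skipna=False))
m2 = c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})['a']

# Both keep the NaN in group b=2; equals() treats NaNs at the
# same location as equal.
assert m1.equals(m2)
```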


Profanity answered 14/1, 2020 at 9:59 Comment(0)
C
0

I landed here in search of a fast (vectorized) way of doing this, but did not find one. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it converts groups where all values are NaN into 0+0j.

So, here is what I came up with:

Setup:

df = pd.DataFrame({
    'a': [1, 2, 1, 2],
    'b': [1, np.nan, 2, 3],
    'c': [1, np.nan, 2, np.nan],
    'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')

Default behavior:

gb.sum()

Out[]:
     b    c                   d
a                              
1  3.0  3.0  0.000000+2.000000j
2  3.0  0.0  0.000000+0.000000j

A single NaN kills the group:

cnt = gb.count()
siz = gb.size()
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)

Out[]:
     b    c   d
a              
1  3.0  3.0 NaN
2  NaN  NaN NaN
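To unpack why the mask works (my sketch, simplified to a single real-valued column): size() counts all rows per group while count() counts only the non-NaN ones, so they match exactly for the groups with no missing values.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [1, np.nan, 2, 3]})
gb = df.groupby('a')

# size() == count() only where a group has no missing values,
# so where() blanks out every group containing at least one NaN.
mask = gb.size().values[:, None] == gb.count().values
result = gb.sum().where(mask)
print(result)
```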

Only NaN if all values in group are NaN:

cnt = gb.count()
gb.sum() * (cnt / cnt)

Out[]:
     b    c                   d
a                              
1  3.0  3.0  0.000000+2.000000j
2  3.0  NaN                 NaN
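The cnt / cnt trick works because the division is 1.0 wherever a group has at least one valid value and NaN (0/0) where every value is missing; multiplying by it masks only the all-NaN groups (my sketch, simplified to real-valued columns):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': [1, 2, 1, 2],
    'b': [1, np.nan, 2, 3],
    'c': [1, np.nan, 2, np.nan],
})
gb = df.groupby('a')
cnt = gb.count()

# cnt / cnt is 1.0 for groups with data and NaN (0/0) for all-NaN groups,
# so the product keeps partial groups and blanks out only empty ones.
masked = gb.sum() * (cnt / cnt)
print(masked)
```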

Corollary: mean of complex:

cnt = gb.count()
gb.sum() / cnt

Out[]:
     b    c                   d
a                              
1  1.5  1.5  0.000000+2.000000j
2  3.0  NaN                 NaN
Cherice answered 11/12, 2020 at 2:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.