specifying "skip NA" when calculating mean of the column in a data frame created by Pandas

Asked 30/7, 2014 at 14:22 Answered 12/4, 2017 at 17:10

I am learning Pandas package by replicating the outing from some of the R vignettes. Now I am using the dplyr package from R as an example:

http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

R script

planes <- group_by(hflights_df, TailNum)
delay <- summarise(planes,
  count = n(),
  dist = mean(Distance, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

Python script

planes = hflights.groupby('TailNum')
planes['Distance'].agg({'count' : 'count',
                        'dist' : 'mean'})

How can I state explicitly in python that NA needs to be skipped?

Multifaceted answered 30/7, 2014 at 14:22 Comment(0)

That's a trick question, since you don't do that. Pandas will automatically exclude NaN numbers from aggregation functions. Consider my df:

    b   c   d  e
a               
2   2   6   1  3
2   4   8 NaN  7
2   4   4   6  3
3   5 NaN   2  6
4 NaN NaN   4  1
5   6   2   1  8
7   3   2   4  7
9   6   1 NaN  1
9 NaN NaN   9  3
9   3   4   6  1

The internal count() function will ignore NaN values, and so will mean(). The only point where we get NaN, is when the only value is NaN. Then, we take the mean value of an empty set, which turns out to be NaN:

In[335]: df.groupby('a').mean()
Out[333]: 
          b    c    d         e
a                              
2  3.333333  6.0  3.5  4.333333
3  5.000000  NaN  2.0  6.000000
4       NaN  NaN  4.0  1.000000
5  6.000000  2.0  1.0  8.000000
7  3.000000  2.0  4.0  7.000000
9  4.500000  2.5  7.5  1.666667

Aggregate functions work in the same way:

In[340]: df.groupby('a')['b'].agg({'foo': np.mean})
Out[338]: 
        foo
a          
2  3.333333
3  5.000000
4       NaN
5  6.000000
7  3.000000
9  4.500000

Addendum: Notice how the standard dataframe.mean API will allow you to control inclusion of NaN values, where the default is exclusion.

Hoeg answered 30/7, 2014 at 14:49 Comment(2)

Thanks, and how do I do the opposite: make the pandas include NaN? – Abreu 12/7, 2018 at 8:52

@Dr_Zaszuś have a look at the last line, which links to the manual. It lists the option of including NaN. You can build on top of that as the other answer suggests. – Hoeg 12/7, 2018 at 9:25

What foobar said is true in regards to how it was implemented by default, but there is a very easy way to specify skipna. Here is an exemple that speaks for itself:

def custom_mean(df):
    return df.mean(skipna=False)

group.agg({"your_col_name_to_be_aggregated":custom_mean})

That's it! You can customize your own aggregation the way you want, and I'd expect this to be fairly efficient, but I did not dig into it.

It was also discussed here, but I thought I'd help spread the good news! Answer was found in the official doc.

Ethylethylate answered 12/4, 2017 at 17:10 Comment(2)

@lokheart, this might interest you. – Ethylethylate 12/4, 2017 at 17:11

Why np.mean does not work? – Gerri 4/10, 2021 at 20:16

R script

Python script

Recommended topics

Hot tags