Pandas aggregation ignoring NaN's

Asked 1/10, 2014 at 16:1 Answered 22/4, 2019 at 10:47

I aggregate my Pandas dataframe: data. Specifically, I want to get the average and sum amounts by tuples of [origin and type]. For averaging and summing I tried the numpy functions below:

import numpy as np
import pandas as pd
result = data.groupby(groupbyvars).agg({'amount': [ pd.Series.sum, pd.Series.mean]}).reset_index()

My issue is that the amount column includes NaNs, which causes the result of the above code to have a lot of NaN average and sums.

I know both pd.Series.sum and pd.Series.mean have skipna=True by default, so why am I still getting NaNs here?

I also tried this, which obviously did not work:

data.groupby(groupbyvars).agg({'amount': [ pd.Series.sum(skipna=True), pd.Series.mean(skipna=True)]}).reset_index()

EDIT: Upon @Korem's suggestion, I also tried to use a partial as below:

s_na_mean = partial(pd.Series.mean, skipna = True)    
data.groupby(groupbyvars).agg({'amount': [ np.nansum, s_na_mean ]}).reset_index()

but get this error:

error: 'functools.partial' object has no attribute '__name__'

Heisenberg answered 1/10, 2014 at 16:1 Comment(2)

Could you post some example data? Also, to start with, rather than pd.Series.sum - just use 'sum' - the code should take a faster path. – Langford 1/10, 2014 at 16:26

Thank you, I decided to use pd.Series.sum jus becasue it had a skipna option. Reading @Korem's answer, I now use np.nansum. But np.nanmean is not available in my version (1.7.1) of numpy.I will try to post representative data, which may take a while. – Heisenberg 2/10, 2014 at 7:12

Use numpy's nansum and nanmean:

from numpy import nansum
from numpy import nanmean
data.groupby(groupbyvars).agg({'amount': [ nansum, nanmean]}).reset_index()

As a workaround for older version of numpy, and also a way to fix your last try:

When you do pd.Series.sum(skipna=True) you actually call the method. If you want to use it like this you want to define a partial. So if you don't have nanmean, let's define s_na_mean and use that:

from functools import partial
s_na_mean = partial(pd.Series.mean, skipna = True)

Jameson answered 1/10, 2014 at 19:6 Comment(6)

Thank you, I use numpy-1.7.1-py2.7-win32.egg, it does not like nanmean throwing the error: 'module' object has no attribute 'nanmean'. (I just checked, nanmean is new in verison 1.8.0 – Heisenberg 2/10, 2014 at 7:6

But np.nansum seems to be added in version 1.8.0 as well. It is curious that I do not get the same error for that... – Heisenberg 2/10, 2014 at 7:15

Thanks Korem, I tried this but it did not work, I edited my question, giving the error. Also, isn't skipna=True for pd.Series.mean by default anyways? – Heisenberg 2/10, 2014 at 8:53

@Heisenberg it is on by default, which suggest that the problem you're seeing is not where you think it is. – Jameson 2/10, 2014 at 9:0

You are right, I tried this, which ran: data.groupby(groupbyvars).agg({'amount': [ np.nansum, lambda x: pd.Series.mean(x,skipna=True)]}).reset_index() but still get NaNs. I will investigate further. Maybe those are the cases for which all I have is NaN.. – Heisenberg 2/10, 2014 at 9:13

pandas doc : "skipna : boolean, default True", "Exclude NA/null values. If an entire row/column is NA, the result will be NA" – Disinterest 22/9, 2017 at 7:13

It might be too late but anyways it might be useful for others.

Try apply function:

import numpy as np
import pandas as pd

def nan_agg(x):
    res = {}

    res['nansum'] = x.loc[ not x['amount'].isnull(), :]['amount'].sum()
    res['nanmean'] = x.loc[ not x['amount'].isnull(), :]['amount'].mean()

    return pd.Series(res, index=['nansum', 'nanmean'])

result = data.groupby(groupbyvars).apply(nan_agg).reset_index()

Declinatory answered 22/4, 2019 at 10:47 Comment(0)

Recommended topics

Hot tags