Pandas sum of two columns - dealing with nan-values correctly
Asked Answered
R

5

5

When summing two pandas columns, I want to ignore nan-values when one of the two columns is a float. However when nan appears in both columns, I want to keep nan in the output (instead of 0.0).

Initial dataframe:

Surf1     Surf2
0         0
NaN       8
8         15
NaN       NaN
16        14
15        7

Desired output:

Surf1     Surf2     Sum
0         0         0
NaN       8         8
8         15        23
NaN       NaN       NaN
16        14        30
15        7         22

Tried code: -> the code below ignores nan-values but when taking the sum of two nan-values, it gives 0.0 in the output where I want to keep it as NaN in that particular case to keep these empty values separate from values that are actually 0 after summing.

import pandas as pd
import numpy as np

data = pd.DataFrame({"Surf1": [10,np.nan,8,np.nan,16,15], "Surf2": [22,8,15,np.nan,14,7]})
print(data)

data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1)
print(data)
Rhatany answered 6/5, 2020 at 13:9 Comment(0)
G
12

From the documentation pandas.DataFrame.sum

By default, the sum of an empty or all-NA Series is 0.

>>> pd.Series([]).sum() # min_count=0 is the default 0.0

This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.

Change your code to

data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1, min_count=1)

output

   Surf1  Surf2
0   10.0   22.0
1    NaN    8.0
2    8.0   15.0
3    NaN    NaN
4   16.0   14.0
5   15.0    7.0
   Surf1  Surf2   Sum
0   10.0   22.0  32.0
1    NaN    8.0   8.0
2    8.0   15.0  23.0
3    NaN    NaN   NaN
4   16.0   14.0  30.0
5   15.0    7.0  22.0
Genesisgenet answered 6/5, 2020 at 13:15 Comment(1)
Thanks, apparently you can achieve this in many different ways, but this is the cleanest answer to the question I think (the one that should be accepted).Rhatany
N
3

You could mask the result by doing:

df.sum(1).mask(df.isna().all(1))

0     0.0
1     8.0
2    23.0
3     NaN
4    30.0
5    22.0
dtype: float64
Nubbly answered 6/5, 2020 at 13:11 Comment(0)
K
3

You can do:

df['Sum'] = df.dropna(how='all').sum(1)

Output:

   Surf1  Surf2   Sum
0   10.0   22.0  32.0
1    NaN    8.0   8.0
2    8.0   15.0  23.0
3    NaN    NaN   NaN
4   16.0   14.0  30.0
5   15.0    7.0  22.0
Kilter answered 6/5, 2020 at 13:12 Comment(2)
Why in your example output the sum of 8.0 and 15.0 is 46.0?Genesisgenet
@Genesisgenet It's must have been because I ran the code twice :-). Updating.Kilter
K
2

You can use min_count, this will sum all the row when there is at least on not null, if all null return null

df['SUM']=df.sum(min_count=1,axis=1)
#df.sum(min_count=1,axis=1)
Out[199]: 
0     0.0
1     8.0
2    23.0
3     NaN
4    30.0
5    22.0
dtype: float64
Khmer answered 6/5, 2020 at 13:15 Comment(0)
D
1

I think All the solutions listed above work only for the cases when when it is the FIRST column value that is missing. If you have cases when the first column value is non-missing but the second column value is missing, try using:

df['sum'] = df['Surf1']

df.loc[(df['Surf2'].notnull()), 'sum'] = df['Surf1'].fillna(0) + df['Surf2']

Diverticulitis answered 19/10, 2021 at 7:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.