How to make a slice of DataFrame and "fillna" in specific slice using Python Pandas?

Asked 23/11, 2017 at 14:34 Answered 11/5, 2020 at 17:46

Solved python pandas dataframe data-analysis

The problem: let us take Titanic dataset from Kaggle. I have dataframe with columns "Pclass", "Sex" and "Age". I need to fill NaN in column "Age" with a median for certain group. If it is a woman from 1st class, I would like to fill her age with the median for 1st class women, not with the median for whole Age column.

The question is how to make this change in a certain slice?

I tried:

data['Age'][(data['Sex'] == 'female')&(data['Pclass'] == 1)&(data['Age'].isnull())].fillna(median)

where the "median" is my value, but nothing changes "inplace=True" didn't help.

Thanks alot!

Cementum answered 23/11, 2017 at 14:34 Comment(1)

I add solution for filling NaNs by median per group, is necessary only groupby. Check edit in my answer. – Bourg 24/11, 2017 at 8:29

I believe you need filter by masks and assign back:

data = pd.DataFrame({'a':list('aaaddd'),
                     'Sex':['female','female','male','female','female','male'],
                     'Pclass':[1,2,1,2,1,1],
                     'Age':[40,20,30,20,np.nan,np.nan]})

print (data)
    Age  Pclass     Sex  a
0  40.0       1  female  a
1  20.0       2  female  a
2  30.0       1    male  a
3  20.0       2  female  d
4   NaN       1  female  d
5   NaN       1    male  d

#boolean mask
mask1 = (data['Sex'] == 'female')&(data['Pclass'] == 1)

#get median by mask without NaNs
med = data.loc[mask1, 'Age'].median()
print (med)
40.0

#repalce NaNs
data.loc[mask1, 'Age'] = data.loc[mask1, 'Age'].fillna(med)
print (data)
    Age  Pclass     Sex  a
0  40.0       1  female  a
1  20.0       2  female  a
2  30.0       1    male  a
3  20.0       2  female  d
4  40.0       1  female  d
5   NaN       1    male  d

What is same as:

mask2 = mask1 &(data['Age'].isnull())

data.loc[mask2, 'Age'] = med
print (data)
    Age  Pclass     Sex  a
0  40.0       1  female  a
1  20.0       2  female  a
2  30.0       1    male  a
3  20.0       2  female  d
4  40.0       1  female  d
5   NaN       1    male  d

EDIT:

If need replace all groups NaNs by median:

data['Age'] = data.groupby(["Sex","Pclass"])["Age"].apply(lambda x: x.fillna(x.median()))
print (data)

    Age  Pclass     Sex  a
0  40.0       1  female  a
1  20.0       2  female  a
2  30.0       1    male  a
3  20.0       2  female  d
4  40.0       1  female  d
5  30.0       1    male  d

Bourg answered 23/11, 2017 at 14:36 Comment(3)

Yes, exactly. Because eklse get NaN. – Bourg 23/11, 2017 at 14:50

You are welcome! Maybe small advice to future - how to provide a great pandas example ;) – Bourg 23/11, 2017 at 15:1

I see your question in comment and try answer. for loops are possible in pandas, but it is slow. Jeff, one of main developer of pandas at the moment, post very nice answer - check here. – Bourg 24/11, 2017 at 8:43

I want to add an answer here that is a bit more efficient, as it involves a little less code. Essentially, if you are are using boolean conditions to slice your dataframe and are using .fillna under those specific conditions, just use an assignment:

I will use an example from a different Kaggle competition:

# Use a mask as suggested by jesrael. It's just neater:
mask1 = (test_df.Neighborhood == 'IDOTRR') & (test_df.MSZoning.isna())
mask2 = (test_df.Neighborhood == 'Mitchel') & (test_df.MSZoning.isna())

# Use the mask and assign the desired value 
test_df.loc[mask1, 'MSZoning'] = 'RM'
test_df.loc[mask2, 'MSZoning'] = 'RL'

This is different than jesrael's answer in that he/she uses .fillna() to assign back to the masked dataframe. If you are going to use a mask, and have a specific value in mind, there is no need to use '.fillna()'

Fina answered 11/5, 2020 at 17:46 Comment(0)

In case you want to do the same for every groups you can use this trick

data = pd.DataFrame({'a':list('aaaddd'),
                    'Sex':['female','female','male','female','female','male'],
                    'Pclass':[1,2,1,2,1,1],
                    'Age':[40,20,30,20, np.nan, np.nan]})
df = data.groupby(["Sex","Pclass"])["Age"].median().to_frame().reset_index()
df.rename(columns={"Age":"Med"}, inplace=True)
data = pd.merge(left=data,right=df, how='left', on=["Sex", "Pclass"])
data["Age"] = np.where(data["Age"].isnull(), data["Med"], data["Age"])

UPDATE:

# dummy dataframe
n = int(1e7)
data = pd.DataFrame({"Age":np.random.choice([10,20,20,30,30,40,np.nan], n),
                     "Pclass":np.random.choice([1,2,3], n),
                     "Sex":np.random.choice(["male","female"], n),
                     "a":np.random.choice(["a","b","c","d"], n)})

In my machine running this (is as the previous without renaming)

df = data.groupby(["Sex","Pclass"])["Age"].agg(['median']).reset_index()
data = pd.merge(left=data,right=df, how='left', on=["Sex", "Pclass"])
data["Age"] = np.where(data["Age"].isnull(), data["median"], data["Age"])

CPU times: user 1.98 s, sys: 216 ms, total: 2.2 s
Wall time: 2.2 s

While the mask solution took:

for sex in ["male", "female"]:
    for pclass in range(1,4):
        mask1 =(data['Sex'] == sex)&(data['Pclass'] == pclass)
        med = data.loc[mask1, 'Age'].median()
        data.loc[mask1, 'Age'] = data.loc[mask1, 'Age'].fillna(med)

CPU times: user 5.13 s, sys: 60 ms, total: 5.19 s
Wall time: 5.19 s

@jezrael solution is even faster

data['Age'] = data.groupby(["Sex","Pclass"])["Age"].apply(lambda x: x.fillna(x.median()))

CPU times: user 1.34 s, sys: 92 ms, total: 1.44 s
Wall time: 1.44 s

Fascia answered 23/11, 2017 at 15:38 Comment(1)

Thanks for ur answer! I have an additional question. I used another way using for-loops. But I think ur way could be better because it uses numpy and pandas which is C++ and must be faster if will be used on large datasets, am I right? – Cementum 24/11, 2017 at 8:37

Recommended topics

Hot tags