If you want to do the same for every group, you can use this trick:
import numpy as np
import pandas as pd

data = pd.DataFrame({'a':list('aaaddd'),
                     'Sex':['female','female','male','female','female','male'],
                     'Pclass':[1,2,1,2,1,1],
                     'Age':[40,20,30,20, np.nan, np.nan]})
df = data.groupby(["Sex","Pclass"])["Age"].median().to_frame().reset_index()
df.rename(columns={"Age":"Med"}, inplace=True)
data = pd.merge(left=data,right=df, how='left', on=["Sex", "Pclass"])
data["Age"] = np.where(data["Age"].isnull(), data["Med"], data["Age"])
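The merge trick above can be wrapped in a small helper. This is only a sketch: the function name `fill_with_group_median` and the temporary column name `_med` are made up for illustration, and unlike the snippet above it drops the helper column once the gaps are filled.

```python
import numpy as np
import pandas as pd


def fill_with_group_median(frame, keys, target):
    """Fill NaNs in `target` with the median of `target` within each `keys` group.

    Hypothetical helper for illustration; not part of pandas.
    """
    # One row per group, holding that group's median (NaNs are skipped by default).
    med = (frame.groupby(keys)[target]
                .median()
                .rename("_med")
                .reset_index())
    # A left merge keeps every original row, in the original order.
    out = frame.merge(med, how="left", on=keys)
    out[target] = np.where(out[target].isnull(), out["_med"], out[target])
    return out.drop(columns="_med")


data = pd.DataFrame({'a': list('aaaddd'),
                     'Sex': ['female', 'female', 'male', 'female', 'female', 'male'],
                     'Pclass': [1, 2, 1, 2, 1, 1],
                     'Age': [40, 20, 30, 20, np.nan, np.nan]})

filled = fill_with_group_median(data, ["Sex", "Pclass"], "Age")
print(filled["Age"].tolist())  # [40.0, 20.0, 30.0, 20.0, 40.0, 30.0]
```

The two missing ages are replaced by 40 (median of female, class 1) and 30 (median of male, class 1); all other rows are unchanged.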
UPDATE:
# dummy dataframe
n = int(1e7)
data = pd.DataFrame({"Age":np.random.choice([10,20,20,30,30,40,np.nan], n),
                     "Pclass":np.random.choice([1,2,3], n),
                     "Sex":np.random.choice(["male","female"], n),
                     "a":np.random.choice(["a","b","c","d"], n)})
On my machine, running this (the same as the previous snippet, but without the renaming step) takes:
df = data.groupby(["Sex","Pclass"])["Age"].agg(['median']).reset_index()
data = pd.merge(left=data,right=df, how='left', on=["Sex", "Pclass"])
data["Age"] = np.where(data["Age"].isnull(), data["median"], data["Age"])
CPU times: user 1.98 s, sys: 216 ms, total: 2.2 s
Wall time: 2.2 s
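The `CPU times` / `Wall time` lines look like the output of IPython's `%%time` magic. Outside IPython, a rough wall-clock measurement can be taken with `time.perf_counter`. The sketch below is one assumed way to reproduce the measurement; the smaller `n` and the fixed seed are choices made here so that it finishes quickly, and absolute numbers will differ from machine to machine.

```python
import time

import numpy as np
import pandas as pd

n = 100_000  # far smaller than the 1e7 rows used above, just to keep the run short
rng = np.random.default_rng(0)  # fixed seed is an assumption, for repeatability
data = pd.DataFrame({"Age": rng.choice([10, 20, 20, 30, 30, 40, np.nan], n),
                     "Pclass": rng.choice([1, 2, 3], n),
                     "Sex": rng.choice(["male", "female"], n)})

start = time.perf_counter()
# Same three steps as the merge-based snippet above.
df = data.groupby(["Sex", "Pclass"])["Age"].agg(["median"]).reset_index()
merged = pd.merge(left=data, right=df, how="left", on=["Sex", "Pclass"])
merged["Age"] = np.where(merged["Age"].isnull(), merged["median"], merged["Age"])
elapsed = time.perf_counter() - start

print(f"Wall time: {elapsed:.3f} s")
```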
While the mask solution took:
for sex in ["male", "female"]:
    for pclass in range(1,4):
        mask1 = (data['Sex'] == sex) & (data['Pclass'] == pclass)
        med = data.loc[mask1, 'Age'].median()
        data.loc[mask1, 'Age'] = data.loc[mask1, 'Age'].fillna(med)
CPU times: user 5.13 s, sys: 60 ms, total: 5.19 s
Wall time: 5.19 s
@jezrael's solution is even faster:
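# group_keys=False keeps the original row index so the assignment aligns (pandas >= 2.0 would otherwise prepend the group keys)
data['Age'] = data.groupby(["Sex","Pclass"], group_keys=False)["Age"].apply(lambda x: x.fillna(x.median()))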
CPU times: user 1.34 s, sys: 92 ms, total: 1.44 s
Wall time: 1.44 s
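A closely related variant uses `transform` instead of `apply`. `transform("median")` returns a Series aligned with the original row index, so it can be passed straight to `fillna`. This is a sketch of an alternative, not the code that was timed above, and no timing is claimed for it. It may also be the safer form: depending on the pandas version, an `apply`-based result can carry the group keys in its index and then fail to align on assignment.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': list('aaaddd'),
                     'Sex': ['female', 'female', 'male', 'female', 'female', 'male'],
                     'Pclass': [1, 2, 1, 2, 1, 1],
                     'Age': [40, 20, 30, 20, np.nan, np.nan]})

# Median of each (Sex, Pclass) group, broadcast back to every row of that group.
group_median = data.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Only the missing ages are replaced; existing values are left untouched.
data["Age"] = data["Age"].fillna(group_median)
print(data["Age"].tolist())  # [40.0, 20.0, 30.0, 20.0, 40.0, 30.0]
```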
NaNs by median per group, is necessary only groupby. Check edit in my answer. – Bourg