Here is a way to get as close to 20% nan in each column as possible:
def input_nan(x,pct):
n = int(len(x)*(pct - x.isna().mean()))
idxs = np.random.choice(len(x), max(n,0), replace=False, p=x.notna()/x.notna().sum())
x.iloc[idxs] = np.nan
df.apply(input_nan, pct=.2)
It first takes the difference between the NaN
percent you want, and the percent NaN
values in your dataset already. Then it multiplies it by the length of the column, which gives you how many NaN
values you want to put in (n
). Then uses np.random.choice
which randomly choose n
indexes that don't have NaN
values in them.
Example:
df = pd.DataFrame({'y':np.random.randn(10), 'x1':np.random.randn(10), 'x2':np.random.randn(10)})
df.y.iloc[1]=np.nan
df.y.iloc[8]=np.nan
df.x2.iloc[5]=np.nan
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 0.289559
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 0.180651 NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 0.475805 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
df.apply(input_nan)
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 NaN
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 NaN NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 NaN 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
I have applied it to the whole dataset, but you can apply it to any column you want. For example, if you wanted 15% NaNs in columns y
and x1
, you could call df[['y','x1]].apply(input_nan, pct=.15)
NaN
values you'll likely go over the 20% you want in each column. – Pedant