Randomly introduce NaN values in pandas dataframe

How can I randomly introduce NaN values into my dataset, column by column, taking into account the null values already present in my starting data?

I want to end up with, for example, 20% NaN values per column.

For example, suppose my dataset has 3 columns, "A", "B" and "C", each with its own existing NaN rate (a quick way to check these rates is sketched right after this list). How do I randomly introduce NaN values per column so that each one reaches 20%?

A: 10% nan
B: 15% nan
C: 8% nan
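
To check the existing rates per column, something like this works (a minimal sketch, df being the same starting DataFrame as in my attempt below):

# fraction of missing values in each column, e.g. A -> 0.10, B -> 0.15, C -> 0.08
print(df.isna().mean())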

For the moment I have tried the code below, but it degrades my dataset too much and I don't think it is the right approach:

df = df.mask(np.random.choice([True, False], size=df.shape, p=[.20,.80]))
Biff answered 23/1, 2019 at 15:21 Comment(3)
But does the code you provide work? Because if you already have existing NaN values you'll likely go over the 20% you want in each column.Pedant
Yes, I have existing NaN values and I want to end up with exactly 20% in each column, neither more nor less.Biff
Yes, the code works, but I think it introduces 20% over the whole dataset, not per column.Biff

I am not sure what you mean by the last part ("degrades too much"), but here is a rough way to do it.

import numpy as np
import pandas as pd

A = pd.Series(np.arange(99), dtype=float)  # float dtype so NaN values can be inserted

# Original missing rate (for illustration): blank out a random 10% of the values
nanidx = A.sample(frac=0.1).index
A[nanidx] = np.nan

###
# Complementing to 20%
# Original ratio
ori_rat = A.isna().mean()

# Fraction of the remaining non-missing values that must be blanked out to reach 20% overall
add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)

nanidx2 = A.dropna().sample(frac=add_miss_rat).index
A[nanidx2] = np.nan

A.isna().mean()

Obviously, it will not always be exactly 20%...
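
To see why the adjustment formula gives the right top-up, here is a quick back-of-the-envelope check (the numbers are made up purely for illustration):

# Suppose a column of 1,000 values already has 100 NaNs (ori_rat = 0.10).
# Reaching 20% means 200 NaNs in total, i.e. 100 more out of the 900 non-missing values.
ori_rat = 0.10
add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)   # ~0.111 of the non-missing values
print(add_miss_rat * 900)                        # ~100 additional NaNs, as required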

Update: Applying it to the whole dataframe

for col in df:
    ori_rat = df[col].isna().mean()

    # columns that already have 20% or more missing values are left untouched
    if ori_rat >= 0.2:
        continue

    add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
    vals_to_nan = df[col].dropna().sample(frac=add_miss_rat).index
    df.loc[vals_to_nan, col] = np.nan

Update 2: I made a correction to also take into account the effect of dropping NaN values when calculating the ratio.

Pusillanimity answered 23/1, 2019 at 15:45 Comment(4)
My example was just to illustrate; in reality I do not know the NaN rate of my df or of each column. Ideally, the NaN rate should be computed for each column, and then NaN values added randomly so that each column ends up with 20% NaN values.Biff
You can ignore the first part (where I generate the original NaN rate); that was just for illustration. The second part does what you need for a particular column (of course, it also calculates the original missing rate along the way...).Pusillanimity
When I execute the code, I get this ValueError: A negative number of rows requested. Please provide positive value.Biff
This is because some of the original missing rates are higher than 0.2. I adjusted the code so it will leave those untouched.Pusillanimity

Unless you have a giant DataFrame and speed is a concern, the easy-peasy way to do it is by iteration.

import pandas as pd
import numpy as np
import random

df = pd.DataFrame({'A':list(range(100)),'B':list(range(100)),'C':list(range(100))})
#before adding nan
print(df.head(10))

nan_percent = {'A':0.10, 'B':0.15, 'C':0.08}

for col in df:
    for i in df[col].index:
        # each cell independently becomes NaN with the column's probability
        if random.random() <= nan_percent[col]:
            df.loc[i, col] = np.nan
#after adding nan            
print(df.head(10))
Degraw answered 23/1, 2019 at 16:7 Comment(9)
If I have a large dataset (> 100 columns), how should I set up nan_percent?Biff
It's the same; it is based on the output of random.random(), which is a random float from 0 to 1: 10% = 0.1, 1% = 0.01, etc. The length of the column doesn't matter.Degraw
How long does it take if I have 40 columns? I ran it on my data and it has been going for more than 30 minutes.Biff
It depends on both the number of columns and the number of rows. Hard to tell, since it depends on your hardware as well. 30 minutes sounds like too much. How many rows does your df have?Degraw
My hardware is decent, I think. I have 150,000 rows and 40 columns. Beforehand I build a dictionary mapping each column to the rate at which I want to degrade it.Biff
150k rows and 40 cols is quite big; not sure how long it should take. Test a subset like 1k rows and see. If it takes too long and you aim for faster execution time, then this solution is not what you are looking for. Iteration is slow.Degraw
OK, I see. I think iterating is not a good way to do it. Starting from the same idea, how can I do it without iterating? I already have my dict of column nan_percent.Biff
I am going to try something like df = df.mask(np.random.choice([True, False], size=df.shape, p=[])), put my nan_percent dic inside p = [], and see how it works.Biff
You can try using apply pandas.pydata.org/pandas-docs/stable/reference/api/…Degraw
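
For what it's worth, a non-iterating sketch along the lines discussed in the comments above could look like this (assuming the same nan_percent dict; note that np.random.choice's p must sum to 1, so per-column rates are easier to apply by broadcasting one uniform draw per cell, and that this stacks new NaNs on top of any existing ones rather than topping each column up to an exact rate):

import numpy as np

# probabilities per column, aligned with df's column order
p = np.array([nan_percent[c] for c in df.columns])

# one uniform draw per cell; a cell becomes NaN when its draw falls below
# that column's probability (the comparison broadcasts row-wise across columns)
mask = np.random.rand(*df.shape) < p
df = df.mask(mask)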

I guess I am a little late to the party, but if someone needs a solution that is faster and takes the percentage value into account when introducing null values, here is the code:

import numpy as np

nan_percent = {'A': 0.15, 'B': 0.05, 'C': 0.23}

for col, perc in nan_percent.items():
    # temporary indicator column: 1 with probability perc, 0 otherwise
    df['null'] = np.random.choice([0, 1], size=df.shape[0], p=[1 - perc, perc])
    df.loc[df['null'] == 1, col] = np.nan

df.drop(columns=['null'], inplace=True)
Divertimento answered 25/4, 2021 at 9:7 Comment(0)

Here is a way to get as close to 20% nan in each column as possible:

def input_nan(x, pct):
    # number of additional NaNs needed to bring column x up to the target percentage
    n = int(len(x) * (pct - x.isna().mean()))
    # draw n positions at random, restricted to values that are not already NaN
    idxs = np.random.choice(len(x), max(n, 0), replace=False, p=x.notna() / x.notna().sum())
    x.iloc[idxs] = np.nan

df.apply(input_nan, pct=.2)

It first takes the difference between the NaN percentage you want and the percentage of NaN values already in your dataset, then multiplies that by the length of the column, which gives the number of NaN values to insert (n). It then uses np.random.choice, which randomly chooses n positions from among the values that are not already NaN.

Example:

df = pd.DataFrame({'y': np.random.randn(10), 'x1': np.random.randn(10), 'x2': np.random.randn(10)})
df.loc[1, 'y'] = np.nan
df.loc[8, 'y'] = np.nan
df.loc[5, 'x2'] = np.nan

#           y        x1        x2
# 0  2.635094  0.800756 -1.107315
# 1       NaN  0.055017  0.018097
# 2  0.673101 -1.053402  1.525036
# 3  0.246505  0.005297  0.289559
# 4  0.883769  1.172079  0.551917
# 5 -1.964255  0.180651       NaN
# 6 -0.247067  0.431622 -0.846953
# 7  0.603750  0.475805  0.524619
# 8       NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071  0.029515

df.apply(input_nan, pct=.2)

#           y        x1        x2
# 0  2.635094  0.800756 -1.107315
# 1       NaN  0.055017  0.018097
# 2  0.673101 -1.053402  1.525036
# 3  0.246505  0.005297       NaN
# 4  0.883769  1.172079  0.551917
# 5 -1.964255       NaN       NaN
# 6 -0.247067  0.431622 -0.846953
# 7  0.603750       NaN  0.524619
# 8       NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071  0.029515

I have applied it to the whole dataset, but you can apply it to any column you want. For example, if you wanted 15% NaNs in columns y and x1, you could call df[['y','x1']].apply(input_nan, pct=.15).

Minimal answered 23/1, 2019 at 16:31 Comment(4)
I see, but I have a large dataset with > 100 columns; I can do this for each column.Biff
That did not change my df; I calculated the NaN rate before and after for each column and there was no change.Biff
Hmm, it works on the df in my example. What did you call, and what was the NaN percent before and after for the columns you called it on?Minimal
Just applied the function to a dataframe of 10,000 observations (that had mixed NaN percentages before applying the function). Calling df.isna().mean() gives the desired result: y = 0.2, x1 = 0.2, x2 = 0.2Minimal
