Filling missing data by randomly choosing from non-missing values in a pandas DataFrame

9

13

I have a pandas DataFrame with several missing values. I noticed that the non-missing values are close to each other, so I would like to impute the missing values by randomly choosing from the non-missing values.

For instance:

import pandas as pd
import random
import numpy as np

foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B':[np.nan, 4, 2, np.nan, 5]})
foo
    A   B
0   2 NaN
1   3   4
2 NaN   2   
3   5 NaN
4 NaN   5

For instance, I would like foo['A'][2] = 2 and foo['A'][4] = 3. The shape of my pandas DataFrame is (6940, 154). I tried this:

foo['A'] = foo['A'].fillna(random.choice(foo['A'].values.tolist()))

But it is not working. Could you help me achieve that? Best regards.

Candlestick answered 4/4, 2016 at 21:34 Comment(0)

11

You can use the pandas fillna method together with random.choice to fill the missing values with a random selection from a particular column.

import random
import numpy as np

df["column"].fillna(lambda x: random.choice(df[df[column] != np.nan]["column"]), inplace =True)

Here, column is the name of the column whose NaN values you want to fill with randomly chosen non-NaN values.
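
Note that the snippet above may not do what is intended as written: fillna expects a scalar, dict, Series, or DataFrame rather than a function, and a comparison such as `!= np.nan` is true for every element, so it filters nothing. Below is a minimal sketch of one way to get an independent random draw for each gap, by passing fillna a Series indexed like the missing positions. The helper name fill_column_randomly and its seed parameter are made up for illustration, and the sketch assumes the Series has a unique index.

```python
import numpy as np
import pandas as pd


def fill_column_randomly(series, seed=None):
    """Return a copy of `series` where each NaN is replaced by an
    independent random draw from the non-missing values.

    Hypothetical helper for illustration; assumes a unique index.
    """
    rng = np.random.default_rng(seed)
    observed = series.dropna().to_numpy()
    missing_index = series.index[series.isna()]
    # Nothing to draw from, or nothing to fill: return an unchanged copy.
    if len(observed) == 0 or len(missing_index) == 0:
        return series.copy()
    # One draw (with replacement) per missing position, aligned by index label.
    draws = pd.Series(rng.choice(observed, size=len(missing_index)),
                      index=missing_index)
    return series.fillna(draws)


foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan],
                    'B': [np.nan, 4, 2, np.nan, 5]})
foo['A'] = fill_column_randomly(foo['A'], seed=0)
print(foo)
```

Because fillna aligns the replacement Series on the index, only the rows that were missing are touched and the observed values stay as they were.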

Acarus answered 4/4, 2016 at 22:2 Comment(5)

I tried it, but instead of imputing the values it inserts <function <lambda> at 0x7fa4eb48b9b0>. – Candlestick 4/4, 2016 at 22:5

sorry, can you provide some sample data? – Acarus 4/4, 2016 at 22:8

I have edited my question with a sample data. Thanks – Candlestick 4/4, 2016 at 22:15

I have found the answer. I did this: foo = foo.apply(lambda x: x.fillna(random.choice(x.dropna())), axis=1). Your answer gave me the clue. Thank you very much for your help. – Candlestick 4/4, 2016 at 22:25

no worries. Glad I could help :) It was a bit confusing. – Acarus 4/4, 2016 at 22:32

7

This works well for me on a pandas DataFrame:

import random

def randomiseMissingData(df2):
    "randomise missing data for DataFrame (within a column)"
    df = df2.copy()
    for col in df.columns:
        data = df[col]
        mask = data.isnull()
        # draw one observed value (with replacement) for every missing entry
        samples = random.choices(data[~mask].values, k=mask.sum())
        data[mask] = samples

    return df

Cowper answered 10/12, 2017 at 23:52 Comment(2)

For a pandas DataFrame this is a smart way of doing it, as the statistics of the sampled data reflect, by definition, the statistics of the original data. This way you can fill the gaps while maintaining the same stats. – Denham 24/8, 2018 at 14:57

Better assign with df.loc[mask, col] = samples to avoid warnings – Virile 6/5, 2020 at 0:59

6

I did this for filling NaN values with a random non-NaN value:

import random

# convert to a plain list so random.choice indexes by position, not by label
df['column'] = df['column'].fillna(random.choice(df['column'].dropna().tolist()))

Macula answered 23/4, 2021 at 18:32 Comment(0)

3

This is another approach to this question, an improvement on the first answer, based on how to check whether a NumPy value is NaN as described in the NumPy documentation:

foo['A'].apply(lambda x: np.random.choice(range(int(foo['A'].min()), int(foo['A'].max()))) if np.isnan(x) else x)

Sidetrack answered 23/5, 2017 at 17:40 Comment(0)

1

Here is another Pandas DataFrame approach

import numpy as np
def fill_with_random(df2, column):
    '''Fill `df2`'s column with name `column` with random data based on non-NaN data from `column`'''
    df = df2.copy()
    df[column] = df[column].apply(lambda x: np.random.choice(df[column].dropna().values) if np.isnan(x) else x)
    return df

Twirp answered 2/1, 2018 at 16:40 Comment(0)

0

For me only this worked; all the examples above failed. Some filled in the same number everywhere, and some didn't fill anything.

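def fill_sample(df, col):
    # sample as many observed values as there are missing entries
    tmp = df[df[col].notna()][col].sample(len(df[df[col].isna()])).values
    k = 0
    # write one sampled value into each row where col is missing
    for i, row in df[df[col].isna()].iterrows():
        df.at[i, col] = tmp[k]
        k += 1
    return df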
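
One caveat with the function above: Series.sample draws without replacement by default, so it raises an error when a column has more missing entries than observed ones. Below is a minimal sketch of a variant that samples with replacement and assigns in a single step instead of looping over rows. The name fill_sample_with_replacement and the seed parameter are hypothetical, and unlike the function above it returns a modified copy rather than changing df in place.

```python
import numpy as np
import pandas as pd


def fill_sample_with_replacement(df, col, seed=None):
    """Return a copy of `df` where NaNs in `col` are filled with values
    sampled (with replacement) from the observed values of `col`.

    Hypothetical helper for illustration.
    """
    out = df.copy()
    mask = out[col].isna()
    n_missing = int(mask.sum())
    # Nothing to fill, or nothing to sample from: return the copy unchanged.
    if n_missing == 0 or mask.all():
        return out
    draws = out.loc[~mask, col].sample(n=n_missing, replace=True,
                                       random_state=seed)
    # Use the raw values so the assignment is positional, not index-aligned.
    out.loc[mask, col] = draws.to_numpy()
    return out


# Three gaps but only one observed value: needs replace=True to work.
df = pd.DataFrame({'A': [1.0, np.nan, np.nan, np.nan]})
filled = fill_sample_with_replacement(df, 'A', seed=1)
print(filled)
```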

Urey answered 22/7, 2021 at 20:32 Comment(1)

Please don't embed code as a screenshot. Instead, paste it as text, and use Markdown to format it as code. That makes it easier to read, copy, and paste. It also helps ensure that it shows up in search results. – Busterbustle 22/7, 2021 at 20:40

0

What I ended up doing, which worked, was:

foo = foo.apply(lambda x: x.fillna(random.choice(x.dropna())), axis=1)
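
A point to be aware of: with axis=1, apply passes each row to the lambda, so a gap is filled from the other columns of the same row rather than from its own column. If the goal is to draw from the non-missing values of the same column, a column-wise variant could look like the sketch below. The helper name fill_from_own_values is made up for illustration, and the pool is converted to a list so that random.choice indexes by position rather than by index label. Like the one-liner above, it makes a single draw per call, so all gaps in one column receive the same value.

```python
import random

import numpy as np
import pandas as pd

foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan],
                    'B': [np.nan, 4, 2, np.nan, 5]})


def fill_from_own_values(col):
    """Fill the NaNs of one column with a single value drawn at random
    from that column's observed values (hypothetical helper)."""
    pool = col.dropna().tolist()
    # Leave an all-NaN column untouched, since there is nothing to draw from.
    return col.fillna(random.choice(pool)) if pool else col


# The default axis=0 hands each column to the function in turn.
filled = foo.apply(fill_from_own_values)
print(filled)
```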

Candlestick answered 24/7, 2021 at 8:4 Comment(0)

0

Not the most concise, but probably the most performant way to go:

nans = df[col].isna()
non_nans = df.loc[df[col].notna(), col]
samples = np.random.choice(non_nans, size=nans.sum())
df.loc[nans, col] = samples
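
For completeness, here is a sketch of how the four lines above might be wrapped to cover every column of a DataFrame, as in the question. The function name fill_all_columns and the seed parameter are assumptions for illustration; it uses NumPy's Generator API so that results can be reproduced, and it skips columns that have no gaps or no observed values.

```python
import numpy as np
import pandas as pd


def fill_all_columns(df, seed=None):
    """Return a copy of `df` where, column by column, every NaN is replaced
    by an independent random draw from that column's observed values.

    Hypothetical wrapper for illustration.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        nans = out[col].isna()
        non_nans = out.loc[~nans, col].to_numpy()
        # Skip columns with no gaps or with nothing to sample from.
        if nans.any() and len(non_nans) > 0:
            out.loc[nans, col] = rng.choice(non_nans, size=int(nans.sum()))
    return out


foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan],
                    'B': [np.nan, 4, 2, np.nan, 5]})
filled = fill_all_columns(foo, seed=42)
print(filled)
```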

Adjectival answered 16/9, 2022 at 13:6 Comment(0)

0

Replacing NaN with a random number from the range:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan, 6, 7, np.nan, 9]})

min_value = 0
max_value = 10

df['A'] = df['A'].apply(lambda x: np.random.randint(min_value, max_value) if pd.isnull(x) else x)

print(df)

Say answered 3/10, 2023 at 13:9 Comment(1)

Welcome to Stack Overflow. Please don't forget to format your code, using Markdown help – Abarca 5/10, 2023 at 8:4
