pandas DataFrame: replace nan values with average of columns
Asked Answered
M

12

305

I've got a pandas DataFrame filled mostly with real numbers, but there is a few nan values in it as well.

How can I replace the nans with averages of columns where they are?

This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.

Melville answered 8/9, 2013 at 23:54 Comment(0)
M
409

You can simply use DataFrame.fillna to fill the nan's directly:

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

In [28]: df.mean()
Out[28]: 
A   -0.151121
B   -0.231291
C   -0.530307
dtype: float64

In [29]: df.fillna(df.mean())
Out[29]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325  1.533582
4 -0.151121 -0.231291  0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858  1.033826 -0.530307
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

The docstring of fillna says that value should be a scalar or a dict, however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().

Moony answered 9/9, 2013 at 5:27 Comment(7)
df.fillna(df.mean()) will return the new dataframe, so you will have to write df=df.fillna(df.mean()) to keep it.Diallage
any ideas why I might be getting the wrong amount imputed for the mean using this?Funkhouser
Instead of df=df.fillna(df.mean()) you also could use df.fillna(df.mean(), inplace=True)Terceira
CAUTION: if you want to use this for Machine Learning / Data Science: from a Data Science perspective it is wrong to first replace NA and then split into train and test... You MUST first split into train and test, then replace NA by mean on train and then apply this stateful preprocessing model to test, see the answer involving sklearn below!Insectivore
@FabianWerner. Why is it wrong? I think its a better idea if NaNs are replaced by global average rather than train_mean and test_mean separately, as train/test_mean can be misleading.Quadrisect
@Quadrisect because otherwise you are leaking information from the test set into the training set! Imagine it like this: We have 100 data rows and we consider column x. The first 99 entries of x are NA. We want to split off row 100 as a test set. Let's assume row 100 has value 20 in column x. Then you will replace all entries in the training set in column x with 20, a value coming 100% from the test set. Hence, the evaluation might fool you!Insectivore
Curious, why is it not df.fillna(df.mean(1)) if we want the column average? Doesn't it default to the index by default? Edit: I see that just putting () works fine, I just don't understand why...Fantasm
M
101

Try:

sub2['income'].fillna((sub2['income'].mean()), inplace=True)
Maestro answered 16/10, 2015 at 20:18 Comment(2)
For someone wondering about inplace = True: If it is True it the original object is modified with this change. If it is False (default) the function doesn't modify the original object, instead it returns a modified copy of it and you have to assign it to the original object to replace it.Sorghum
Hi, can you please check this "#76706511"Nutbrown
H
50

Although, the below code does the job, BUT its performance takes a big hit, as you deal with a DataFrame with # records 100k or more:

df.fillna(df.mean())

In my experience, one should replace NaN values (be it with Mean or Median), only where it is required, rather than applying fillna() all over the DataFrame.

I had a DataFrame with 20 variables, and only 4 of them required NaN values treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (code 2), where i ran it selectively .i.e. only on variables which had a NaN value

#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----

df.fillna(df.mean())

#------------------------------------------------
#----(Code 2) Selective Treatment----------------

for i in df.columns[df.isnull().any(axis=0)]:     #---Applying Only on variables with NaN values
    df[i].fillna(df[i].mean(),inplace=True)

#---df.isnull().any(axis=0) gives True/False flag (Boolean value series), 
#---which when applied on df.columns[], helps identify variables with NaN values

Below is the performance i observed, as i kept on increasing the # records in DataFrame

DataFrame with ~100k records

  • Code 1: 22.06 Seconds
  • Code 2: 0.03 Seconds

DataFrame with ~200k records

  • Code 1: 180.06 Seconds
  • Code 2: 0.06 Seconds

DataFrame with ~1.6 Million records

  • Code 1: code kept running endlessly
  • Code 2: 0.40 Seconds

DataFrame with ~13 Million records

  • Code 1: --did not even try, after seeing performance on 1.6 Mn records--
  • Code 2: 3.20 Seconds

Apologies for a long answer ! Hope this helps !

Haberdashery answered 29/7, 2020 at 17:8 Comment(2)
I have timed it all above methods and yours is the fastest. Thanks man.Posen
Fast!!!!!!!!!!!Imaginal
F
42
In [16]: df = DataFrame(np.random.randn(10,3))

In [17]: df.iloc[3:5,0] = np.nan

In [18]: df.iloc[4:6,1] = np.nan

In [19]: df.iloc[5:8,2] = np.nan

In [20]: df
Out[20]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3       NaN -0.985188 -0.324136
4       NaN       NaN  0.238512
5  0.769657       NaN       NaN
6  0.141951  0.326064       NaN
7 -1.694475 -0.523440       NaN
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794

In [22]: df.mean()
Out[22]: 
0   -0.251534
1   -0.040622
2   -0.841219
dtype: float64

Apply per-column the mean of that columns and fill

In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622  0.238512
5  0.769657 -0.040622 -0.841219
6  0.141951  0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
Franzen answered 9/9, 2013 at 0:15 Comment(1)
I dont know why, but df.fillna(df.mean()) didnt work, only your version with apply. Python 3Bis
L
16

If you want to impute missing values with mean and you want to go column by column, then this will only impute with the mean of that column. This might be a little more readable.

sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
Locus answered 26/2, 2017 at 3:15 Comment(2)
Please provide some explanation of how this solves the problem.Romero
Is it possible to do a Group By for the mean as well?Andreaandreana
V
16
# To read data from csv file
Dataset = pd.read_csv('Data.csv')

X = Dataset.iloc[:, :-1].values

# To calculate mean use imputer class
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
Validity answered 10/7, 2017 at 18:54 Comment(2)
What's the advantage of all this over the far simpler alternatives?Cullet
@Roshan Jha It is always better to explain the logic. There are many ways to do the same task in R & Python. However, if you are suggesting something different, you may want to point out some advanatages of doing soAbey
F
14

Directly use df.fillna(df.mean()) to fill all the null value with mean

If you want to fill null value with mean of that column then you can use this

suppose x=df['Item_Weight'] here Item_Weight is column name

here we are assigning (fill null values of x with mean of x into x)

df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))

If you want to fill null value with some string then use

here Outlet_size is column name

df.Outlet_Size = df.Outlet_Size.fillna('Missing')
Flimflam answered 27/6, 2018 at 22:19 Comment(0)
H
14

Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column

Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']

If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:

Use method .fillna():

mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)

I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.

You should be careful when using the mean. If you have outliers is more recommendable to use the median

Hashimoto answered 4/2, 2019 at 14:2 Comment(0)
G
10

Another option besides those above is:

df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))

It's less elegant than previous responses for mean, but it could be shorter if you desire to replace nulls by some other column function.

Gammer answered 15/11, 2016 at 19:40 Comment(0)
T
3

using sklearn library preprocessing class

from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values = np.nan, strategy = 'mean', axis = 0)
missingvalues = missingvalues.fit(x[:,1:3])
x[:,1:3] = missingvalues.transform(x[:,1:3])

Note: In the recent version parameter missing_values value change to np.nan from NaN

Treehopper answered 12/3, 2020 at 3:58 Comment(0)
C
1

I use this method to fill missing values by average of a column.

fill_mean = lambda col : col.fillna(col.mean())

df = df.apply(fill_mean, axis = 0)
Creigh answered 25/5, 2021 at 9:28 Comment(0)
U
0

You can also use value_counts to get the most frequent values. This would work on different datatypes.

df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))

Here is the value_counts api reference.

Uxmal answered 7/7, 2021 at 9:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.