Although the code below does the job, its performance takes a big hit once you are dealing with a DataFrame of ~100k records or more:
df.fillna(df.mean())
In my experience, one should replace NaN values (be it with the mean or the median) only where required, rather than applying fillna() across the entire DataFrame.
I had a DataFrame with 20 variables, and only 4 of them required NaN treatment (replacement). I tried the above code (Code 1) along with a slightly modified version (Code 2), where I ran it selectively, i.e. only on the variables that actually had NaN values:
#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----
df.fillna(df.mean())
#------------------------------------------------
#----(Code 2) Selective Treatment----------------
for i in df.columns[df.isnull().any(axis=0)]: #---Applying only on variables with NaN values
    df[i] = df[i].fillna(df[i].mean())
#---df.isnull().any(axis=0) gives a True/False flag (a Boolean Series),
#---which, when applied to df.columns[], identifies the variables with NaN values
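As a side note, the same selective treatment can also be written without an explicit Python loop, by filling just the affected columns in one vectorized call. This is only a sketch of an alternative, not the code I benchmarked:
#------------------------------------------------
#----Vectorized variant of Code 2----------------
nan_cols = df.columns[df.isnull().any(axis=0)]  #---Columns that contain NaN values
df[nan_cols] = df[nan_cols].fillna(df[nan_cols].mean())  #---Fill only those columns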
Below is the performance I observed as I kept increasing the number of records in the DataFrame:
DataFrame with ~100k records
- Code 1: 22.06 Seconds
- Code 2: 0.03 Seconds
DataFrame with ~200k records
- Code 1: 180.06 Seconds
- Code 2: 0.06 Seconds
DataFrame with ~1.6 Million records
- Code 1: code kept running endlessly
- Code 2: 0.40 Seconds
DataFrame with ~13 Million records
- Code 1: --did not even try, after seeing performance on 1.6 Mn records--
- Code 2: 3.20 Seconds
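If you want to reproduce the comparison yourself, here is a minimal, self-contained sketch. The make_df helper, the column count, and the NaN fraction are illustrative choices of mine, not the exact setup behind the numbers above, and absolute timings will vary with your pandas version and hardware:
import time
import numpy as np
import pandas as pd

def make_df(n_rows, n_cols=20, n_nan_cols=4, seed=0):
    #---Synthetic DataFrame: the first n_nan_cols columns get ~10% NaNs
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.standard_normal((n_rows, n_cols)),
                      columns=[f"v{j}" for j in range(n_cols)])
    for col in df.columns[:n_nan_cols]:
        df.loc[rng.random(n_rows) < 0.1, col] = np.nan
    return df

df = make_df(100_000)

start = time.perf_counter()
df.fillna(df.mean())  #---Code 1: treatment on the overall DataFrame
print("Code 1:", time.perf_counter() - start, "seconds")

start = time.perf_counter()
for i in df.columns[df.isnull().any(axis=0)]:  #---Code 2: selective treatment
    df[i] = df[i].fillna(df[i].mean())
print("Code 2:", time.perf_counter() - start, "seconds")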
Apologies for the long answer! Hope this helps!
Note: df.fillna(df.mean()) returns a new DataFrame, so you will have to write df = df.fillna(df.mean()) to keep the result.
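To illustrate with a toy DataFrame (the values are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})
df.fillna(df.mean())       #---Returns a new DataFrame; df itself is unchanged
df = df.fillna(df.mean())  #---Assign it back to keep the filled values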