Python/Pandas Dataframe replace 0 with median value
Asked Answered
B

5

15

I have a python pandas dataframe with several columns and one column has 0 values. I want to replace the 0 values with the median or mean of this column.

data is my dataframe
artist_hotness is the column

mean_artist_hotness = data['artist_hotness'].dropna().mean()

if len(data.artist_hotness[ data.artist_hotness.isnull() ]) > 0:
data.artist_hotness.loc[ (data.artist_hotness.isnull()), 'artist_hotness'] = mean_artist_hotness

I tried this, but it is not working.

Brindle answered 29/5, 2016 at 5:21 Comment(0)
T
11

I think you can use mask and add parameter skipna=True to mean instead dropna. Also need change condition to data.artist_hotness == 0 if need replace 0 values or data.artist_hotness.isnull() if need replace NaN values:

import pandas as pd
import numpy as np

data = pd.DataFrame({'artist_hotness': [0,1,5,np.nan]})
print (data)
   artist_hotness
0             0.0
1             1.0
2             5.0
3             NaN

mean_artist_hotness = data['artist_hotness'].mean(skipna=True)
print (mean_artist_hotness)
2.0

data['artist_hotness']=data.artist_hotness.mask(data.artist_hotness == 0,mean_artist_hotness)
print (data)
   artist_hotness
0             2.0
1             1.0
2             5.0
3             NaN

Alternatively use loc, but omit column name:

data.loc[data.artist_hotness == 0, 'artist_hotness'] = mean_artist_hotness
print (data)
   artist_hotness
0             2.0
1             1.0
2             5.0
3             NaN

data.artist_hotness.loc[data.artist_hotness == 0, 'artist_hotness'] = mean_artist_hotness
print (data)

IndexingError: (0 True 1 False 2 False 3 False Name: artist_hotness, dtype: bool, 'artist_hotness')

Another solution is DataFrame.replace with specifying columns:

data=data.replace({'artist_hotness': {0: mean_artist_hotness}}) 
print (data)
    aa  artist_hotness
0  0.0             2.0
1  1.0             1.0
2  5.0             5.0
3  NaN             NaN 

Or if need replace all 0 values in all columns:

import pandas as pd
import numpy as np

data = pd.DataFrame({'artist_hotness': [0,1,5,np.nan], 'aa': [0,1,5,np.nan]})
print (data)
    aa  artist_hotness
0  0.0             0.0
1  1.0             1.0
2  5.0             5.0
3  NaN             NaN

mean_artist_hotness = data['artist_hotness'].mean(skipna=True)
print (mean_artist_hotness)
2.0

data=data.replace(0,mean_artist_hotness) 
print (data)
    aa  artist_hotness
0  2.0             2.0
1  1.0             1.0
2  5.0             5.0
3  NaN             NaN

If need replace NaN in all columns use DataFrame.fillna:

data=data.fillna(mean_artist_hotness) 
print (data)
    aa  artist_hotness
0  0.0             0.0
1  1.0             1.0
2  5.0             5.0
3  2.0             2.0

But if only in some columns use Series.fillna:

data['artist_hotness'] = data.artist_hotness.fillna(mean_artist_hotness) 
print (data)
    aa  artist_hotness
0  0.0             0.0
1  1.0             1.0
2  5.0             5.0
3  NaN             2.0
Tequilater answered 29/5, 2016 at 5:34 Comment(0)
H
19

use pandas replace method:

df = pd.DataFrame({'a': [1,2,3,4,0,0,0,0], 'b': [2,3,4,6,0,5,3,8]}) 

df 
   a  b
0  1  2
1  2  3
2  3  4
3  4  6
4  0  0
5  0  5
6  0  3
7  0  8

df['a']=df['a'].replace(0,df['a'].mean())

df
   a  b
0  1  2
1  2  3
2  3  4
3  4  6
4  1  0
5  1  5
6  1  3
7  1  8
Hokeypokey answered 29/5, 2016 at 5:53 Comment(0)
T
11

I think you can use mask and add parameter skipna=True to mean instead dropna. Also need change condition to data.artist_hotness == 0 if need replace 0 values or data.artist_hotness.isnull() if need replace NaN values:

import pandas as pd
import numpy as np

data = pd.DataFrame({'artist_hotness': [0,1,5,np.nan]})
print (data)
   artist_hotness
0             0.0
1             1.0
2             5.0
3             NaN

mean_artist_hotness = data['artist_hotness'].mean(skipna=True)
print (mean_artist_hotness)
2.0

data['artist_hotness']=data.artist_hotness.mask(data.artist_hotness == 0,mean_artist_hotness)
print (data)
   artist_hotness
0             2.0
1             1.0
2             5.0
3             NaN

Alternatively use loc, but omit column name:

data.loc[data.artist_hotness == 0, 'artist_hotness'] = mean_artist_hotness
print (data)
   artist_hotness
0             2.0
1             1.0
2             5.0
3             NaN

data.artist_hotness.loc[data.artist_hotness == 0, 'artist_hotness'] = mean_artist_hotness
print (data)

IndexingError: (0 True 1 False 2 False 3 False Name: artist_hotness, dtype: bool, 'artist_hotness')

Another solution is DataFrame.replace with specifying columns:

data=data.replace({'artist_hotness': {0: mean_artist_hotness}}) 
print (data)
    aa  artist_hotness
0  0.0             2.0
1  1.0             1.0
2  5.0             5.0
3  NaN             NaN 

Or if need replace all 0 values in all columns:

import pandas as pd
import numpy as np

data = pd.DataFrame({'artist_hotness': [0,1,5,np.nan], 'aa': [0,1,5,np.nan]})
print (data)
    aa  artist_hotness
0  0.0             0.0
1  1.0             1.0
2  5.0             5.0
3  NaN             NaN

mean_artist_hotness = data['artist_hotness'].mean(skipna=True)
print (mean_artist_hotness)
2.0

data=data.replace(0,mean_artist_hotness) 
print (data)
    aa  artist_hotness
0  2.0             2.0
1  1.0             1.0
2  5.0             5.0
3  NaN             NaN

If need replace NaN in all columns use DataFrame.fillna:

data=data.fillna(mean_artist_hotness) 
print (data)
    aa  artist_hotness
0  0.0             0.0
1  1.0             1.0
2  5.0             5.0
3  2.0             2.0

But if only in some columns use Series.fillna:

data['artist_hotness'] = data.artist_hotness.fillna(mean_artist_hotness) 
print (data)
    aa  artist_hotness
0  0.0             0.0
1  1.0             1.0
2  5.0             5.0
3  NaN             2.0
Tequilater answered 29/5, 2016 at 5:34 Comment(0)
N
2

Found these very useful, although mask is really slow (not sure why).

I did this:

df.loc[ df['artist_hotness'] == 0 | np.isnan(df['artist_hotness']), 'artist_hotness' ] = df['artist_hotness'].median()
Narton answered 19/6, 2018 at 21:27 Comment(0)
C
1
data['artist_hotness'] = data['artist_hotness'].map( lambda x : data.artist_hotness.mean() if x == 0 else x)
Camelopardalis answered 29/5, 2016 at 6:14 Comment(0)
L
0

I think below code will solve your problem in one line.

    data['artist_hotness'] = data['artist_hotness'].replace(0, 
    data['artist_hotness'].mean())
Lox answered 31/3, 2022 at 14:37 Comment(1)
While this code may solve the question, including an explanation of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please edit your answer to add explanations and give an indication of what limitations and assumptions apply.Anthropocentric

© 2022 - 2024 — McMap. All rights reserved.