Filling missing values by mean in each group

135

This should be straightforward, but the closest thing I've found is this post: pandas: Filling missing values within a group, and I still can't solve my problem.

Suppose I have the following dataframe

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})

  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3

and I'd like to fill in the NaN values with the mean value within each "name" group, i.e.

  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3

I'm not sure where to go after:

grouped = df.groupby('name').mean()
Staple answered 13/11, 2013 at 22:43 Comment(0)
140

One way would be to use transform:

>>> df
  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3
Oddity answered 13/11, 2013 at 22:51 Comment(6)
I found it helpful when starting out to sit down and read through the docs. This one is covered in the groupby section. There's too much stuff to remember, but you pick up rules like "transform is for per-group operations which you want indexed like the original frame" and so on.Oddity
Also look for the Wes McKinney book. Personally I think the docs on groupby are abysmal, the book is marginally better.Hydrogenize
if you have more than two columns, make sure to specify the column name df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))['value']Jerry
@Jerry Good point. I'd like to add that for performance reasons you might consider moving the value column selection left, onto the group-by clause. That way the lambda function is only called for values in that particular column, rather than for every column with all but one thrown away afterwards. Did a test and it was twice as fast when using two columns. And naturally you get better performance the more columns you don't need to impute: df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))Conifer
I have been searching for this for two days. Just a question: why is it so hard to do this with loops? In my case there are two index levels, State and Age_Group, and I am trying to fill missing values in those groups with group means (take the mean within the same state and age group, and fill the missing values in that group). ThanksConsuelaconsuelo
Oh never mind I see the generalized solution thanks to @AndréC.AndersenConsuelaconsuelo
N
114

fillna + groupby + transform + mean

This seems intuitive:

df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))

The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.
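
To see what that mapping looks like, here is the intermediate Series on the question's dataframe; because it carries the original index, fillna can align it row by row:

>>> df.groupby('name')['value'].transform('mean')
0    1.0
1    1.0
2    2.0
3    2.0
4    2.0
5    2.0
6    3.0
7    3.0
8    3.0
Name: value, dtype: float64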

Nolita answered 16/11, 2018 at 13:59 Comment(4)
Thanks! I find the lambda function a little bit confusing and yours much more understandable.Paton
Nice solution. My groupby returns 73k groups. So in other words it needed to find the mean of 73k groups in order to fill in the NA values for each group. My main concern here is timing as I want to easily scale it to more than 73k groups. The lambda solution took 21.39 seconds to finish while this solution took 0.27 seconds. Highly recommend going for this solution!Fernery
does df = df.fillna(df.groupby('name').transform('mean')) do this successfully for all columns? I'm using that and it looks alright, but I'm afraid I'm doing something wrong, since everyone here does it per column?Lessor
I had a really big dataset and this solution saved me hours compared to df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean())). Thank you!Addict
27

@DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: multiple columns to group by, and multiple value columns:

df = pd.DataFrame(
    {
        'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
        'name': ['A','A', 'B','B','B','B', 'C','C','C'],
        'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
        'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    }
)

... gives ...

  category name  other_value value
0        X    A         10.0   1.0
1        X    A          NaN   NaN
2        X    B          NaN   NaN
3        X    B         20.0   2.0
4        X    B         30.0   3.0
5        X    B         10.0   1.0
6        Y    C         30.0   3.0
7        Y    C          NaN   NaN
8        Y    C         30.0   3.0

In this generalized case we would like to group by category and name, and impute only on value.

This can be solved as follows:

df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))

Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation run only on that particular column. You could add the column selection at the end instead, but then the transformation would run for all columns only to throw away all but one measure column afterwards. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to.

A performance test, in which the dataset is blown up by doing ...

big_df = None
for _ in range(10000):
    if big_df is None:
        big_df = df.copy()
    else:
        big_df = pd.concat([big_df, df])
df = big_df

... confirms that the speedup is proportional to the number of columns you don't have to impute:

import pandas as pd
from datetime import datetime

def generate_data():
    ...

t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)

# 0:00:00.016012

t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
    .transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)

# 0:00:00.030022
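
As an aside, the blow-up loop above can be written as a single concat call; a minimal equivalent sketch:

big_df = pd.concat([df] * 10000)  # 10,000 stacked copies, same result as the loop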

On a final note, you can generalize even further if you want to impute more than one column, but not all (note the double brackets, since recent pandas versions require a list for selecting multiple columns):

df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
    .transform(lambda x: x.fillna(x.mean()))
Accompanist answered 28/7, 2017 at 12:6 Comment(3)
Thank you for this great work. I am wondering how I could achieve the same transformation using for loops. Speed is not my concern since I am trying to find manual methods. Thanks @AndréC.AndersenConsuelaconsuelo
Hi @andre-c-andersen, I am trying to use your method on the planets dataset, but it's not imputing all the values. Not sure why: https://stackoverflow.com/questions/73449902/fill-in-missing-values-with-groupby/73450241Val
@Val I looked into your question and made an answer.Conifer
17

Shortcut:

Groupby + Apply + Lambda + Fillna + Mean

>>> df['value1'] = df.groupby('name')['value'].apply(lambda x: x.fillna(x.mean()))
>>> df['value1'].isnull().sum()
0

This solution still works if you want to group by multiple columns to replace missing values.

>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan, np.nan, 4, 3],
        'name': ['A','A', 'B','B','B','B', 'C','C','C'],
        'class': list('ppqqrrsss')})

>>> df['value'] = df.groupby(['name','class'])['value'].apply(lambda x: x.fillna(x.mean()))

>>> df
   value name class
0    1.0    A     p
1    1.0    A     p
2    2.0    B     q
3    2.0    B     q
4    3.0    B     r
5    3.0    B     r
6    3.5    C     s
7    4.0    C     s
8    3.0    C     s
 
Novice answered 2/12, 2019 at 13:30 Comment(0)
14

I'd do it this way

df.loc[df.value.isnull(), 'value'] = df.groupby('name').value.transform('mean')
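
This works because .loc with a boolean mask aligns the full-length Series of group means on the index and only writes the masked rows. The mask on the question's dataframe:

>>> df.value.isnull()
0    False
1     True
2     True
3    False
4    False
5    False
6    False
7     True
8    False
Name: value, dtype: bool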
Rapacious answered 18/11, 2016 at 17:18 Comment(1)
A slightly different version of this: df['value_imputed'] = np.where(df.value.isnull(), df.groupby('name').value.transform('mean'), df.value)Mickel
6

The featured, highly ranked answer only works for a pandas DataFrame with exactly two columns. If you have more columns, use instead:

df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
    lambda x: x.fillna(x.mean()))
Raft answered 13/10, 2016 at 8:52 Comment(1)
This answer worked for me, thanks. Also, for anyone new to pandas: you can also index using bracket notation, df.groupby("continent")['Crude_Birth_rate']... I believe this is the suggested conventionTwila
5

To summarize the above with regard to efficiency: I have a dataset with 97,906 rows and 48 columns, and I want to fill 4 of those columns with the median of each group. The column I group by has 26,200 distinct groups.

The first solution

import time

start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
# 0.10429811477661133 seconds

The second solution

start = time.time()
for col in continuous_variables:
    df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
# 0.5098445415496826 seconds

I ran the next solution only on a subset (the first 10,000 rows), since it was taking too long.

start = time.time()
for col in continuous_variables:
    x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
# 11.685635566711426 seconds

The following solution uses the same lambda logic, applied to all the columns at once.

start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
# 42.630549907684326 seconds

So it's quite important to choose the right method. Also bear in mind that once a column was not numeric, I noticed the times climbing sharply (which makes sense, since I was computing the median).
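
If you hit that slowdown, it may be worth coercing the affected columns to numeric first; a sketch, assuming the columns hold numeric-like data:

for col in continuous_variables:
    # errors='coerce' turns unparseable entries into NaN, keeping median() numeric
    df_merged[col] = pd.to_numeric(df_merged[col], errors='coerce')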

Fernery answered 1/4, 2021 at 12:39 Comment(0)
3

I know this is an old question, but I am quite surprised by the unanimity of apply/lambda answers here.

Generally speaking, from a timing point of view, that is the second-worst thing you can do after iterating over rows.

What I would do here is

df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')

Or using fillna

df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))

I've checked with timeit (because, again, the unanimity for apply/lambda-based solutions made me doubt my instinct). And it is indeed about 2.5× faster than the most upvoted solutions.
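
A minimal sketch of how such a timeit comparison might look (exact numbers will vary with machine and data size):

import timeit

setup = (
    "import pandas as pd, numpy as np\n"
    "df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],\n"
    "                   'name': ['A','A','B','B','B','B','C','C','C']})"
)
vectorized = "df['value'].fillna(df.groupby('name')['value'].transform('mean'))"
with_lambda = "df.groupby('name')['value'].transform(lambda x: x.fillna(x.mean()))"
print(timeit.timeit(vectorized, setup=setup, number=1000))
print(timeit.timeit(with_lambda, setup=setup, number=1000))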

Finnougric answered 20/11, 2022 at 16:22 Comment(0)
2
def groupMeanValue(group):
    # fill NaNs in 'value' with this group's own mean
    group['value'] = group['value'].fillna(group['value'].mean())
    return group

# note: the grouping key 'name' is not part of the transformed result
dft = df.groupby("name").transform(groupMeanValue)
Jus answered 9/3, 2016 at 14:36 Comment(0)
0

To fill all the numeric null values with the mean grouped by "name" (selecting the numeric columns on the groupby as well, so the transform's output lines up with num_cols):

num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name")[list(num_cols)].transform(lambda x: x.fillna(x.mean()))
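
For example, on a hypothetical frame with an extra text column, the text column is left untouched:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name':  ['A', 'A', 'B', 'B'],
    'label': ['x', 'y', 'x', 'y'],          # object column, excluded from imputation
    'value': [1.0, np.nan, 2.0, np.nan],
})
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby('name')[list(num_cols)].transform(lambda x: x.fillna(x.mean()))
# value is now [1.0, 1.0, 2.0, 2.0]; 'name' and 'label' are unchanged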
Allegorize answered 9/1, 2023 at 15:36 Comment(0)
0

Here is an easy-to-understand way to do it.

Use groupby + set_index + fillna. Since fillna with a DataFrame argument aligns on both index and columns, once 'name' is the index each group's mean lands on the matching rows.

import pandas as pd
import numpy as np
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
print(df)
name_mean = df.groupby('name').mean()
df.set_index('name', inplace=True)
df = df.fillna(name_mean)
df.reset_index(inplace=True)
print(df)

Result

   value name
0    1.0    A
1    NaN    A
2    NaN    B
3    2.0    B
4    3.0    B
5    1.0    B
6    3.0    C
7    NaN    C
8    3.0    C
  name  value
0    A    1.0
1    A    1.0
2    B    2.0
3    B    2.0
4    B    3.0
5    B    1.0
6    C    3.0
7    C    3.0
8    C    3.0
Felsite answered 5/1 at 22:23 Comment(0)
-1
df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)
Burned answered 4/10, 2018 at 18:11 Comment(3)
Please give some explanation of your answer. Why should someone who stumbles upon this page from google use your solution over the other 6 answers?Faultless
@vino please add some explanationTangerine
That would be an interesting solution if it worked. It is the only one that does not rely on apply or lambdas (which lead to quite slow execution, because they imply iteration in the Python world rather than in the C world). But the problem is that it doesn't work. It just produces a table associating index 0 with the mean of the As (1), index 1 with the mean of the Bs (2), and index 2 with the mean of the Cs (3). Then fillna replaces, among rows 0, 1 and 2 of df, the NaN values with the matching values in this mean table. So it fills row 1 with value 2 and row 2 with value 3, which are both wrong, and leaves row 7 as NaN.Finnougric
-1

You can also use dataframe_name.apply(lambda x: x.fillna(x.mean())). Note that without a groupby this fills each column with its overall mean, not the per-group mean.

Hunterhunting answered 28/9, 2019 at 19:10 Comment(0)
