Pandas lambda function with NaN support [duplicate]

I am trying to write a lambda function in Pandas that checks whether Col1 is NaN and, if so, uses another column's data. I am having trouble getting the code below to run correctly.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': [1, 2, 3, np.NaN], 'Col2': [7, 8, 9, 10]})  
df2 = df.apply(lambda x: x['Col2'] if x['Col1'].isnull() else x['Col1'], axis=1)
# fails: x['Col1'] is a scalar float here, and scalars have no .isnull() method

Does anyone have a good idea of how to write a solution like this with a lambda function, or have I exceeded the abilities of lambda? If not, do you have another solution?

Bleb answered 19/5, 2017 at 4:42 Comment(1)
This doesn't seem to work on my large DataFrame, but does work in an example. Could this be because my actual data set has a different data type, so fillna won't work correctly? Both Col1 and Col2 in my actual set are dtype('O'), so it shouldn't be a problem.Bleb

You need pandas.isnull to check if a scalar is NaN:

df = pd.DataFrame({'Col1': [1, 2, 3, np.NaN],
                   'Col2': [8, 9, 7, 10]})

df2 = df.apply(lambda x: x['Col2'] if pd.isnull(x['Col1']) else x['Col1'], axis=1)

print(df)
   Col1  Col2
0   1.0     8
1   2.0     9
2   3.0     7
3   NaN    10

print(df2)
0     1.0
1     2.0
2     3.0
3    10.0
dtype: float64

But it is better to use Series.combine_first:

df['Col1'] = df['Col1'].combine_first(df['Col2'])

print(df)
   Col1  Col2
0   1.0     8
1   2.0     9
2   3.0     7
3  10.0    10
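
One thing to keep in mind with combine_first: the result index is the union of the two indexes, so Series with different indexes can give a longer result. A minimal sketch with made-up Series:

import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan], index=[0, 1])
s2 = pd.Series([10.0, 20.0, 30.0], index=[1, 2, 3])

# The NaN at index 1 is filled from s2; indexes 2 and 3 are appended
# because the result index is the union of both indexes.
print(s1.combine_first(s2))
0     1.0
1    10.0
2    20.0
3    30.0
dtype: float64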

Another solution is Series.update (note that update overwrites Col1 with every non-NA value from Col2, not only where Col1 is NaN, which is why all four rows change below):

df['Col1'].update(df['Col2'])
print(df)
   Col1  Col2
0   8.0     8
1   9.0     9
2   7.0     7
3  10.0    10
Glair answered 19/5, 2017 at 5:10 Comment(9)
Thanks. Did you mean for your else in your first lambda method to be Col1 or Col2?Bleb
Hmmm, I think it is Col2 - it means get the value of Col2 if the condition is True, else get the value of Col1.Glair
But I prefer the other solutions if you need to replace NaNs by another column.Glair
Your first two methods work flawlessly. Just out of curiosity, why do you think it's better to use Series.combine_first rather than a lambda function on the df?Bleb
Because it is a faster, vectorized function. With a small DataFrame (100 rows) it is no problem, but with 1M rows there is a huge difference (see the timing sketch after these comments).Glair
Thanks--appreciate your replies! Still learning. On my actual data, the lambda function was an order of magnitude slower (8s vs. 0.8s).Bleb
Thank you for the comment. apply is obviously used when there is no pandas function for the task and you need to write it yourself. Good luck!Glair
Why combine_first instead of fillna? combine_first works in this case because the two series have the same index, but in other cases, you could get a longer series returned since "Result index will be the union of the two indexes" (from docs).Lyonnesse
@Lyonnesse - Yes, we are working with the same DataFrame, so the indices are the same. Generally you are right, fillna should be used.Glair
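
A rough timing sketch of the difference discussed in these comments (the 1M-row frame is synthetic and the numbers you get depend on the machine; apply with axis=1 loops over rows in Python, while combine_first works on whole arrays):

import time
import numpy as np
import pandas as pd

n = 1_000_000
big = pd.DataFrame({'Col1': np.where(np.random.rand(n) < 0.5, np.nan, 1.0),
                    'Col2': np.random.rand(n)})

t0 = time.perf_counter()
big.apply(lambda x: x['Col2'] if pd.isnull(x['Col1']) else x['Col1'], axis=1)
t1 = time.perf_counter()
big['Col1'].combine_first(big['Col2'])
t2 = time.perf_counter()

print(f"apply:         {t1 - t0:.2f}s")   # typically seconds
print(f"combine_first: {t2 - t1:.2f}s")   # typically a small fraction of a second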

The correct solution to this problem is:

df['Col1'].fillna(df['Col2'], inplace=True)
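
If you prefer to avoid inplace (on newer pandas versions with copy-on-write enabled, this chained inplace call may warn or not modify df at all), the same fill can be written as a plain assignment:

df['Col1'] = df['Col1'].fillna(df['Col2'])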
Tenantry answered 19/5, 2017 at 4:48 Comment(2)
This doesn't seem to work on my large DataFrame, but does work in an example. Could this be because my actual data set has a different data type, so fillna won't work correctly? Both Col1 and Col2 in my actual set are dtype('O'), so it shouldn't be a problem.Bleb
Works for me with object data types also. What's the issue when you use the actual dataset?Tenantry
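
For reference on the dtype('O') question above: fillna does fill object-dtype columns, but only entries pandas actually treats as missing (None or NaN); placeholder strings such as 'nan' or 'n/a' are left alone, which is a common reason it appears not to work on real data. A minimal sketch with made-up data:

import numpy as np
import pandas as pd

df_obj = pd.DataFrame({'Col1': ['a', None, np.nan, 'nan'],
                       'Col2': ['w', 'x', 'y', 'z']}, dtype='object')
print(df_obj['Col1'].fillna(df_obj['Col2']).tolist())
['a', 'x', 'y', 'nan']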

You need to use np.isnan()

#import numpy as np
df2 = df.apply(lambda x: 2 if np.isnan(x['Col1']) else 1, axis=1)   

df2
Out[1307]: 
0    1
1    1
2    1
3    2
dtype: int64
Catenate answered 19/5, 2017 at 4:59 Comment(2)
I was trying to round non-NaN values, and this worked whilst x is np.NaN didn't: df.age.apply(lambda x: x if np.isnan(x) else round(x))Munroe
@Munroe You should be using Pandas's builtin functions, which handle NaN automatically: df['age'].round(). Plus they're much faster. Secondly, don't use is on numbers in general; see numpy NaN not always recognized (which sounds specific to NaN, but the answers explain the more general problem).Lyonnesse
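
A small sketch of that point (the 'age' values are made up): Series.round propagates NaN on its own, so no explicit check is needed:

import numpy as np
import pandas as pd

s = pd.Series([20.4, np.nan, 31.6], name='age')
print(s.round().tolist())
[20.0, nan, 32.0]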

With pandas 0.24.2, I use

df.apply(lambda x: x['col_name'] if x['col1'] is np.nan else another_expression, axis=1)

because pd.isnull() didn't work for me.

In my work, I found the following phenomenon.

This produces no results:

df['prop'] = df.apply(lambda x: (x['buynumpday'] / x['cnumpday']) if pd.isnull(x['cnumpday']) else np.nan, axis=1)

This produces results:

df['prop'] = df.apply(lambda x: (x['buynumpday'] / x['cnumpday']) if x['cnumpday'] is not np.nan else np.nan, axis=1)

So far, I still don't know the deeper reason, but from these experiences: for an object, use is np.nan() or pd.isna(); for a float, use np.isnan() or pd.isna().
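
A minimal sketch of the scalar behaviour behind this (as the comments below point out, checking identity with is np.nan is not reliable; pd.isna is the safest choice for both object and float values):

import numpy as np
import pandas as pd

print(pd.isna(np.nan))    # True  - works for float NaN
print(pd.isna(None))      # True  - and for object-dtype missing values
print(np.isnan(np.nan))   # True  - np.isnan accepts numeric input only
# np.isnan(None) raises TypeError, which is why it breaks on object columns.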

Gracielagracile answered 31/5, 2019 at 10:6 Comment(2)
For is np.nan(), you meant is np.nan, right? I'm not sure if you might have meant np.isnan().Lyonnesse
pd.isnull() wasn't working for you because you just made a typo: if pd.isnull(...) should have been if not pd.isnull(...). Regarding is np.nan, that's not guaranteed to work; see numpy NaN not always recognized as well as dataframe.apply(lambda x: x is np.nan) does not work. If your column is object dtype, that's why it happened to work, but again it's not guaranteed, e.g. with s = pd.Series([1, np.nan, float('nan')], dtype='object'), s.apply(lambda x: x is np.nan).values gives [False True False] while s.isna().values gives [False True True].Lyonnesse

You can also use the NumPy function where with the condition "Col1 is NaN":

import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': [1, 2, 3, np.NaN], 'Col2': [7, 8, 9, 10]})

df['Col1'] = np.where(pd.isna(df['Col1']), df['Col2'], df['Col1'])
Solicitude answered 14/10, 2023 at 19:52 Comment(6)
Why? You can just use .fillna instead. Even if you don't want to use .fillna, why use NumPy when you can use Pandas? df['Col1'].mask(pd.isna(df['Col1']), df['Col2'])Lyonnesse
Thank you for the correction. The phrase "equals NaN" can be corrected to "is NaN." The reason I use this approach is that it's a common way to replace one column's values with another based on a specific condition.Solicitude
Right, I know it's common, but why not use the simpler way?Lyonnesse
I suggest another approach to deal with the issue of filling missing values. In my practice, there are other types of missing values, not only 'n/a.' My suggestion is more universal and can handle various types of missing values or incorrect values.Solicitude
Can't Series.mask do that too?Lyonnesse
Yes, it can, but what about the running time? My test shows 132 µs ± 455 for df mask and 81.6 µs ± 267 for numpy where.Solicitude
