Pandas lambda function with NaN support [duplicate]

I am trying to write a lambda function in Pandas that checks whether Col1 is NaN and, if so, uses another column's data. I am having trouble getting the code below to run correctly.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': [1, 2, 3, np.NaN], 'Col2': [7, 8, 9, 10]})  
df2 = df.apply(lambda x: x['Col2'] if x['Col1'].isnull() else x['Col1'], axis=1)
# fails: x['Col1'] is a scalar float here, and scalars have no .isnull() method

Does anyone have a good idea of how to write a solution like this with a lambda function, or have I exceeded the abilities of lambda? If not, do you have another solution?

Bleb answered 19/5, 2017 at 4:42 Comment(1)
This doesn't seem to work on my large DataFrame, but does work in an example. Could this be because my actual data set has a different data type, so fillna won't work correctly? Both Col1 and Col2 in my actual set are dtype('O'), so it shouldn't be a problem.Bleb

You need pandas.isnull to check if a scalar is NaN:

df = pd.DataFrame({'Col1': [1, 2, 3, np.NaN],
                   'Col2': [8, 9, 7, 10]})

df2 = df.apply(lambda x: x['Col2'] if pd.isnull(x['Col1']) else x['Col1'], axis=1)

print(df)
   Col1  Col2
0   1.0     8
1   2.0     9
2   3.0     7
3   NaN    10

print(df2)
0     1.0
1     2.0
2     3.0
3    10.0
dtype: float64

But it is better to use Series.combine_first:

df['Col1'] = df['Col1'].combine_first(df['Col2'])

print(df)
   Col1  Col2
0   1.0     8
1   2.0     9
2   3.0     7
3  10.0    10
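
One thing to keep in mind with combine_first: the result index is the union of the two indexes, so Series with different indexes can give a longer result. A minimal sketch with made-up Series:

import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan], index=[0, 1])
s2 = pd.Series([10.0, 20.0, 30.0], index=[1, 2, 3])

# The NaN at index 1 is filled from s2; indexes 2 and 3 are appended
# because the result index is the union of both indexes.
print(s1.combine_first(s2))
0     1.0
1    10.0
2    20.0
3    30.0
dtype: float64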

Another solution is Series.update (note that update overwrites Col1 with every non-NA value from Col2, not only where Col1 is NaN, which is why all four rows change below):

df['Col1'].update(df['Col2'])
print(df)
   Col1  Col2
0   8.0     8
1   9.0     9
2   7.0     7
3  10.0    10
Glair answered 19/5, 2017 at 5:10 Comment(9)
Thanks. Did you mean for your else in your first lambda method to be Col1 or Col2?Bleb
Hmmm, I think it is Col2 - it means get the value of Col2 if the condition is True, else get the value of Col1.Glair
But I prefer the other solutions if you need to replace NaNs by another column.Glair
Your first two methods work flawlessly. Just out of curiosity, why do you think it's better to use Series.combine_first rather than a lambda function on the df?Bleb
Because it is a faster, vectorized function. With a small DataFrame (100 rows) it is no problem, but with 1M rows there is a huge difference (see the timing sketch after these comments).Glair
Thanks--appreciate your replies! Still learning. On my actual data, the lambda function was an order of magnitude slower (8s vs. 0.8s).Bleb
Thank you for the comment. apply is obviously used when there is no pandas function for the task and you need to write it yourself. Good luck!Glair
Why combine_first instead of fillna? combine_first works in this case because the two series have the same index, but in other cases, you could get a longer series returned since "Result index will be the union of the two indexes" (from docs).Lyonnesse
@Lyonnesse - Yes, we are working with the same DataFrame, so the indices are the same. Generally you are right, fillna should be used.Glair
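
A rough timing sketch of the difference discussed in these comments (the 1M-row frame is synthetic and the numbers you get depend on the machine; apply with axis=1 loops over rows in Python, while combine_first works on whole arrays):

import time
import numpy as np
import pandas as pd

n = 1_000_000
big = pd.DataFrame({'Col1': np.where(np.random.rand(n) < 0.5, np.nan, 1.0),
                    'Col2': np.random.rand(n)})

t0 = time.perf_counter()
big.apply(lambda x: x['Col2'] if pd.isnull(x['Col1']) else x['Col1'], axis=1)
t1 = time.perf_counter()
big['Col1'].combine_first(big['Col2'])
t2 = time.perf_counter()

print(f"apply:         {t1 - t0:.2f}s")   # typically seconds
print(f"combine_first: {t2 - t1:.2f}s")   # typically a small fraction of a second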

The correct solution to this problem is:

df['Col1'].fillna(df['Col2'], inplace=True)
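
If you prefer to avoid inplace (on newer pandas versions with copy-on-write enabled, this chained inplace call may warn or not modify df at all), the same fill can be written as a plain assignment:

df['Col1'] = df['Col1'].fillna(df['Col2'])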
Tenantry answered 19/5, 2017 at 4:48 Comment(2)
This doesn't seem to work on my large DataFrame, but does work in an example. Could this be because my actual data set has a different data type, so fillna won't work correctly? Both Col1 and Col2 in my actual set are dtype('O'), so it shouldn't be a problem.Bleb
Works for me with object data types also. What's the issue when you use the actual dataset?Tenantry
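
For reference on the dtype('O') question above: fillna does fill object-dtype columns, but only entries pandas actually treats as missing (None or NaN); placeholder strings such as 'nan' or 'n/a' are left alone, which is a common reason it appears not to work on real data. A minimal sketch with made-up data:

import numpy as np
import pandas as pd

df_obj = pd.DataFrame({'Col1': ['a', None, np.nan, 'nan'],
                       'Col2': ['w', 'x', 'y', 'z']}, dtype='object')
print(df_obj['Col1'].fillna(df_obj['Col2']).tolist())
['a', 'x', 'y', 'nan']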

You need to use np.isnan()

#import numpy as np
df2 = df.apply(lambda x: 2 if np.isnan(x['Col1']) else 1, axis=1)   

df2
Out[1307]: 
0    1
1    1
2    1
3    2
dtype: int64
Catenate answered 19/5, 2017 at 4:59 Comment(2)
I was trying to round non-NaN values, and this worked whilst x is np.NaN didn't: df.age.apply(lambda x: x if np.isnan(x) else round(x))Munroe
@Munroe You should be using Pandas's builtin functions, which handle NaN automatically: df['age'].round(). Plus they're much faster. Secondly, don't use is on numbers in general; see numpy NaN not always recognized (which sounds specific to NaN, but the answers explain the more general problem).Lyonnesse
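
A small sketch of that point (the 'age' values are made up): Series.round propagates NaN on its own, so no explicit check is needed:

import numpy as np
import pandas as pd

s = pd.Series([20.4, np.nan, 31.6], name='age')
print(s.round().tolist())
[20.0, nan, 32.0]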

With pandas 0.24.2, I use

df.apply(lambda x: x['col_name'] if x['col1'] is np.nan else another_expression, axis=1)

because pd.isnull() didn't work for me.

In my work, I found the following phenomenon.

This produces no results:

df['prop'] = df.apply(lambda x: (x['buynumpday'] / x['cnumpday']) if pd.isnull(x['cnumpday']) else np.nan, axis=1)

This produces results:

df['prop'] = df.apply(lambda x: (x['buynumpday'] / x['cnumpday']) if x['cnumpday'] is not np.nan else np.nan, axis=1)

So far, I still don't know the deeper reason, but from these experiences: for an object, use is np.nan() or pd.isna(); for a float, use np.isnan() or pd.isna().
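
A minimal sketch of the scalar behaviour behind this (as the comments below point out, checking identity with is np.nan is not reliable; pd.isna is the safest choice for both object and float values):

import numpy as np
import pandas as pd

print(pd.isna(np.nan))    # True  - works for float NaN
print(pd.isna(None))      # True  - and for object-dtype missing values
print(np.isnan(np.nan))   # True  - np.isnan accepts numeric input only
# np.isnan(None) raises TypeError, which is why it breaks on object columns.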

Gracielagracile answered 31/5, 2019 at 10:6 Comment(2)
For is np.nan(), you meant is np.nan, right? I'm not sure if you might have meant np.isnan().Lyonnesse
pd.isnull() wasn't working for you because you just made a typo: if pd.isnull(...) should have been if not pd.isnull(...). Regarding is np.nan, that's not guaranteed to work; see numpy NaN not always recognized as well as dataframe.apply(lambda x: x is np.nan) does not work. If your column is object dtype, that's why it happened to work, but again it's not guaranteed, e.g. with s = pd.Series([1, np.nan, float('nan')], dtype='object'), s.apply(lambda x: x is np.nan).values gives [False True False] while s.isna().values gives [False True True].Lyonnesse

You can also use the NumPy function where with the condition "Col1 is NaN":

import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': [1, 2, 3, np.NaN], 'Col2': [7, 8, 9, 10]})

df['Col1'] = np.where(pd.isna(df['Col1']), df['Col2'], df['Col1'])
Solicitude answered 14/10, 2023 at 19:52 Comment(6)
Why? You can just use .fillna instead. Even if you don't want to use .fillna, why use NumPy when you can use Pandas? df['Col1'].mask(pd.isna(df['Col1']), df['Col2'])Lyonnesse
Thank you for the correction. The phrase "equals NaN" can be corrected to "is NaN." The reason I use this approach is that it's a common way to replace one column's values with another based on a specific condition.Solicitude
Right, I know it's common, but why not use the simpler way?Lyonnesse
I suggest another approach to deal with the issue of filling missing values. In my practice, there are other types of missing values, not only 'n/a.' My suggestion is more universal and can handle various types of missing values or incorrect values.Solicitude
Can't Series.mask do that too?Lyonnesse
Yes, it can, but what about the running time? My test shows 132 µs ± 455 for df mask and 81.6 µs ± 267 for numpy where.Solicitude
