Python comparison ignoring nan

While nan == nan is always False, in many cases people want to treat them as equal, and this is enshrined in pandas.DataFrame.equals:

NaNs in the same location are considered equal.

Of course, I can write

import math

def equalp(x, y):
    return (x == y) or (math.isnan(x) and math.isnan(y))

However, this will fail on containers like [float("nan")], and isnan raises a TypeError on non-numbers (so the complexity increases).
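For illustration, here is a sketch of where the hand-rolled approach heads once containers are involved (nan_equal is just a name for this sketch; it covers only dicts, lists and scalars):

import math

def nan_equal(x, y):
    # Recurse into dicts and lists; treat NaN == NaN as equal at the leaves.
    if isinstance(x, dict) and isinstance(y, dict):
        return x.keys() == y.keys() and all(nan_equal(x[k], y[k]) for k in x)
    if isinstance(x, list) and isinstance(y, list):
        return len(x) == len(y) and all(nan_equal(a, b) for a, b in zip(x, y))
    if isinstance(x, float) and isinstance(y, float):
        return x == y or (math.isnan(x) and math.isnan(y))
    return x == y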

So, what do people do to compare complex Python objects which may contain nan?

PS. Motivation: when comparing two rows in a pandas DataFrame, I would convert them into dicts and compare dicts element-wise.

PPS. When I say "compare", I am thinking diff, not equalp.

Institutionalism answered 25/1, 2018 at 22:27 Comment(10)
If you're asking what people do... then the answer is, they usually don't. Having non-scalar/object columns is usually considered bad form, and introduces a lot of headaches you could otherwise avoid by flattening your data a bit. It's also a less-performant option.Mcleod
@cᴏʟᴅsᴘᴇᴇᴅ I think they mean when outside of pandas containers, like lists with float('nan') in them.Cropper
I think most people just accept that Python knows best and NaN != NaN. Or try to avoid having NaN altogether.Pearlpearla
Hmm, in that case, are your lists always integers or floats?Mcleod
Yeah, at this point, you might as well use something like NAN = object() then replace float('nan') with NANCropper
@Institutionalism why would you do this? "Motivation: when comparing two rows in a pandas DataFrame, I would convert them into dicts and compare dicts element-wise."Cropper
@sds: Like juanpa said, do you really need the dict (maybe for other operations)? There is also df.as_matrix() which would make things easier.Schiffman
@juanpa.arrivillaga: how would you compare two rows of length 400?Institutionalism
@Institutionalism df.iloc[1,:].equals(df.iloc[2,:])?Cropper
@juanpa.arrivillaga: okay, I got False. How do I get the list of columns where the rows are different?Institutionalism

Suppose you have a data-frame with nan values:

In [10]: df = pd.DataFrame(np.random.randint(0, 20, (10, 10)).astype(float), columns=["c%d"%d for d in range(10)])

In [10]: df.where(np.random.randint(0,2, df.shape).astype(bool), np.nan, inplace=True)

In [10]: df
Out[10]:
     c0    c1    c2    c3    c4    c5    c6    c7   c8    c9
0   NaN   6.0  14.0   NaN   5.0   NaN   2.0  12.0  3.0   7.0
1   NaN   6.0   5.0  17.0   NaN   NaN  13.0   NaN  NaN   NaN
2   NaN  17.0   NaN   8.0   6.0   NaN   NaN  13.0  NaN   NaN
3   3.0   NaN   NaN  15.0   NaN   8.0   3.0   NaN  3.0   NaN
4   7.0   8.0   7.0   NaN   9.0  19.0   NaN   0.0  NaN  11.0
5   NaN   NaN  14.0   2.0   NaN   NaN   0.0   NaN  NaN   8.0
6   3.0  13.0   NaN   NaN   NaN   NaN   NaN  12.0  3.0   NaN
7  13.0  14.0   NaN   5.0  13.0   NaN  18.0   6.0  NaN   5.0
8   3.0   9.0  14.0  19.0  11.0   NaN   NaN   NaN  NaN   5.0
9   3.0  17.0   NaN   NaN   0.0   NaN  11.0   NaN  NaN   0.0

And you want to compare rows, say row 0 and row 8. Then just use fillna and do a vectorized comparison:

In [12]: df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)
Out[12]:
c0     True
c1     True
c2    False
c3     True
c4     True
c5    False
c6     True
c7     True
c8     True
c9     True
dtype: bool

You can use the resulting boolean array to index into the columns, if you just want to know which columns are different:

In [14]: df.columns[df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)]
Out[14]: Index(['c0', 'c1', 'c3', 'c4', 'c6', 'c7', 'c8', 'c9'], dtype='object')
Cropper answered 26/1, 2018 at 3:5 Comment(2)
This gives you False (i.e. reports the cells as equal) for entries that were NaN in one row and 0 in the other.Wattle
@DiegoFMedina yes, totally. I was being too clever for my own good. This will only work if you know a good fill value to use (given the nature of your data). Alternatively, you can do something like row0 = df.iloc[0, :]; row8 = df.iloc[8, :]; then (row0 == row8) | (row0.isnull() & row8.isnull()) to find the columns that are equal, treating NaNs as equal.Cropper
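Spelled out, the NaN-aware comparison from that comment looks roughly like this (continuing with the df defined in the answer; diff_columns is just an illustrative name):

# A cell counts as equal if the values match or if both are NaN.
row0 = df.iloc[0, :]
row8 = df.iloc[8, :]
equal_mask = (row0 == row8) | (row0.isnull() & row8.isnull())

# Columns where the two rows genuinely differ, NaNs in the same place treated as equal:
diff_columns = df.columns[~equal_mask]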

I assume you have array data, or can at least convert your data to a numpy array?

One way is to mask all the NaNs using a numpy.ma masked array and then compare the arrays. So your starting situation would be something like this:

import numpy as np
import numpy.ma as ma

arr1 = ma.array([3, 4, 6, np.nan, 2])
arr2 = ma.array([3, 4, 6, np.nan, 2])

print(arr1 == arr2)
print(ma.all(arr1 == arr2))

>>> [ True  True  True False  True]
>>> False  # <-- you want this to show True

Solution:

arr1[np.isnan(arr1)] = ma.masked
arr2[np.isnan(arr2)] = ma.masked

print(arr1 == arr2)
print(ma.all(arr1 == arr2))

>>> [True True True -- True]
>>> True
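As an aside (not part of the original answer), ma.masked_invalid builds the same masked view in one step, without mutating the inputs:

# masked_invalid masks NaN (and inf) entries up front;
# ma.all then ignores the masked positions in the comparison.
m1 = ma.masked_invalid([3, 4, 6, np.nan, 2])
m2 = ma.masked_invalid([3, 4, 6, np.nan, 2])

print(ma.all(m1 == m2))

>>> True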
Schiffman answered 25/1, 2018 at 22:37 Comment(1)
this is not what I am looking for, please see motivation in PSInstitutionalism

Here's a function that recurses into a data structure replacing nan values with a unique string. I wrote this for a unit test that compares data structures that may contain nan.

It's only designed for data structures made of dict and list, but it's easy to see how to expand it.

from math import isnan
from uuid import uuid4
from typing import Union

# Unique sentinel string, so it cannot collide with a real value in the data.
NAN_REPLACEMENT = f"THIS_WAS_A_NAN{uuid4()}"

def replace_nans(data_structure: Union[dict, list]) -> Union[dict, list]:
    if isinstance(data_structure, dict):
        iterme = data_structure.items()
    elif isinstance(data_structure, list):
        iterme = enumerate(data_structure)
    else:
        raise ValueError(
            "replace_nans should only be called on structures made of dicts and lists"
        )

    # Replace NaN leaves in place and recurse into nested containers.
    for key, value in iterme:
        if isinstance(value, float) and isnan(value):
            data_structure[key] = NAN_REPLACEMENT
        elif isinstance(value, dict) or isinstance(value, list):
            data_structure[key] = replace_nans(data_structure[key])
    return data_structure
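
For instance, a hypothetical usage (not from the original answer) comparing two structures after the replacement:

# After replacement, plain == treats NaNs in matching positions as equal,
# because both sides carry the same sentinel string.
a = {"x": [1.0, float("nan")], "y": {"z": float("nan")}}
b = {"x": [1.0, float("nan")], "y": {"z": float("nan")}}

assert replace_nans(a) == replace_nans(b)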
Gainsborough answered 4/2, 2021 at 16:14 Comment(0)
