Python comparison ignoring nan

While nan == nan is always False, in many cases people want to treat them as equal, and this is enshrined in pandas.DataFrame.equals:

NaNs in the same location are considered equal.

Of course, I can write

import math

def equalp(x, y):
    return (x == y) or (math.isnan(x) and math.isnan(y))

However, this will fail on containers like [float("nan")], and isnan raises a TypeError on non-numbers (so the complexity increases).
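For illustration, here is a sketch of where the hand-rolled approach heads once containers are involved (nan_equal is just a name for this sketch; it covers only dicts, lists and scalars):

import math

def nan_equal(x, y):
    # Recurse into dicts and lists; treat NaN == NaN as equal at the leaves.
    if isinstance(x, dict) and isinstance(y, dict):
        return x.keys() == y.keys() and all(nan_equal(x[k], y[k]) for k in x)
    if isinstance(x, list) and isinstance(y, list):
        return len(x) == len(y) and all(nan_equal(a, b) for a, b in zip(x, y))
    if isinstance(x, float) and isinstance(y, float):
        return x == y or (math.isnan(x) and math.isnan(y))
    return x == y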

So, what do people do to compare complex Python objects which may contain nan?

PS. Motivation: when comparing two rows in a pandas DataFrame, I would convert them into dicts and compare dicts element-wise.

PPS. When I say "compare", I am thinking diff, not equalp.

Institutionalism answered 25/1, 2018 at 22:27 Comment(10)
If you're asking what people do... then the answer is, they usually don't. Having non-scalar/object columns is usually considered bad form, and introduces a lot of headaches you could otherwise avoid by flattening your data a bit. It's also a less-performant option.Mcleod
@cᴏʟᴅsᴘᴇᴇᴅ I think they mean when outside of pandas containers, like lists with float('nan') in them.Cropper
I think most people just accept that Python knows best and NaN != NaN. Or try to avoid having NaN altogether.Pearlpearla
Hmm, in that case, are your lists always integers or floats?Mcleod
Yeah, at this point, you might as well use something like NAN = object() then replace float('nan') with NANCropper
@Institutionalism why would you do this? "Motivation: when comparing two rows in a pandas DataFrame, I would convert them into dicts and compare dicts element-wise."Cropper
@sds: Like juanpa said, do you really need the dict (maybe for other operations)? There is also df.as_matrix() which would make things easier.Schiffman
@juanpa.arrivillaga: how would you compare two rows of length 400?Institutionalism
@Institutionalism df.iloc[1,:].equals(df.iloc[2,:])?Cropper
@juanpa.arrivillaga: okay, I got False. How do I get the list of columns where the rows are different?Institutionalism

Suppose you have a data-frame with nan values:

In [10]: df = pd.DataFrame(np.random.randint(0, 20, (10, 10)).astype(float), columns=["c%d"%d for d in range(10)])

In [10]: df.where(np.random.randint(0,2, df.shape).astype(bool), np.nan, inplace=True)

In [10]: df
Out[10]:
     c0    c1    c2    c3    c4    c5    c6    c7   c8    c9
0   NaN   6.0  14.0   NaN   5.0   NaN   2.0  12.0  3.0   7.0
1   NaN   6.0   5.0  17.0   NaN   NaN  13.0   NaN  NaN   NaN
2   NaN  17.0   NaN   8.0   6.0   NaN   NaN  13.0  NaN   NaN
3   3.0   NaN   NaN  15.0   NaN   8.0   3.0   NaN  3.0   NaN
4   7.0   8.0   7.0   NaN   9.0  19.0   NaN   0.0  NaN  11.0
5   NaN   NaN  14.0   2.0   NaN   NaN   0.0   NaN  NaN   8.0
6   3.0  13.0   NaN   NaN   NaN   NaN   NaN  12.0  3.0   NaN
7  13.0  14.0   NaN   5.0  13.0   NaN  18.0   6.0  NaN   5.0
8   3.0   9.0  14.0  19.0  11.0   NaN   NaN   NaN  NaN   5.0
9   3.0  17.0   NaN   NaN   0.0   NaN  11.0   NaN  NaN   0.0

And you want to compare rows, say row 0 and row 8. Then just use fillna and do a vectorized comparison:

In [12]: df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)
Out[12]:
c0     True
c1     True
c2    False
c3     True
c4     True
c5    False
c6     True
c7     True
c8     True
c9     True
dtype: bool

You can use the resulting boolean array to index into the columns, if you just want to know which columns are different:

In [14]: df.columns[df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)]
Out[14]: Index(['c0', 'c1', 'c3', 'c4', 'c6', 'c7', 'c8', 'c9'], dtype='object')
Cropper answered 26/1, 2018 at 3:5 Comment(2)
This gives you False (i.e. reports the cells as equal) for entries that were NaN in one row and 0 in the other.Wattle
@DiegoFMedina yes, totally. I was being too clever for my own good. This will only work if you know a good fill value to use (given the nature of your data). Alternatively, you can do something like row0 = df.iloc[0, :]; row8 = df.iloc[8, :]; then (row0 == row8) | (row0.isnull() & row8.isnull()) to find the columns that are equal, treating NaNs as equal.Cropper
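Spelled out, the NaN-aware comparison from that comment looks roughly like this (continuing with the df defined in the answer; diff_columns is just an illustrative name):

# A cell counts as equal if the values match or if both are NaN.
row0 = df.iloc[0, :]
row8 = df.iloc[8, :]
equal_mask = (row0 == row8) | (row0.isnull() & row8.isnull())

# Columns where the two rows genuinely differ, NaNs in the same place treated as equal:
diff_columns = df.columns[~equal_mask]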

I assume you have array data, or can at least convert your data to a numpy array?

One way is to mask all the NaNs using a numpy.ma masked array and then compare the arrays. So your starting situation would be something like this:

import numpy as np
import numpy.ma as ma

arr1 = ma.array([3, 4, 6, np.nan, 2])
arr2 = ma.array([3, 4, 6, np.nan, 2])

print(arr1 == arr2)
print(ma.all(arr1 == arr2))

>>> [ True  True  True False  True]
>>> False  # <-- you want this to show True

Solution:

arr1[np.isnan(arr1)] = ma.masked
arr2[np.isnan(arr2)] = ma.masked

print(arr1 == arr2)
print(ma.all(arr1 == arr2))

>>> [True True True -- True]
>>> True
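As an aside (not part of the original answer), ma.masked_invalid builds the same masked view in one step, without mutating the inputs:

# masked_invalid masks NaN (and inf) entries up front;
# ma.all then ignores the masked positions in the comparison.
m1 = ma.masked_invalid([3, 4, 6, np.nan, 2])
m2 = ma.masked_invalid([3, 4, 6, np.nan, 2])

print(ma.all(m1 == m2))

>>> True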
Schiffman answered 25/1, 2018 at 22:37 Comment(1)
this is not what I am looking for, please see motivation in PSInstitutionalism

Here's a function that recurses into a data structure replacing nan values with a unique string. I wrote this for a unit test that compares data structures that may contain nan.

It's only designed for data structures made of dict and list, but it's easy to see how to expand it.

from math import isnan
from uuid import uuid4
from typing import Union

# Unique sentinel string, so it cannot collide with a real value in the data.
NAN_REPLACEMENT = f"THIS_WAS_A_NAN{uuid4()}"

def replace_nans(data_structure: Union[dict, list]) -> Union[dict, list]:
    if isinstance(data_structure, dict):
        iterme = data_structure.items()
    elif isinstance(data_structure, list):
        iterme = enumerate(data_structure)
    else:
        raise ValueError(
            "replace_nans should only be called on structures made of dicts and lists"
        )

    # Replace NaN leaves in place and recurse into nested containers.
    for key, value in iterme:
        if isinstance(value, float) and isnan(value):
            data_structure[key] = NAN_REPLACEMENT
        elif isinstance(value, dict) or isinstance(value, list):
            data_structure[key] = replace_nans(data_structure[key])
    return data_structure
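
For instance, a hypothetical usage (not from the original answer) comparing two structures after the replacement:

# After replacement, plain == treats NaNs in matching positions as equal,
# because both sides carry the same sentinel string.
a = {"x": [1.0, float("nan")], "y": {"z": float("nan")}}
b = {"x": [1.0, float("nan")], "y": {"z": float("nan")}}

assert replace_nans(a) == replace_nans(b)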
Gainsborough answered 4/2, 2021 at 16:14 Comment(0)
