Pandas DataFrames with NaNs equality comparison
Asked Answered
B

7

78

In the context of unit testing some functions, I'm trying to establish the equality of 2 DataFrames using python pandas:

ipdb> expect
                            1   2
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df
identifier                  1   2
timestamp
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df[1][0]
nan

ipdb> df[1][0], expect[1][0]
(nan, nan)

ipdb> df[1][0] == expect[1][0]
False

ipdb> df[1][1] == expect[1][1]
True

ipdb> type(df[1][0])
<type 'numpy.float64'>

ipdb> type(expect[1][0])
<type 'numpy.float64'>

ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])

ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False

Given that I'm trying to test the entire of expect against the entire of df, including NaN positions, what am I doing wrong?

What is the simplest way to compare equality of Series/DataFrames including NaNs?

Baribaric answered 11/10, 2013 at 16:3 Comment(0)
S
63

You can use assert_frame_equals with check_names=False (so as not to check the index/columns names), which will raise if they are not equal:

In [11]: from pandas.testing import assert_frame_equal

In [12]: assert_frame_equal(df, expected, check_names=False)

You can wrap this in a function with something like:

try:
    assert_frame_equal(df, expected, check_names=False)
    return True
except AssertionError:
    return False

In more recent pandas this functionality has been added as .equals:

df.equals(expected)
Swab answered 11/10, 2013 at 16:17 Comment(8)
I accepted this one rather than the others (which are useful!) since the real crux of my question was about comparing DataFrames - I've edited it a little to make this more clear.Baribaric
FYI, this is the slowest method as its done recursively and not vectorized (we use it for testing)Lovelovebird
@StevePike Not sure I understand this: "...since the real crux of my question was about comparing DataFrames". All solutions present at the moment are showing ways to compare DataFrames. Validating them is a slightly different question. Maybe I'm being a bit pedantic.Britton
This gives a FutureWarning when the objects are not equal: print assert_frame_equal(X, df) gives output None and warning /usr/local/lib/python2.7/site-packages/numpy/core/numeric.py:2367: FutureWarning: numpy equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (is)) and will change. return bool(asarray(a1 == a2).all())Loxodromic
@Loxodromic Interesting, I expect this will be fixed in pandas 0.16. (out soon)Swab
@Loxodromic which version of numpy/pandas are you using? I just updated to latest numpy (1.9.2) and do not see this message. Pandas 0.15.2.Swab
It appears that check_names=False ignores neither column nor index differences. Even with it set I get AssertionError: DataFrame.index are different and when I set the same index I still get AssertionError: DataFrame.columns are different. Only if I set the same index and columns then then everything is ok. Not sure what check_names is supposed to control.Panter
@river_jones check_names compares the df.index.names and df.columns.names attributesSwab
B
52

One of the properties of NaN is that NaN != NaN is True.

Check out this answer for a nice way to do this using numexpr.

(a == b) | ((a != a) & (b != b))

says this (in pseudocode):

a == b or (isnan(a) and isnan(b))

So, either a equals b, or both a and b are NaN.

If you have small frames then assert_frame_equal will be okay. However, for large frames (10M rows) assert_frame_equal is pretty much useless. I had to interrupt it, it was taking so long.

In [1]: df = DataFrame(rand(1e7, 15))

In [2]: df = df[df > 0.5]

In [3]: df2 = df.copy()

In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)

In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop

timeit of the (presumably) desired single bool indicating whether the two DataFrames are equal:

In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop
Britton answered 11/10, 2013 at 16:16 Comment(7)
presumably after you need to .all().all() ?Swab
Much faster to use np.all() since this comparison guarantees a boolean result. It adds another ~100 ms to the running time. If you use the all() methods of pandas objects there's about a 3.5x perf hit.Britton
"Guarantees" meaning I'm assuming that someone is not using a subclass that could return NaN from __eq__/__ne__.Britton
you can do ((df == df2) | ((df != df) & (df2 != df2))).values.ravel().all()Lovelovebird
I'll add that to the timeit.Britton
BTW, no need to ravel if no axis is passed.Britton
no because best and worst case are equal, all values are checked in either caseBritton
L
11

Like @PhillipCloud answer, but more written out

In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])

In [27]: df2 = df1.copy()

They really are equivalent

In [28]: result = df1 == df2

In [29]: result[pd.isnull(df1) == pd.isnull(df2)] = True

In [30]: result
Out[30]: 
      0     1
0  True  True
1  True  True

A nan in df2 that doesn't exist in df1

In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])

In [32]: result = df1 == df2

In [33]: result[pd.isnull(df1) == pd.isnull(df2)] = True

In [34]: result
Out[34]: 
       0     1
0   True  True
1  False  True

You can also fill with a value you know not to be in the frame

In [38]: df1.fillna(-999) == df1.fillna(-999)
Out[38]: 
      0     1
0  True  True
1  True  True
Lovelovebird answered 11/10, 2013 at 16:19 Comment(2)
If you have two non-equal values I think this will give True since they are both nonnull :sSwab
@Andy Hayden is correct: result[pd.isnull(df1) == pd.isnull(df2)] = True should be result[pd.isnull(df1) & pd.isnull(df2)] = TrueSwitch
E
6

Any equality comparison using == with np.NaN is False, even np.NaN == np.NaN is False.

Simply, df1.fillna('NULL') == df2.fillna('NULL'), if 'NULL' is not a value in the original data.

To be safe, do the following:

Example a) Compare two dataframes with NaN values

bools = (df1 == df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = True
assert bools.all().all()

Example b) Filter rows in df1 that do not match with df2

bools = (df1 != df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = False
df_outlier = df1[bools.all(axis=1)]

(Note: this is wrong - bools[pd.isnull(df1) == pd.isnull(df2)] = False)

Easley answered 20/8, 2015 at 16:32 Comment(2)
Your answer would look better if you add a bit more text to your answer, maybe explaining what you are doing.Heterochromatin
This strategy is faster than the accepted answer for small cases, but much slower for large onesZipangu
W
4
df.fillna(0) == df2.fillna(0)

You can use fillna(). Documenation here.

from pandas import DataFrame

# create a dataframe with NaNs
df = DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df2 = df

# comparison fails!
print df == df2

# all is well 
print df.fillna(0) == df2.fillna(0)
Wood answered 1/10, 2015 at 18:8 Comment(1)
0 feels like a dangerous thing to fill with in general, but filling with some unique string or identifier sounds like a good strategy. This strategy is faster than the accepted answer for small cases, but much slower for large onesZipangu
C
3

Ever since pandas 1.1.0 there is df.compare(expect), which provides detailed results if any differences occur.

Circumcision answered 29/6, 2023 at 9:15 Comment(0)
L
0

Michel de Ruiter's answer should be the current answer. Posting an answer because I don't have enough reputation to add a comment to it. Use the dataframe compare method.

df.compare(expect) will return an empty dataframe (all axis are 0-length) if df and expect are equal, including NaN locations. You can use the empty dataframe property on the result.

assert(df.compare(expect).empty))
Latitudinarian answered 21/7, 2023 at 18:5 Comment(1)
This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From ReviewMaddox

© 2022 - 2024 — McMap. All rights reserved.