Pandas DataFrames with NaNs equality comparison

Asked 11/10, 2013 at 16:3 Answered 21/7, 2023 at 18:5

In the context of unit testing some functions, I'm trying to establish the equality of 2 DataFrames using python pandas:

ipdb> expect
                            1   2
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df
identifier                  1   2
timestamp
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df[1][0]
nan

ipdb> df[1][0], expect[1][0]
(nan, nan)

ipdb> df[1][0] == expect[1][0]
False

ipdb> df[1][1] == expect[1][1]
True

ipdb> type(df[1][0])
<type 'numpy.float64'>

ipdb> type(expect[1][0])
<type 'numpy.float64'>

ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])

ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False

Given that I'm trying to test the entire of expect against the entire of df, including NaN positions, what am I doing wrong?

What is the simplest way to compare equality of Series/DataFrames including NaNs?

Baribaric answered 11/10, 2013 at 16:3 Comment(0)

You can use assert_frame_equals with check_names=False (so as not to check the index/columns names), which will raise if they are not equal:

In [11]: from pandas.testing import assert_frame_equal

In [12]: assert_frame_equal(df, expected, check_names=False)

You can wrap this in a function with something like:

try:
    assert_frame_equal(df, expected, check_names=False)
    return True
except AssertionError:
    return False

In more recent pandas this functionality has been added as .equals:

df.equals(expected)

Swab answered 11/10, 2013 at 16:17 Comment(8)

I accepted this one rather than the others (which are useful!) since the real crux of my question was about comparing DataFrames - I've edited it a little to make this more clear. – Baribaric 11/10, 2013 at 16:28

FYI, this is the slowest method as its done recursively and not vectorized (we use it for testing) – Lovelovebird 11/10, 2013 at 16:29

@StevePike Not sure I understand this: "...since the real crux of my question was about comparing DataFrames". All solutions present at the moment are showing ways to compare DataFrames. Validating them is a slightly different question. Maybe I'm being a bit pedantic. – Britton 11/10, 2013 at 16:37

This gives a FutureWarning when the objects are not equal: print assert_frame_equal(X, df) gives output None and warning

/usr/local/lib/python2.7/site-packages/numpy/core/numeric.py:2367: FutureWarning: numpy equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (

is)) and will change. return bool(asarray(a1 == a2).all()) – Loxodromic 12/3, 2015 at 15:12

@Loxodromic Interesting, I expect this will be fixed in pandas 0.16. (out soon) – Swab 14/3, 2015 at 0:59

@Loxodromic which version of numpy/pandas are you using? I just updated to latest numpy (1.9.2) and do not see this message. Pandas 0.15.2. – Swab 14/3, 2015 at 8:39

It appears that check_names=False ignores neither column nor index differences. Even with it set I get AssertionError: DataFrame.index are different and when I set the same index I still get AssertionError: DataFrame.columns are different. Only if I set the same index and columns then then everything is ok. Not sure what check_names is supposed to control. – Panter 30/10, 2015 at 22:27

@river_jones check_names compares the df.index.names and df.columns.names attributes – Swab 30/10, 2015 at 22:49

One of the properties of NaN is that NaN != NaN is True.

Check out this answer for a nice way to do this using numexpr.

(a == b) | ((a != a) & (b != b))

says this (in pseudocode):

a == b or (isnan(a) and isnan(b))

So, either a equals b, or both a and b are NaN.

If you have small frames then assert_frame_equal will be okay. However, for large frames (10M rows) assert_frame_equal is pretty much useless. I had to interrupt it, it was taking so long.

In [1]: df = DataFrame(rand(1e7, 15))

In [2]: df = df[df > 0.5]

In [3]: df2 = df.copy()

In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)

In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop

timeit of the (presumably) desired single bool indicating whether the two DataFrames are equal:

In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop

Britton answered 11/10, 2013 at 16:16 Comment(7)

presumably after you need to .all().all() ? – Swab 11/10, 2013 at 16:36

Much faster to use np.all() since this comparison guarantees a boolean result. It adds another ~100 ms to the running time. If you use the all() methods of pandas objects there's about a 3.5x perf hit. – Britton 11/10, 2013 at 16:40

"Guarantees" meaning I'm assuming that someone is not using a subclass that could return NaN from __eq__/__ne__. – Britton 11/10, 2013 at 16:41

you can do ((df == df2) | ((df != df) & (df2 != df2))).values.ravel().all() – Lovelovebird 11/10, 2013 at 16:43

I'll add that to the timeit. – Britton 11/10, 2013 at 16:44

BTW, no need to ravel if no axis is passed. – Britton 11/10, 2013 at 16:44

no because best and worst case are equal, all values are checked in either case – Britton 11/10, 2013 at 16:46

Like @PhillipCloud answer, but more written out

In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])

In [27]: df2 = df1.copy()

They really are equivalent

In [28]: result = df1 == df2

In [29]: result[pd.isnull(df1) == pd.isnull(df2)] = True

In [30]: result
Out[30]: 
      0     1
0  True  True
1  True  True

A nan in df2 that doesn't exist in df1

In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])

In [32]: result = df1 == df2

In [33]: result[pd.isnull(df1) == pd.isnull(df2)] = True

In [34]: result
Out[34]: 
       0     1
0   True  True
1  False  True

You can also fill with a value you know not to be in the frame

In [38]: df1.fillna(-999) == df1.fillna(-999)
Out[38]: 
      0     1
0  True  True
1  True  True

Lovelovebird answered 11/10, 2013 at 16:19 Comment(2)

If you have two non-equal values I think this will give True since they are both nonnull :s – Swab 11/10, 2013 at 16:38

@Andy Hayden is correct: result[pd.isnull(df1) == pd.isnull(df2)] = True should be result[pd.isnull(df1) & pd.isnull(df2)] = True – Switch 4/12, 2020 at 0:12

Any equality comparison using == with np.NaN is False, even np.NaN == np.NaN is False.

Simply, df1.fillna('NULL') == df2.fillna('NULL'), if 'NULL' is not a value in the original data.

To be safe, do the following:

Example a) Compare two dataframes with NaN values

bools = (df1 == df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = True
assert bools.all().all()

Example b) Filter rows in df1 that do not match with df2

bools = (df1 != df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = False
df_outlier = df1[bools.all(axis=1)]

(Note: this is wrong - bools[pd.isnull(df1) == pd.isnull(df2)] = False)

Easley answered 20/8, 2015 at 16:32 Comment(2)

Your answer would look better if you add a bit more text to your answer, maybe explaining what you are doing. – Heterochromatin 20/8, 2015 at 19:4

This strategy is faster than the accepted answer for small cases, but much slower for large ones – Zipangu 19/5, 2020 at 22:45

df.fillna(0) == df2.fillna(0)

You can use fillna(). Documenation here.

from pandas import DataFrame

# create a dataframe with NaNs
df = DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df2 = df

# comparison fails!
print df == df2

# all is well 
print df.fillna(0) == df2.fillna(0)

Wood answered 1/10, 2015 at 18:8 Comment(1)

0 feels like a dangerous thing to fill with in general, but filling with some unique string or identifier sounds like a good strategy. This strategy is faster than the accepted answer for small cases, but much slower for large ones – Zipangu 19/5, 2020 at 22:45

Ever since pandas 1.1.0 there is df.compare(expect), which provides detailed results if any differences occur.

Circumcision answered 29/6, 2023 at 9:15 Comment(0)

Michel de Ruiter's answer should be the current answer. Posting an answer because I don't have enough reputation to add a comment to it. Use the dataframe compare method.

df.compare(expect) will return an empty dataframe (all axis are 0-length) if df and expect are equal, including NaN locations. You can use the empty dataframe property on the result.

assert(df.compare(expect).empty))

Latitudinarian answered 21/7, 2023 at 18:5 Comment(1)

This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review – Maddox 25/7, 2023 at 9:39

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags