NaNs as key in dictionaries

Can anyone explain the following behaviour to me?

>>> import numpy as np
>>> {np.nan: 5}[np.nan]
5
>>> {np.float64(np.nan): 5}[np.float64(np.nan)]
KeyError: nan

Why does it work in the first case, but not in the second? Additionally, I found that the following DOES work:

>>> a = np.float64(np.nan)
>>> {a: 5}[a]
5
Juror answered 22/6, 2011 at 14:47 Comment(1)
This will always be true: float('nan') != float('nan') - Bolognese

The problem here is that NaN is not equal to itself, as defined in the IEEE standard for floating point numbers:

>>> float("nan") == float("nan")
False

When a dictionary looks up a key, it roughly does this:

  1. Compute the hash of the key to be looked up.

  2. For each key in the dict with the same hash, check if it matches the key to be looked up. This check consists of

    a. Checking for object identity: If the key in the dictionary and the key to be looked up are the same object as indicated by the is operator, the key was found.

    b. If the first check failed, check for equality using the == operator (that is, the key's __eq__ method).
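The per-key check in step 2 can be sketched in plain Python. This is only a rough model of the behaviour described above, not CPython's actual implementation, and `keys_match` is a hypothetical helper name:

```python
def keys_match(stored_key, lookup_key):
    """Rough model of how a dict decides whether two same-hash keys match."""
    if stored_key is lookup_key:       # step 2a: identity shortcut
        return True
    return stored_key == lookup_key    # step 2b: fall back to equality


nan = float("nan")

# The same NaN object matches itself through the identity shortcut.
same_object = keys_match(nan, nan)

# Two separately created NaNs fail both the identity and the equality check.
fresh_objects = keys_match(float("nan"), float("nan"))

print(same_object, fresh_objects)
```

Ordinary keys such as `1` and `1.0` still match through step 2b, since they compare equal.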

The first example succeeds, since np.nan and np.nan are the same object, so it does not matter that they don't compare equal:

>>> np.nan is np.nan
True

In the second case, np.float64(np.nan) and np.float64(np.nan) are not the same object -- the two constructor calls create two distinct objects:

>>> np.float64(np.nan) is np.float64(np.nan)
False

Since the objects also do not compare equal, the dictionary concludes the key is not found and raises a KeyError.

You can even do this:

>>> a = float("nan")
>>> b = float("nan")
>>> {a: 1, b: 2}
{nan: 1, nan: 2}
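Continuing that example, a small sketch (using only built-in floats, so NumPy is not required) suggests that each entry stays reachable only through the exact object it was stored under:

```python
a = float("nan")
b = float("nan")
d = {a: 1, b: 2}

size = len(d)                  # the two NaNs occupy separate entries
via_a = d[a]                   # found through the identity check
via_b = d[b]
found_fresh = float("nan") in d  # a brand-new NaN matches neither entry

print(size, via_a, via_b, found_fresh)
```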

In conclusion, it seems a saner idea to avoid NaN as a dictionary key.
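If NaN keys cannot be avoided, one possible workaround is to route every key through a function that maps any float NaN to a single shared object, so the identity check always succeeds. This is a minimal sketch under that assumption; `canonical_key` and `_NAN_KEY` are hypothetical names, not part of Python or NumPy:

```python
import math

# Hypothetical module-level NaN object shared by all lookups and insertions.
_NAN_KEY = float("nan")


def canonical_key(key):
    """Return the shared NaN object for any float NaN, else the key unchanged."""
    if isinstance(key, float) and math.isnan(key):
        return _NAN_KEY
    return key


d = {}
d[canonical_key(float("nan"))] = "missing"

# A NaN produced elsewhere is mapped to the same object, so the lookup works.
value = d[canonical_key(float("inf") - float("inf"))]
print(value)
```

Every insertion and lookup has to go through the helper for this to hold; a raw `d[float("nan")]` would still raise a KeyError.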

Flita answered 22/6, 2011 at 14:57 Comment(7)
The last statement deserves much more emphasis. - Warmblooded
Is there a guarantee that all float('nan') have the same memory location, i.e., that float('nan') is a singleton? Without it, even using plain float('nan') is a bad idea. Same question about np.nan. - Interjoin
@max: Each call to float('nan') produces a new float instance, just as each call to float(1) creates a new float instance. This isn't by itself a bad thing. np.nan is a global name in the NumPy module, and will point to the same object as long as you don't reassign it, so under normal circumstances np.nan is a single value. (I wouldn't call it a singleton, since that term is used for a class that only allows a single instance, like NoneType.) - Flita
@SvenMarnach Ah, that makes sense. So basically, it's only safe to use nan with a dict if both the storage and the lookup are done with np.nan. The "wild" nans, like the ones created by float('nan') or by float('inf') - float('inf'), will not work well as dictionary keys. - Interjoin
I just realised this from a related question. Playing around with it, I found that the following workaround removes the key. Say your dict is d; then, to remove the NaN key: for k in d.keys(): if k != k: del d[k]; break. Any idea why this removes the key while d[float('nan')] fails directly? NaNs with the same id also do not equal themselves. - Drawers
@Drawers The reason is the one given in this answer, specifically steps 2.a and 2.b. A NaN can only be looked up when you provide the same object again, since Python then treats the keys as matching based on object identity. If you create a new NaN, the object identity check fails, so an __eq__ comparison is triggered, which returns False, so the keys are not considered identical. - Flita
Right, makes sense now. Thanks a ton. - Drawers
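The removal workaround discussed in the comments above can be sketched as follows. It iterates over a snapshot of the keys and uses math.isnan rather than k != k; the dictionary contents are made up for illustration:

```python
import math

d = {float("nan"): "x", 1.0: "y"}

# Iterate over a snapshot of the keys so the dict can be modified in the loop.
for k in list(d):
    if isinstance(k, float) and math.isnan(k):
        # k is the stored key object itself, so the identity check succeeds
        # and the entry is found even though k != k.
        del d[k]

print(d)
```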

Please note this is not the case anymore in Python 3.6:

>>> d = float("nan")  # a NaN object
>>> d
nan
>>> c = {"a": 3, d: 4}
>>> c["a"]
3
>>> c[d]
4

In this example, c is a dictionary that contains the value 3 associated with the key "a" and the value 4 associated with the key NaN.

The way Python 3.6 looks up keys in a dictionary internally has changed. Now, the first thing it does is compare the two pointers that represent the underlying variables. If they point to the same object, then the two objects are considered the same (well, technically we are comparing one object with itself). Otherwise, their hashes are compared; if the hashes differ, the two objects are considered different. If the equality of the objects has still not been decided at this point, their comparators are called (they are "manually" compared, so to speak).

This means that although IEEE 754 specifies that NaN isn't equal to itself:

>>> d == d
False

the dictionary lookup compares the underlying pointers of the variables first. Because both point to the same NaN object, the dictionary returns 4.
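The identity shortcut can be observed with a small sketch. `Loud` is a hypothetical key class invented for this illustration: like NaN, it never compares equal to anything, and it counts how often its __eq__ method is invoked:

```python
class Loud:
    """Hypothetical key type that records every call to __eq__."""

    eq_calls = 0

    def __hash__(self):
        return 42              # every instance hashes the same

    def __eq__(self, other):
        Loud.eq_calls += 1
        return False           # never equal, like NaN


k = Loud()
d = {k: "found"}

value = d[k]                   # same object: resolved without calling __eq__
calls_after_same = Loud.eq_calls

missing = Loud() not in d      # different object, same hash: __eq__ is consulted
calls_after_other = Loud.eq_calls

print(value, calls_after_same, missing, calls_after_other)
```

Looking up the stored object leaves the counter at zero, while looking up a second instance with the same hash forces the equality fallback.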

Note also that not all NaN objects are exactly the same:

>>> e = float("nan")
>>> e == d
False
>>> c[e]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: nan
>>> c[d]
4

To summarize: dictionaries prioritize performance by first checking whether the underlying objects are the same. Hash comparison and equality comparison serve as fallbacks. Moreover, not every NaN represents the same underlying object.

One has to be very careful when using NaNs as dictionary keys. Adding such a key makes the associated value impossible to reach unless you rely on the property described here. This property may change in the future (somewhat unlikely, but possible). Proceed with care.

Valorize answered 13/2, 2018 at 15:33 Comment(1)
This seems to fail again on Python 3.8. - Drawers
