Python: sort function breaks in the presence of nan
Asked Answered
N

8

53

sorted([2, float('nan'), 1]) returns [2, nan, 1]

(At least on Activestate Python 3.1 implementation.)

I understand nan is a weird object, so I wouldn't be surprised if it shows up in random places in the sort result. But it also messes up the sort for the non-nan numbers in the container, which is really unexpected.

I asked a related question about max, and based on that I understand why sort works like this. But should this be considered a bug?

Documentation just says "Return a new sorted list [...]" without specifying any details.

EDIT: I now agree that this isn't in violation of the IEEE standard. However, it's a bug from any common sense viewpoint, I think. Even Microsoft, which isn't known to admit their mistakes often, has recognized this one as a bug, and fixed it in the latest version: http://connect.microsoft.com/VisualStudio/feedback/details/363379/bug-in-list-double-sort-in-list-which-contains-double-nan.

Anyway, I ended up following @khachik's answer:

sorted(list_, key = lambda x : float('-inf') if math.isnan(x) else x)

I suspect it results in a performance hit compared to the language doing that by default, but at least it works (barring any bugs that I introduced).

Nataline answered 21/11, 2010 at 20:5 Comment(5)
Not a Number(NAN) is invalid input for numerical sort, or anything expecting numbers; so I wouldn't consider it a bug.Thesaurus
@Frayser: that's not quite correct. Is it invalid in Python? No because Python doesn't raise exceptions. Is it invalid in IEEE754? No because it provides for very specific behavior (for quiet nan at least). Is it invalid in some other standard?Nataline
While it's understandable that "nan" will end up randomly somewhere in the resulting list, what's harder to understand is that it apparently is correct behavior to incorrectly order the numerical values still in the last: sorted([1.0, 2.0, 3.0, float('nan'), 4.0, 3.0, 2.0, 1.0]) => [1.0, 2.0, 3.0, nan, 1.0, 2.0, 3.0, 4.0]. See bugs.python.org/issue12286.Unconditional
"But it also messes up the sort for the non-nan numbers in the container, which is really unexpected." - exactly - but, thinking the problem was with .sort(), I only got to this Q&A once I had already figured it out :\ Thanks for recording it though!Tempest
@Unconditional As of 2019 that issue thread is closed :(Tempest
T
18

The previous answers are useful, but perhaps not clear regarding the root of the problem.

In any language, sort applies a given ordering, defined by a comparison function or in some other way, over the domain of the input values. For example, less-than, a.k.a. operator <, could be used throughout if and only if less than defines a suitable ordering over the input values.

But this is specifically NOT true for floating point values and less-than: "NaN is unordered: it is not equal to, greater than, or less than anything, including itself." (Clear prose from GNU C manual, but applies to all modern IEEE754 based floating point)

So the possible solutions are:

  1. remove the NaNs first, making the input domain well defined via < (or the other sorting function being used)
  2. define a custom comparison function (a.k.a. predicate) that does define an ordering for NaN, such as less than any number, or greater than any number.

Either approach can be used, in any language.

Practically, considering python, I would prefer to remove the NaNs if you either don't care much about fastest performance or if removing NaNs is a desired behavior in context.

Otherwise you could use a suitable predicate function via "cmp" in older python versions, or via this and functools.cmp_to_key(). The latter is a bit more awkward, naturally, than removing the NaNs first. And care will be required to avoid worse performance, when defining this predicate function.

Teide answered 23/8, 2011 at 17:35 Comment(5)
IEEE 754 requires that max(NaN, 1) returns 1. It would be nice if Python followed the standard, but it doesn't. If it follows its own rules, it could at least have some reasonable rules, rather than random unstable behavior.Nataline
To clarify, I do agree with you that float('nan') < 1 or float('nan') >= 1 should return False. It appears that an exception has been made in the newest IEEE standard (IEEE 754 = IEEE 754-2008) for functions minimum and maximum (which must return the number) but not for sort or regular comparison.Nataline
cmp_to_key is a pretty roundabout soloution. All you really need is a key function that replaces NaN with something else (e.g. an infinity or a custom object that compares less than anything).Mariettamariette
"define a custom comparison function (a.k.a. predicate) that does define an ordering for NaN, such as less than any number, or greater than any number." is insufficient as it does not define the case of comparing 2 NANs. Since there are multiple encodings for NANs, some consistent <,==,> compare is needed for 2 NANs. Perhaps a memory bit compare in this sub-case.Solanum
As far as I can tell, this answer doesn't explain why the non-nan numbers are randomly placed in the supposedly sorted list. I understand the problem with NaNs, and I don't care where they land. I do care about the relative position of 1 and 2 in a "sorted" list, though.Belaud
F
10

I'm not sure about the bug, but the workaround may be the following:

sorted(
    (2, 1, float('nan')),
    lambda x,y: x is float('nan') and -1 
                or (y is float('nan') and 1
                or cmp(x,y)))

which results in:

('nan', 1, 2)

Or remove nans before sorting or anything else.

Fussy answered 21/11, 2010 at 20:12 Comment(2)
I'll rewrite this for Python 3, and handle cases where nan is numpy.nan.Nataline
I suspect this fails when list has 2 NANs. Many sort routines fails when cmp(n1, n2) is -1 and cmp(n2, n1) is also -1.Solanum
D
8

The problem is that there's no correct order if the list contains a NAN, since a sequence a1, a2, a3, ..., an is sorted if a1 <= a2 <= a3 <= ... <= an. If any of these a values is a NAN then the sorted property breaks, since for all a, a <= NAN and NAN <= a are both false.

Destinee answered 21/11, 2010 at 21:46 Comment(1)
IEEE defines two different orders: a partial order where NaNs are incomparable and positive and negative zeros are equal; and a total linear order that allows comparing any two floating point values, including NaNs with different payloads. It'd be more practical if python sorted floating point values using the later order.Ardell
S
8

Assuming you want to keep the NaNs and order them as the lowest "values", here is a workaround working both with non-unique nan, unique numpy nan, numerical and non numerical objects:

def is_nan(x):
    return (x is np.nan or x != x)

list_ = [2, float('nan'), 'z', 1, 'a', np.nan, 4, float('nan')]
sorted(list_, key = lambda x : float('-inf') if is_nan(x) else x)
# [nan, nan, nan, 1, 2, 4, 'a', 'z']
Serica answered 24/5, 2017 at 9:25 Comment(2)
I like this answer. I can't understand why nan cannot be defined either as -inf or inf. I understand mathematically how one cannot compare 0 and 1/0 for instance, but that shouldn't stand in the way of a sensible language construct to handle this.Lumpish
@Lumpish If nan is like -inf, then a list as [nan, -inf, nan, -inf] would be considered sorted.Solanum
S
4

IEEE754 is the standard that defines floating point operations in this instance. This standard defines the compare operation of operands, at least one of which is a NaN, to be an error. Hence, this is not a bug. You need to deal with the NaNs before operating on your array.

Sumptuary answered 21/11, 2010 at 20:10 Comment(13)
-1 Python doesn't follow IEEE754, which requires that there are two NaNs: signalling and non-signalling, and two comparison operators: signalling and non-signalling. Furthermore, IEEE754-2008 specifically requires that max return the number when compared to nan.Nataline
If it is a signaling NaN (sNaN) then an exception will be raised by the hardware. For a quiet NaN (qNaN) the hardware won't raise an exception and it would be far too burdonsome to expect every library routine that dealt with floating point values to check for qNaNs.Sumptuary
If you are running CPython on a machine whose FP hardware is based on IEEE754 then that's what you will get. Also, in what sense does IEEE754 define max?Sumptuary
Sorry, I accidentally overwrote my original comment. It was asking "Why wouldn't Python raise an exception then", and I agree with your answer to that question. I just disagree that IEEE754 is relevant in this case at all.Nataline
It's relevant because the hardware on which you are running this code implements IEEE754 floating point, I would guess. Or do you have some more information that you care to share? If you want decent performance then you simply need to pass all floating point ops to the hardware. If the Python interpreter/libraries/extensions were to check for NaNs in a systematic way then the performance hit would be terrible.Sumptuary
The Python documentation has this to say about IEEE754: "Almost all machines today (July 2010) use IEEE-754 floating point arithmetic, and almost all platforms map Python floats to IEEE-754 “double precision”." Also, thanks a lot for the down votes. Just because you don't like the answer doesn't mean you should shoot the messenger!! ;-)Sumptuary
@David Heffernan: I can't find the reference, but in reading about it, it seems that it just says how max should work with quiet NaNs.Nataline
@Nataline we can argue all we like, but it is what it is and you'll just have to pre-process the array and check for the NaNs - if you don't like how it's done then you'll have to take it up with Guido!!!Sumptuary
@David Heffernan: if you edit a question to clarify that Python doesn't follow IEEE 754, I would remove my downvote. I just don't see how it is the standard relevant given that it's violated by pretty much every math function in Python that contains nan.Nataline
I'll double check it later and get back to you. Sorry in advance if I'm wrong.Nataline
CPython has done a bit of work on this sort of thing in more recent releases, e.g. 2.6. Have a read of the file header comment in the CPython source code for mathmodule.c. Quoting, "In general, on an IEEE-754 platform the aim is to follow the C99 standard, including Annex 'F', whenever possible..."Iodoform
@DavidHeffernan I now agree with you that IEEE doesn't specify anything about the sort behavior, so Python is free to as it will. (Although min and max functions do strictly speaking violate the standard.) I'll remove the downvote once you make any edit to the question (otherwise the system won't let me).Nataline
I see no standard violations here.Sumptuary
P
2

To recap the problems:

NaN

This always returns False for every comparison, so it stays where it is in the list:

>>> sorted([float('nan'), 0])
[nan, 0]
>>> sorted([0, float('nan')])
[0, nan]

-0.0

This is == to 0.0, but has a different repr, a different json representation, and slightly different numerical properties. It has the same problem that the positive and negative zeros will stay in the same order they were in the original list:

>>> sorted([0.0, -0.0])
[0.0, -0.0]
>>> sorted([-0.0, 0.0])
[-0.0, 0.0]

Other solution?

@khachik's solution has inconsistent sorting behavior for NaN and -inf

>>> key=lambda x: float('-inf') if math.isnan(x) else x
>>> sorted([float('nan'), float('-inf')], key=key)
[nan, -inf]
>>> sorted([float('-inf'), float('nan')], key=key)
[-inf, nan]

Solution: more complex key function.

So there are problems with signs and nans. We can just include those in a key function:

def stable_float_sort_key(x: float):
    return math.copysign(1, x), math.isnan(x), x

That works on all the examples above:

>>> sorted([float('nan'), 0.0], key=stable_float_sort_key)
[0.0, nan]
>>> sorted([0.0, float('nan')], key=stable_float_sort_key)
[0.0, nan]
>>> sorted([float('nan'), float('-inf')], key=stable_float_sort_key)
[-inf, nan]
>>> sorted([float('-inf'), float('nan')], key=stable_float_sort_key)
[-inf, nan]
>>> sorted([0.0, -0.0], key=stable_float_sort_key)
[-0.0, 0.0]
>>> sorted([-0.0, 0.0], key=stable_float_sort_key)
[-0.0, 0.0]

Indeed, you can write a hypothesis test showing it's consistent across all floating point numbers:

import json
from hypothesis import given, settings
from hypothesis import strategies as st

@given(nums=st.lists(st.floats()), random=st.randoms())
@settings(max_examples=10000)
def test_stable_json_sorting(nums, random):
    shuffled = list(nums)
    random.shuffle(shuffled)
    l1 = sorted(nums, key=stable_float_sort_key)
    l2 = sorted(shuffled, key=stable_float_sort_key)
    assert json.dumps(l1) == json.dumps(l2)

It does have some oddities however, since some NaN's are negative! For example:

>>> sorted([float('nan'), -0.0, 0.0, float('-nan')], key=stable_float_sort_key)
[-0.0, nan, 0.0, nan]

If that bothers you, you can fix this by switching the ordering:

def stable_float_sort_key(x: float):
    return math.isnan(x), math.copysign(1, x), x

That sorts negative numbers first, followed by positive numbers, followed by NaNs.

Does any of this make sense?

Of course other answerers are correct that in some sense none of this makes sense. Comparison of NaNs is a conceptual error of some kind. However, even in cases where the question doesn't "make sense", you may want to have invariants like serializing sets of floating point numbers generated by the same code to exactly the same JSON representation, despite hash randomization (my use case). That's more of a formal property of python code, rather than something where there's a "right answer" per an IEEE standard.

Palmer answered 18/11, 2021 at 8:9 Comment(0)
Q
0

Regardless of standards, there are many cases where a user-defined ordering of float and NA values is useful. For instance, I was sorting stock returns and wanted highest to lowest with NA last (since those were irrelevant). There are 4 possible combinations

  1. Ascending float values, NA values last
  2. Ascending float values, NA values first
  3. Descending float values, NA values last
  4. Descending float values, NA values first

Here is a function that covers all scenarios by conditionally replacing NA values with +/- inf

import math 

def sort_with_na(x, reverse=False, na_last=True):
    """Intelligently sort iterable with NA values

    For reliable behavior with NA values, we should change the NAs to +/- inf
    to guarantee their order rather than relying on the built-in
    ``sorted(reverse=True)`` which will have no effect. To use the ``reverse``
    parameter or other kwargs, use functools.partial in your lambda i.e.

        sorted(iterable, key=partial(sort_with_na, reverse=True, na_last=False))

    :param x: Element to be sorted
    :param bool na_last: Whether NA values should come last or first
    :param bool reverse: Return ascending if ``False`` else descending
    :return bool:
    """
    if not math.isnan(x):
        return -x if reverse else x
    else:
        return float('inf') if na_last else float('-inf')

Testing out each of the 4 combinations

from functools import partial

a = [2, float('nan'), 1]
sorted(a, key=sort_with_na)                                         # Default
sorted(a, key=partial(sort_with_na, reverse=False, na_last=True))   # Ascend, NA last
sorted(a, key=partial(sort_with_na, reverse=False, na_last=False))  # Ascend, NA first
sorted(a, key=partial(sort_with_na, reverse=True, na_last=True))    # Descend, NA last
sorted(a, key=partial(sort_with_na, reverse=True, na_last=False))   # Descend, NA first
Quincey answered 21/11, 2020 at 16:9 Comment(0)
S
0

A resilient sort involves a compare of 2 items and returning: less, equal, greater.

If cmp(a,b) is "greater", then cmp(b,a) must be "less".

If cmp(a,b) is "zero", then cmp(b,a) must be "zero".

What is missing in answers to date is the case of comparing 2 floats which are both NANs and preserving the above properties. 2 NANs should compare as equal or perhaps based on some consistent interpretation of their payloads.

Alternate compare algorithm to put all NAN > +inf

if isnan(a)
  if isnan(b)
    return 0 (or maybe compare payloads/bit patterns)
  return 1
if isnan(b) return 1
if a > b return 1
if a < b return -1
return 0
Solanum answered 19/3, 2021 at 10:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.