How to properly compare an integer and a floating-point value?

Asked 6/11, 2019 at 15:55 Answered 10/11, 2019 at 11:31

How do I compare an integer and a floating-point value the right way™?

The builtin comparsion operators give incorrect results in some edge cases, for example:

#include <iomanip>
#include <iostream>

int main()
{
    long long a = 999999984306749439;
    float     b = 999999984306749440.f; // This number can be represented exactly by a `float`.

    std::cout << std::setprecision(1000);
    std::cout << a << " < " << b << " = " << (a < b) << '\n';
    // Prints `999999984306749439 < 999999984306749440 = 0`, but it should be `1`.
}

Apparently, the comparsion operators convert both operands to a same type before actually comparing them. Here lhs gets converted to float, which causes a loss of precision, and leads to an incorrect result.

Even though I understand what's going on, I'm not sure how to work around this issue.

^{Disclaimer: The example uses a float and a long long, but I'm looking for a generic solution that works for every combination of an integral type and a floating-point type.}

Beulahbeuthel answered 6/11, 2019 at 15:55 Comment(22)

Depends what you want to do. Define the sort of "comparison" you're after. You cannot solve an unspecified problem. – Centner 6/11, 2019 at 16:1

Convert the floating point type into the integer type? – Sundae 6/11, 2019 at 16:2

My first idea would be to cast the float to a long long. – Resonator 6/11, 2019 at 16:2

@NathanOliver-ReinstateMonica But what if the integer type is not large enough? I'm not even sure how to check if it's large enough. – Beulahbeuthel 6/11, 2019 at 16:2

@LightnessRaceswithMonica I'm trying to check if one number is less than the other number, in the mathematical sense. – Beulahbeuthel 6/11, 2019 at 16:3

@Beulahbeuthel To what level of precision? To the level of all significant figures given in the respective initialisers? To something else? – Centner 6/11, 2019 at 16:4

Okay, convert both to a long double. Can't really break anything with that although you may lose some precision on the integer but that's to be expected when working with floating point types. – Sundae 6/11, 2019 at 16:5

How are you getting so many usable decimal places from a float o.O – Centner 6/11, 2019 at 16:5

@LightnessRaceswithMonica "How are you getting so many usable decimal places" By carefully picking the float. :P – Beulahbeuthel 6/11, 2019 at 16:6

@Beulahbeuthel lol ok- well that's another important point: will you be carefully picking all your other floats? Is that a prerequisite for your desired algorithm? There are loads of gaps in the requirements here – Centner 6/11, 2019 at 16:6

@LightnessRaceswithMonica Answering the question about precision: Every integer variable and every floating-point variable (unless it's a NaN or infinity) represents a specific number, which is finite and has a finite amount of digits after the decimal point. I want to compare those represented numbers. That means, with full precision (that is, without discarding 'garbage' digits from floats). – Beulahbeuthel 6/11, 2019 at 16:9

@uneven_mark What do you mean? Do you say I should specify if I used IEEE floats or not, or that I should break down that float to the mantissa & exponent and add it to the question? – Beulahbeuthel 6/11, 2019 at 16:15

@LightnessRaceswithMonica: OP picked 999999984306749440 which is an exact IEEE754 32 bit float, although the space is sparse in that region. The next one differs from it by the GDP of a small country. – Different 6/11, 2019 at 16:28

@Different Thank you we did cover that – Centner 6/11, 2019 at 16:34

@LightnessRaceswithMonica: I'll get me coat. – Different 6/11, 2019 at 16:36

This works: float c = a; std::cout << a << " < " << b << " = " << ((a-(long long)c) < (b-c)) << '\n'; It should be a generic fix to "implicit conversion lost precision". "One of the values is out of range for the other datatype" is a different problem that needs a different solution. – Drawn 6/11, 2019 at 16:46

@Different You've pulled? 😂 – Centner 6/11, 2019 at 16:59

@LightnessRaceswithMonica: Not since May 1996. Wouldn't know where to start these days. – Different 6/11, 2019 at 17:0

@Different Gosh good luck – Centner 6/11, 2019 at 17:2

@LightnessRaceswithMonica: Yup I married young! – Different 6/11, 2019 at 17:16

possible duplicate of Comparing uint64_t and float for numeric equivalence, Compare a 32 bit float and a 32 bit integer without casting to double, when either value could be too large to fit the other type exactly – Ulaulah 19/11, 2019 at 13:6

@Ulaulah Those are interesting, but I don't think mine is a dupe. The first one asks specifically about equality, and the second one only asks about two specific types. – Beulahbeuthel 19/11, 2019 at 14:18

Here's what I ended up with.

Credit for the algorithm goes to @chux; his approach appears to outperform the other suggestions. You can find some alternative implementations in the edit history.

If you can think of any improvements, suggestions are welcome.

#include <compare>
#include <cmath>
#include <limits>
#include <type_traits>

template <typename I, typename F>
std::partial_ordering compare_int_float(I i, F f)
{
    if constexpr (std::is_integral_v<F> && std::is_floating_point_v<I>)
    {
        return 0 <=> compare_int_float(f, i);
    }
    else
    {
        static_assert(std::is_integral_v<I> && std::is_floating_point_v<F>);
        static_assert(std::numeric_limits<F>::radix == 2);
        
        // This should be exactly representable as F due to being a power of two.
        constexpr F I_min_as_F = std::numeric_limits<I>::min();
        
        // The `numeric_limits<I>::max()` itself might not be representable as F, so we use this instead.
        constexpr F I_max_as_F_plus_1 = F(std::numeric_limits<I>::max()/2+1) * 2;

        // Check if the constants above overflowed to infinity. Normally this shouldn't happen.
        constexpr bool limits_overflow = I_min_as_F * 2 == I_min_as_F || I_max_as_F_plus_1 * 2 == I_max_as_F_plus_1;
        if constexpr (limits_overflow)
        {
            // Manually check for special floating-point values.
            if (std::isinf(f))
                return f > 0 ? std::partial_ordering::less : std::partial_ordering::greater;
            if (std::isnan(f))
                return std::partial_ordering::unordered;
        }

        if (limits_overflow || f >= I_min_as_F)
        {
            // `f <= I_max_as_F_plus_1 - 1` would be problematic due to rounding, so we use this instead.
            if (limits_overflow || f - I_max_as_F_plus_1 <= -1)
            {
                I f_trunc = f;
                if (f_trunc < i)
                    return std::partial_ordering::greater;
                if (f_trunc > i)
                    return std::partial_ordering::less;
            
                F f_frac = f - f_trunc;
                if (f_frac < 0)
                    return std::partial_ordering::greater;
                if (f_frac > 0)
                    return std::partial_ordering::less;
                    
                return std::partial_ordering::equivalent;
            }

            return std::partial_ordering::less;
        }
        
        if (f < 0)
            return std::partial_ordering::greater;
        
        return std::partial_ordering::unordered;
    }
}

If you want to experiment with it, here are a few test cases:

#include <algorithm>
#include <cmath>
#include <iomanip>
#include <iostream> 

void compare_print(long long a, float b, int n = 0)
{
    if (n == 0)
    {
        auto result = compare_int_float(a,b);
        static constexpr std::partial_ordering values[] = {std::partial_ordering::less, std::partial_ordering::equivalent, std::partial_ordering::greater, std::partial_ordering::unordered};
        std::cout << a << ' ' << "<=>?"[std::find(values, values+4, result) - values] << ' ' << b << '\n';
    }
    else
    {
        for (int i = 0; i < n; i++)
            b = std::nextafter(b, -INFINITY);
            
        for (int i = 0; i <= n*2; i++)
        {
            compare_print(a, b);
            b = std::nextafter(b, INFINITY);
        }
        
        std::cout << '\n';
    }
}

int main()
{    
    std::cout << std::setprecision(1000);
    
    compare_print(999999984306749440,
                  999999984306749440.f, 2);
                  
    compare_print(999999984306749439,
                  999999984306749440.f, 2);
                  
    compare_print(100,
                  100.f, 2);
    
    compare_print(-100,
                  -100.f, 2);
                  
    compare_print(0,
                  0.f, 2);
                                    
    compare_print((long long)0x8000'0000'0000'0000,
                  (long long)0x8000'0000'0000'0000, 2);
                                    
    compare_print(42, INFINITY);
    compare_print(42, -INFINITY);
    compare_print(42, NAN);
    std::cout << '\n';

    compare_print(1388608,
                  1388608.f, 2);
    
    compare_print(12388608,
                  12388608.f, 2);
}

^{(run the code)}

Beulahbeuthel answered 6/11, 2019 at 20:17 Comment(12)

Very nice indeed Sir! – Different 6/11, 2019 at 20:27

Looks good, but you can suppress the result_if_eq: (F) i will never have a fractional part: either F has more precision than I, then (F) i conversion is exact (no fraction part) or I has more bits than F precision, but in this case (F) i cannot have a fraction part either (the ulp is > 1). – Volny 7/11, 2019 at 15:1

Since you are in the case (F) i == f, if (F) i does not have a fraction part, neither has f – Volny 7/11, 2019 at 15:54

@Volny You're right, I didn't think about that. Edited. – Beulahbeuthel 7/11, 2019 at 15:54

You can also safely remove the isinf(f) test, this is already covered by (F) i < f and (F) i > f tests, unless the integer type has so many bits that it could overflow max float (unlikely, but you could eventually cover this condition with a static assertion) – Volny 7/11, 2019 at 19:28

There are other implementation defined optimizations that could be covered by static assertions: assuming two complement, and F radix = 2, then INTEGER_MIN magnitude is a power of two and (I) (F) INTEGER_MIN cannot overflow (it is exact). Thus no need to cover float_too_large condition for negative f. BTW, did you test INTEGER_MIN case? – Volny 7/11, 2019 at 19:47

@Volny Regarding the isinf(f) test - good point, I've put it behind a constexpr condition. Regarding INTEGER_MIN - I don't quite get what changes you suggest. I've added it to the test cases in the answer for the convenience. (At first I though you're suggesting removing the + (std::is_signed part, but doing it gives an incorrect result.) – Beulahbeuthel 7/11, 2019 at 19:58

@Beulahbeuthel I mean that integer overflow in (I) f can only happen when f == (F) INTEGER_MAX, not when f == (F) INTEGER_MIN under above assumptions, so we need to care only of former case. Beside INTEGER_MIN and INTEGER_MAX generally differ in 1 bit of magnitude, so I'm unsure about current handling... – Volny 7/11, 2019 at 20:31

@Volny Hmm. Here are some more test cases. The current solution appears to work on them. – Beulahbeuthel 7/11, 2019 at 20:41

Ah OK, you harcoded the two-complement assumption via + (std::is_signed_v && frac == F(-0.5)), I see it now. The condition exp ==... cannot happen for numeric_limits::min() in case of two-complement AND numeric_limits::radix==numeric_limit<F>::radix, because (F) i == f and (F) numeric_limits::min() is then exact. Both implementation defined assumptions are implcit in your code because you compare float max exponent and int digits, which is only possible if sharing same radix... – Volny 7/11, 2019 at 21:59

I think (std::numeric_limits::max()/2+1) * 2; needs review to handle cases where range of I exceed F. Consider int128_t and float. In my code I cheated and used *2.0 to use double math when F is float. – Deach 8/11, 2019 at 16:10

@chux-ReinstateMonica I think I've handled that by adding the limits_overflow thing. I've not tested it though. – Beulahbeuthel 8/11, 2019 at 16:14

(Restricting this answer to positive numbers; generalisation is trivial.)

Get the number of bits in your exponent for the float on your platform along with the radix. If you have an IEEE754 32 bit float then this is a trivial step.
Use (1) to compute the largest non-integer value that can be stored in your float. std::numeric_limits doesn't specify this value, annoyingly, so you need to do this yourself. For 32 bit IEEE754 you could take the easy option: 8388607.5 is the largest non-integral type float.
If your float is less than or equal to (2), then check if it's an integer or not. If it's not an integer then you can round it appropriately so as not to invalidate the <.
At this point, the float is an integer. Check if it's within in the range of your long long. If it's out of range then the result of < is known.
If you get this far, then you can safely cast your float to a long long, and make the comparison.

Different answered 6/11, 2019 at 16:5 Comment(2)

Looks promising; I need to think about it for a bit. I guess I'll try to implement your idea in code and will post it as an answer. – Beulahbeuthel 6/11, 2019 at 16:39

@HolyBlackCat: Which I'd upvote, ping me when you've done it! Personally I'd collapse the first two steps to the simple 8388607.5 check. – Different 6/11, 2019 at 16:42

To compare a FP f and integer i for equality:

(Code is representative and uses comparison of float and long long as an example)

If f is a NaN, infinity, or has a fractional part (perhaps use frexp()), f is not equal to i.

float ipart;
// C++
if (frexp(f, &ipart) != 0) return not_equal;
// C
if (frexpf(f, &ipart) != 0) return not_equal;

Convert the numeric limits of i into exactly representable FP values (powers of 2) near those limits.^** Easy to do if we assume FP is not a rare base 10 encoding and range of double exceeds the range on the i. Take advantage that integer limits magnitudes are or near Mersenne Number. (Sorry example code is C-ish)
```
#define FP_INT_MAX_PLUS1 ((LLONG_MAX/2 + 1)*2.0)
#define FP_INT_MIN (LLONG_MIN*1.0)
```

Compare f to is limits

if (f >= FP_INT_MAX_PLUS1) return not_equal;
if (f < FP_INT_MIN) return not_equal;

Convert f to integer and compare
```
return (long long) f == i;
```

To compare a FP f and integer i for <, >, == or not comparable:

(Using above limits)

Test f >= lower limit
```
if (f >= FP_INT_MIN) {
```

Test f <= upper limit

  // reform below to cope with effects of rounding
  // if (f <= FP_INT_MAX_PLUS1 - 1)
  if (f - FP_INT_MAX_PLUS1 <= -1.0) {

Convert f to integer/fraction and compare

    // at this point `f` is in the range of `i`
    long long ipart = (long long) f;
    if (ipart < i) return f_less_than_i;
    if (ipart > i) return f_more_than_i;

    float frac = f - ipart;
    if (frac < 0) return f_less_than_i;
    if (frac > 0) return f_more_than_i;
    return equal;
  }

Handle edge cases

  else return f_more_than_i;
}
if (f < 0.0) return f_less_than_i;
return not_comparable;

Simplifications possible, yet I wanted to convey the algorithm.

^** Additional conditional code needed to cope with non 2's complement integer encoding. It is quite similar to the MAX code.

Deach answered 8/11, 2019 at 0:32 Comment(1)

I benchmarked it, and your approach seems to be faster than what I used to have in the answer (even after I removed frexp from mine). I've C++-ified it a bit and added it to my answer. – Beulahbeuthel 8/11, 2019 at 15:55

The code below works with integer data types of at most 64 bits and floating point data types of at most ieee-754 double precision accuracy. For wider data types the same idea can be used, but you'll have to adapt he code. Since I'm not very familiar with C++, the code is written in C. It shouldn't be too difficult to convert it to a C++ style code. The code is branchless, which might be a performance benefit.

#include <stdio.h>
// gcc -O3 -march=haswell cmp.c
// Assume long long int is 64 bits.
// Assume ieee-754 double precision.
int long_long_less_than_double(long long int i, double y) {
    long long i_lo = i & 0x00000000FFFFFFFF;   // Extract lower 32 bits.
    long long i_hi = i & 0xFFFFFFFF00000000;   // Extract upper 32 bits.
    double x_lo = (double)i_lo;                // Exact conversion to double, no rounding errors!
    double x_hi = (double)i_hi;                // 
    return ( x_lo < (y - x_hi) );              // If i is close to y then y - x_hi is exact,
                                               // due to Sterbenz' lemma.
    // i < y
    // i_lo +i_hi < y      
    // i_lo < (y - i_hi)
    // x_lo < (y - x_hi)
}

int long_long_equals_double(long long int i, double y) {
    long long i_lo = i & 0x00000000FFFFFFFF;   
    long long i_hi = i & 0xFFFFFFFF00000000;   
    double x_lo = (double)i_lo;                    
    double x_hi = (double)i_hi;                    
    return ( x_lo == (y - x_hi) );                  
}                                                  


int main()
{
    long long a0 = 999999984306749439;
    long long a1 = 999999984306749440;    // Hex number: 0x0DE0B6B000000000
    long long a2 = 999999984306749441;
    float     b = 999999984306749440.f;   // This number can be represented exactly by a `float`.

    printf("%lli less_than %20.1f = %i\n", a0, b, long_long_less_than_double(a0, b));  // Implicit conversion from float to double
    printf("%lli less_than %20.1f = %i\n", a1, b, long_long_less_than_double(a1, b));

    printf("%lli equals    %20.1f = %i\n", a0, b, long_long_equals_double(a0, b));
    printf("%lli equals    %20.1f = %i\n", a1, b, long_long_equals_double(a1, b));
    printf("%lli equals    %20.1f = %i\n\n", a2, b, long_long_equals_double(a2, b));


    long long c0 = 1311693406324658687;
    long long c1 = 1311693406324658688;   // Hex number: 0x1234123412341200
    long long c2 = 1311693406324658689; 
    double     d = 1311693406324658688.0; // This number can be represented exactly by a `double`.

    printf("%lli less_than %20.1f = %i\n", c0, d, long_long_less_than_double(c0, d));
    printf("%lli less_than %20.1f = %i\n", c1, d, long_long_less_than_double(c1, d));

    printf("%lli equals    %20.1f = %i\n", c0, d, long_long_equals_double(c0, d));
    printf("%lli equals    %20.1f = %i\n", c1, d, long_long_equals_double(c1, d));
    printf("%lli equals    %20.1f = %i\n", c2, d, long_long_equals_double(c2, d));


    return 0;
}

The idea is to split the 64 bits integer i in 32 upper bits i_hi and 32 lower bits i_lo, which are converted to doubles x_hi and x_lo without any rounding errors. If double y is close to x_hi, then the floating point subtraction y - x_hi is exact, due to Sterbenz' lemma. So, instead of x_lo + x_hi < y, we can test for x_lo < (y - x_hi), which is more accurate! If double y is not close to x_hi then y - x_hi is inacurate, but in that case we don't need the accuracy because then |y - x_hi| is much larger than |x_lo|. In other words: If i and y differ much than we don't have to worry about the value of the lower 32 bits.

Output:

    999999984306749439 less_than 999999984306749440.0 = 1
    999999984306749440 less_than 999999984306749440.0 = 0
    999999984306749439 equals    999999984306749440.0 = 0
    999999984306749440 equals    999999984306749440.0 = 1
    999999984306749441 equals    999999984306749440.0 = 0

    1311693406324658687 less_than 1311693406324658688.0 = 1
    1311693406324658688 less_than 1311693406324658688.0 = 0
    1311693406324658687 equals    1311693406324658688.0 = 0
    1311693406324658688 equals    1311693406324658688.0 = 1
    1311693406324658689 equals    1311693406324658688.0 = 0

Thereabout answered 10/11, 2019 at 11:31 Comment(5)

This is clever. It's the simplest algorithm so far, but it appears to be slightly slower than what @chux has suggested. – Beulahbeuthel 10/11, 2019 at 12:11

This also work for comparing long long to float: first convert float to double (exact) then compare long long to double... Same for shorter int type. – Volny 11/11, 2019 at 21:3

@aka.nice. Indeed that is what I meant. Calling long_long_less_than_double with a float argument converts the float to double. Thanks for the clarification. – Thereabout 16/11, 2019 at 9:39

@Beulahbeuthel Nice to see the benchmark! One possible advantage of my solution is that in principle SIMD vectorization, e.g. AVX2 wth intel intrinsics is possible, see also here. (Probably chux' solution is also well vectorizable) This means that you can do 4 compares simultaneously, but this will only be successful if it is possible to vectorize the surrounding code too. Note that in the real world the performance might be slightly worse or better than the micro benchmark suggests. – Thereabout 16/11, 2019 at 9:53

I think the code here produces the same result as your code. This code is still slower than chux' idea, but you can see that it is branchless, and therefore it is suitable for vectorization. This might lead to a significant speedup, if it is possible to vectorize the performance critical surrounding code too, as I wrote before. – Thereabout 16/11, 2019 at 10:5

This is how I solved it recently in opensmalltalk VM for comparing bounded integers:

convert the integer as floating point (values is rounded, thus maybe inexact)
compare if both float values are equal
if they are not, there is no ambiguity whatever the rounding error, thus perform the comparison of floating point values and return the result
if they are equal, then convert the floating point as integer and perform comparison of integer values

The last point may lead to a difficulty: the conversion floating point->integer might lead to an integer overflow. You must thus make sure that you use a larger integer type for that edge cases, or fallback to Bathseba's algorithm.

In OpenSmalltalk VM, that's not a problem because SmallInteger are on 61 bits at most, so I did not attempt to solve it.

I have a Smallissimo blog entry giving additional pointers:

How to compare exact value of SmallInteger and Float in Smalltalk

For unbounded (arbitrarily large) integers, the comparison is performed in Integer, but there are a few tricks to accelerate the comparison. This is not done in the VM but in Smalltalk code (Squeak is a good example).

Volny answered 7/11, 2019 at 6:56 Comment(0)

-3

Use double, not float. Take the double value + 0.5. Truncate it by static cast to long long. Now compare the two long longs.

Bareilly answered 6/11, 2019 at 16:6 Comment(6)

Fails on edge cases sadly. – Different 6/11, 2019 at 16:7

What is the value is too big to be represented by the integer type? – Sundae 6/11, 2019 at 16:7

even double will hardly hold this large numbers with enough precision. – Resonator 6/11, 2019 at 16:7

Just throwing more bits at the problem is not a good solution unless you have actually verified that it'll give you enough for all your possible inputs. There's no evidence of such reasoning here. – Centner 6/11, 2019 at 16:7

@NathanOliver-ReinstateMonica Then you know the comparison is false because the other operand couldn't have held the value – Centner 6/11, 2019 at 16:9

OK, sorry, I didn't suggest to vet the input. If the numbers are not guaranteed somehow to be reasonable, then by all means vet the input with use of std::numeric_limits. – Bareilly 6/11, 2019 at 16:23

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags