How many values can be represented in a range when using 64-bit floating point type in the most efficient manner

Asked 7/9 at 22:27 Answered 8/9 at 1:11

Given a 64-bit floating point type (ieee-754), and a closed range [R0,R1] assuming both R0 and R1 are within the representable range and are not NaN or +/-inf etc.

How does one calculate the number of representable values with in the range that is better than simply looping over the range?

std::uint64_t count = 0;
while (r0 <= r1) 
{
    count++;
    r0 = std::nextafterf(r0,std::numeric_limits<double>infinity());
}

Stellite answered 7/9 at 22:27 Comment(1)

C++: Calculate number of possible floating-point values within given range offers a good C++ solution that does not use a C union. – Catacaustic 22/10 at 13:59

How many values can be represented in a range when using floating point type

A cheat assumes the double matches the common IEEE 64-bit layout, the endian and size are like (un)signed long long or (u)int64_t.

Mapping (via a union) each successive double from -INF to +INF then every increasing integer value we have the below. Note that +0 and -0 are considered the same value there. If you want +0 and -0 to differ for this count change if (y.ll < 0) { y.ll = LLONG_MIN - y.ll; } to if (y.ll < 0) { y.ll = LLONG_MIN - y.ll - 1; } (or something like that).

The below also returns 0 on ULP_diff(x,x), so you might want to add 1 to capture both end points.

#include <assert.h>
#include <limits.h>

unsigned long long ULP_diff(double x, double y) {
  assert(sizeof(double) == sizeof(long long));
  union {
    double d;
    long long ll;
    unsigned long long ull;
  } ux, uy;
  ux.d = x;
  uy.d = y;
  if (ux.ll < 0) {
    ux.ll = LLONG_MIN - ux.ll;
  }
  if (uy.ll < 0) {
    uy.ll = LLONG_MIN - uy.ll;
  }
  unsigned long long diff;
  if (ux.ll >= uy.ll) {
    diff = ux.ull - uy.ull;
  } else {
    diff = uy.ull - ux.ull;
  }
  return diff;
}

The above function is very useful in assessing a rating on how close 2 double values are to each other.

Some additional comments

It is not a numeric conversion in the usual sense. It is more like a one-to-one bit-mapping of double to 64-bit signed integer. The 2^52 has nothing to do with it. With 64-bit double there are about 2^64 bit finite values. The union here maps the ascending double values to the int64_t values from a very negative value to up near INT64_MAX. Think of 0.0 --> 0, next_after(0, 1) --> 1 and so on up to DBL_MAX --> a integer value near INT64_MAX. and the negatives double map to negative int64_t.

The positive double encoding maps cleanly to positive int64_t. Negative double are more like sign-magnitude encoding and so need a x.ll = LLONG_MIN - x.ll; to flip around for int64_t 2's complement encoding.

.d (in code)  .d (decimal FP) .d (hex FP)             .ull               .ll
-INFINITY     -inf           -inf                     0x8010000000000000 -9218868437227405312
-DBL_MAX      -1.79769e+308  -0x1.fffffffffffffp+1023 0x8010000000000001 -9218868437227405311
-1.0          -1             -0x1p+0                  0xC010000000000000 -4607182418800017408
-DBL_MIN      -2.22507e-308  -0x1p-1022               0xFFF0000000000000    -4503599627370496
-DBL_TRUE_MIN -4.94066e-324  -0x1p-1074               0xFFFFFFFFFFFFFFFF                   -1
-0.0          -0             -0x0p+0                  0x0000000000000000                    0
+0.0           0              0x0p+0                  0x0000000000000000                    0
 DBL_TRUE_MIN  4.94066e-324   0x1p-1074               0x0000000000000001                    1
 DBL_MIN       2.22507e-308   0x1p-1022               0x0010000000000000     4503599627370496
 1.0           1              0x1p+0                  0x3FF0000000000000  4607182418800017408
 DBL_MAX       1.79769e+308   0x1.fffffffffffffp+1023 0x7FEFFFFFFFFFFFFF  9218868437227405311
 INFINITY      inf            inf                     0x7FF0000000000000  9218868437227405312

Catacaustic answered 8/9 at 1:11 Comment(6)

I don't think so as the larger numbers get the large the distance between any two numbers. all integers under i think 2^52 can be represented, but after than only every second integer and then after some point only every 4th integer. etc - interesting approach though. – Stellite 8/9 at 2:18

@PennyDreudter This will definitely work, and I don't know why I didn't think of it. IEEE-754 floating-point values are "monotonically increasing" with their bitwise values. If you have two floating-point values x and y, and if x < y, then you will also have rep(x) < rep(y), where rep() means "take the bits of the floating-point representation and treat them as an unsigned integer" (which is what chux's code is doing with that union). Moreover, if y = nextafter(x), then rep(y) will equal rep(x)+1. – Modulator 8/9 at 3:31

@SteveSummit A tad more detail: the positive double encoding maps cleanly to positive int64_t. Negative double are more like sign-magnitude encoding and so need a x.ll = LLONG_MIN - x.ll; to flip around for int64_t 2's complement encoding. – Catacaustic 8/9 at 3:36

@chux-ReinstateMonica the extended explanation in the comments makes sense, would be nice if you could add them to the answer. – Stellite 8/9 at 21:52

one more question: given there are many NaNs is your result assuming each of those NaNs is a unique value? – Stellite 8/9 at 23:34

@PennyDreudter Yes, each NAN maps to a different integer value. Yet this function really should have an early isNAN(x) or isNAN(y) test. IMO, return 0 if both are any NAN and return a max difference otherwise. – Catacaustic 9/9 at 0:46

A binary floating-point format such as IEEE-754 can be thought of as consisting of a number of binades. A binade is a complete set of significand values, all with the same power-of-two exponent. So the half-open interval [1.0, 2.0) is one binade, as is [2.0, 4.0), as is [4.0, 8.0).

For double precision, each binade consists of 2⁵² significand values.

So to compute the number of values between R0 and R1, you can do that in three parts:

number of values between R0 and the next power of two greater than R0 (call this next power-of-two value x)
number of values between R1 and the previous power of two less than R1 (call this previous power-of-two value y)
number of values in full binades between x and y

The number of values between R0 and x is (x - R0) / res1, where res1 is the resolution in the binade containing R0, which is epsilon times the previous power of two less than R0.

The number of values between R1 and y is (R1 - y) / res2, where res2 is the resolution in the binade containing R1, which is epsilon times y.

And then part 3 is 2⁵² times the number of binades, and the number of binades is one less than the number of exact powers of two between R0 and R1.

For example, if R0 is 12.34 and R1 is 345.678, the powers of 2 between R0 and R1 are 16, 32, 64, 128, and 256. There are 5 of them, so there are 4 full binades. x is 16, and y is 256.

The "epsilon" I referred to is a property of the floating point format, and is (by definition) the resolution of the binade [1.0, 2.0). For double precision, epsilon is 2.22045e-16.

There are obviously still a fair number of details to get right, and some edge cases — in particular, the cases where there are one or zero powers of two between R0 and R1. But this should get you started.

Modulator answered 7/9 at 22:52 Comment(0)

Recommended topics

Hot tags