Comparing uint64_t and float for numeric equivalence

Asked 27/9, 2015 at 17:11 Answered 28/9, 2015 at 9:22

I am writing a protocol that uses RFC 7049 as its binary representation. The standard states that the protocol may use a 32-bit floating-point representation of a number if its numeric value is equivalent to the respective 64-bit number. The conversion must not lead to loss of precision.

What 32-bit float numbers can be bigger than 64-bit integer and numerically equivalent with them?
Given float x; uint64_t y; is comparing (float)x == (float)y enough to ensure that the values are equivalent? Will this comparison ever be true?

RFC 7049 §3.6. Numbers

For the purposes of this specification, all number representations for the same numeric value are equivalent. This means that an encoder can encode a floating-point value of 0.0 as the integer 0. It, however, also means that an application that expects to find integer values only might find floating-point values if the encoder decides these are desirable, such as when the floating-point value is more compact than a 64-bit integer.

Laurasia answered 27/9, 2015 at 17:11 Comment(8)

What 32-bit float numbers can be bigger than 64-bit integer and numerically equivalent with them? None: by definition, a number X that's equal to Y can't be greater than Y. – Ichthyic 27/9, 2015 at 17:19

So long as a 32-bit float is an integer value, you won't lose precision converting it to a 64-bit integer. But if the original float is not an integer, you will lose precision. – Janetjaneta 27/9, 2015 at 17:19

Of course that comparison may be true. 1.0f == float(1ull) – Ichthyic 27/9, 2015 at 17:20

The question only makes sense if it is about which 64-bit integers can be represented as floats without loss of precision. This is also what the first paragraph states, the first bullet point is rather confusing. And there are obviously quite a few numbers for which that property is true (any power of 2 larger than 2^32 but smaller than 2^64 for one). – Despinadespise 27/9, 2015 at 17:20

The check that matters is this. If you take the original 64-bit integer value, convert it to float, then convert that float back to integer, and get the original value, then you can transmit the float in place of the integer; you can be sure that the other party can recover the original integer (because you've just tested it yourself). – Ichthyic 27/9, 2015 at 17:22

why do you think that respective 64bit numbers are integers? – Edla 27/9, 2015 at 18:50

Does this answer your question? How to properly compare an integer and a floating-point value? – Plasticizer 19/11, 2019 at 13:8

Compare a 32 bit float and a 32 bit integer without casting to double, when either value could be too large to fit the other type exactly – Plasticizer 19/11, 2019 at 13:8

There certainly are numbers for which this is true:

2^33 can be perfectly represented as a floating point number, but clearly cannot be represented as a 32-bit integer. The following code should work as expected:

bool representable_as_float(int64_t value) {
    float repr = value;
    return repr >= -0x1.0p63 && repr < 0x1.0p63 && (int64_t)repr == value;
}

It is important to notice though that we are basically doing (int64_t)(float)value and not the other way around - we are interested if the cast to float loses any precision.

The check that repr is below 2^63, and so within the range of int64_t, is important: the conversion to float may round up to the next representable value, which can be larger than the maximum value of int64_t, and casting such a value back to int64_t would invoke undefined behavior. (Thanks to @tmyklebu for pointing this out.)
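
To make the rounding concrete, here is a minimal sketch. It assumes IEEE 754 binary32 and the default round-to-nearest mode (the C standard leaves the rounding direction implementation-defined), and rounds_to_2p63 is a hypothetical helper, not part of the answer's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when converting `value` to float yields 2^63,
   which is one past INT64_MAX and so cannot be cast back to int64_t
   without undefined behavior. */
bool rounds_to_2p63(int64_t value) {
    float repr = (float)value;
    return repr == 0x1.0p63f;
}

/* Boundary cases, assuming binary32 and round-to-nearest. */
void check_rounding_boundary(void) {
    /* INT64_MAX = 2^63 - 1 has no exact float; the nearest one is 2^63. */
    assert(rounds_to_2p63(INT64_MAX));
    /* 2^63 - 2^39 is the largest float below 2^63 and converts exactly. */
    assert(!rounds_to_2p63(INT64_MAX - (INT64_C(1) << 39) + 1));
    /* 2^62 is a power of two and converts exactly. */
    assert(!rounds_to_2p63(INT64_C(1) << 62));
}
```

Any value for which this helper returns true is one that the range check in representable_as_float has to reject before the cast back to int64_t.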

Two samples:

// powers of 2 can easily be represented
assert(representable_as_float(((int64_t)1) << 33));
// Other numbers not so much:
assert(!representable_as_float(std::numeric_limits<int64_t>::max()));
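
For the unsigned case the question asks about, the same property can also be tested without any floating-point conversion, which sidesteps the undefined-behavior concern entirely. The following is only a sketch under the assumption of a 24-bit significand (IEEE 754 binary32); fits_in_binary32 is a name made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if `value` is exactly representable as an
   IEEE 754 binary32 float. Assumes a 24-bit significand and an exponent
   range that covers 2^63. */
bool fits_in_binary32(uint64_t value) {
    if (value == 0)
        return true;
    /* Trailing zero bits are absorbed by the exponent. */
    while ((value & 1) == 0)
        value >>= 1;
    /* What remains must fit in the 24-bit significand. */
    return value < (UINT64_C(1) << 24);
}

void check_fits_in_binary32(void) {
    assert(fits_in_binary32(UINT64_C(1) << 33)); /* power of two */
    assert(fits_in_binary32(16777216));          /* 2^24 */
    assert(!fits_in_binary32(16777217));         /* 2^24 + 1 needs 25 bits */
    assert(!fits_in_binary32(UINT64_MAX));       /* 64 significant bits */
}
```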

Despinadespise answered 27/9, 2015 at 17:23 Comment(4)

Don't you get UB by doing, say, (int64_t)(float)0x7fffffffffffffffLL? Here, the conversion to float rounds up, so the conversion to int64_t will overflow, which is UB. – Culpepper 27/9, 2015 at 17:31

@Culpepper Fair point. The conversion from int -> float is always safe (afaics?), but the other isn't. Which makes the whole thing rather more interesting. – Despinadespise 27/9, 2015 at 17:37

Umm... so, this can lead to UB, can it? – Laurasia 27/9, 2015 at 17:39

I think you need to check whether repr >= -0x1.0p63 && repr < 0x1.0p63 before doing the conversion to int64_t. – Culpepper 28/9, 2015 at 1:31

The following is based on Julia's method for comparing floats and integers. This does not require access to 80-bit long doubles or floating point exceptions, and should work under any rounding mode. I believe this should work for any C float type (IEEE754 or not), and not cause any undefined behaviour.

UPDATE: technically this assumes a binary float format, and that the float exponent size is large enough to represent 2⁶⁴: this is certainly true for the standard IEEE754 binary32 (which you refer to in your question), but not, say, binary16.

#include <stdio.h>
#include <stdint.h>

int cmp_flt_uint64(float x,uint64_t y) {
  return (x == (float)y) && (x != 0x1p64f) && ((uint64_t)x == y);
}

int main(void) {
  float x = 0x1p64f;
  uint64_t y = 0xffffffffffffffff;

  /* (float)y typically rounds up to 2^64, so the second check rejects this pair. */
  if (cmp_flt_uint64(x,y))
    printf("true\n");
  else
    printf("false\n");
  return 0;
}

The logic here is as follows:

The first equality can be true only if x is a non-negative integer in the interval [0,2⁶⁴].
The second checks that x (and hence (float)y) is not 2⁶⁴: if this is the case, then y cannot be represented exactly by a float, and so the comparison is false.
Any remaining values of x can be exactly converted to a uint64_t, and so we cast and compare.
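
The three steps can be spelled out one per line, together with compile-time checks for the assumptions mentioned in the UPDATE. This is a sketch rather than the answer's exact code: float_equals_uint64 and check_float_equals_uint64 are illustrative names, _Static_assert requires C11, and the expected results assume binary32 with round-to-nearest.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions from the UPDATE, checked at compile time (C11). */
_Static_assert(FLT_RADIX == 2, "binary float format assumed");
_Static_assert(FLT_MAX_EXP > 64, "float must be able to represent 2^64");

bool float_equals_uint64(float x, uint64_t y) {
    if (x != (float)y)    /* step 1: x must be an integer in [0, 2^64] */
        return false;
    if (x == 0x1p64f)     /* step 2: 2^64 is outside the uint64_t range */
        return false;
    return (uint64_t)x == y;  /* step 3: the cast is now well defined */
}

void check_float_equals_uint64(void) {
    assert(float_equals_uint64(0.0f, 0));
    assert(float_equals_uint64(0x1p63f, UINT64_C(1) << 63));
    /* (float)UINT64_MAX rounds to 2^64; step 2 rejects it. */
    assert(!float_equals_uint64(0x1p64f, UINT64_MAX));
    /* (float)16777217 rounds to 16777216; step 3 rejects it. */
    assert(!float_equals_uint64(16777216.0f, 16777217));
    /* Negative values fail step 1. */
    assert(!float_equals_uint64(-1.0f, 0));
}
```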

Pet answered 28/9, 2015 at 9:22 Comment(0)

-1

No, you need to compare (long double)x == (long double)y on an architecture where the mantissa of a long double can hold 63 bits. This is because some big long long ints will lose precision when you convert them to float, and compare as equal to a non-equivalent float, but if you convert to long double, it will not lose precision on that architecture.

The following program demonstrates this behavior when compiled with gcc -std=c99 -mssse3 -mfpmath=sse on x86, because these settings use wide-enough long doubles but prevent the implicit use of higher-precision types in calculations:

#include <assert.h>
#include <stdint.h>

const int64_t x = (1ULL<<62) - 1ULL;
const float y = (float)(1ULL<<62);
// The mantissa is not wide enough to store
// 63 bits of precision.

int main(void)
{
  assert ((float)x == (float)y);
  assert ((long double)x != (long double)y);

  return 0;
}
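
Whether long double is wide enough can be checked through <float.h> before relying on this comparison. The sketch below covers the uint64_t case from the question, which needs 64 significand bits rather than 63; both function names are made up for illustration, and the result of equal_via_long_double should only be trusted when long_double_holds_uint64 returns true.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>
#include <stdint.h>

/* True if long double can hold every uint64_t value exactly. This holds
   for the x87 80-bit format (LDBL_MANT_DIG == 64) but not where
   long double is the same as double (LDBL_MANT_DIG == 53). */
bool long_double_holds_uint64(void) {
    return FLT_RADIX == 2 && LDBL_MANT_DIG >= 64;
}

/* Compare in long double. Every float converts to long double exactly,
   so the only possible loss is in converting y. */
bool equal_via_long_double(float x, uint64_t y) {
    return (long double)x == (long double)y;
}

void check_equal_via_long_double(void) {
    /* 2^62 is exact in every binary format, so this holds everywhere. */
    assert(equal_via_long_double(0x1p62f, UINT64_C(1) << 62));
    /* 2^62 - 1 is distinguishable from 2^62 only with a wide significand. */
    assert(!long_double_holds_uint64() ||
           !equal_via_long_double(0x1p62f, (UINT64_C(1) << 62) - 1));
}
```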

Edit: If you don’t have wide enough long doubles, the following might work:

feclearexcept(FE_ALL_EXCEPT);
x == y;
fetestexcept(FE_INEXACT);

I think, although I could be mistaken, that an implementation could round off x during the conversion in a way that loses precision.

Another strategy that could work is to compare

extern uint64_t x;
extern float y;
const float z = (float)x;

y == z && (uint64_t)z == x;

This should catch losses of precision due to round-off error, but it could conceivably cause undefined behavior if the conversion to z rounds up. It will work if the conversion is set to round toward zero when converting x to z.
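
One way to keep the round trip free of undefined behavior without changing the rounding mode is to reject 2^64 before casting back. This is a sketch assuming a binary float format that can represent 2^64 (true for IEEE 754 binary32); uint64_survives_float is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if x converts to float and back unchanged. */
bool uint64_survives_float(uint64_t x) {
    float z = (float)x;
    /* z is never negative, so only the upper bound needs a guard:
       if the conversion rounded up to 2^64, the cast back would be UB. */
    return z < 0x1p64f && (uint64_t)z == x;
}

void check_uint64_survives_float(void) {
    assert(uint64_survives_float(0));
    assert(uint64_survives_float(UINT64_C(1) << 63));
    /* 2^64 - 2^40 is the largest float below 2^64. */
    assert(uint64_survives_float(UINT64_MAX - (UINT64_C(1) << 40) + 1));
    /* Rounds up to 2^64 under round-to-nearest; the guard rejects it. */
    assert(!uint64_survives_float(UINT64_MAX));
}
```

A value that passes this check can be sent as a 32-bit float, and the receiver can recover the original integer by the same cast.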

Cockchafer answered 27/9, 2015 at 18:13 Comment(12)

I’m not sure why I got downvoted, but perhaps the code sample demonstrating this behavior will change your mind? – Cockchafer 27/9, 2015 at 18:54

I didn't downvote (I actually upvoted the new version), but yes, the example is what makes your answer valuable. There are some tricks for comparing an integer and a floating-point value without a long double type that can represent exactly all values of each origin type. gynvael.coldwind.pl/?id=535 twitter.com/spun_off/status/467929922259144704 – Cheltenham 27/9, 2015 at 19:18

Your advice at the bottom checks ftrunc(x) == x but x is an integer in your examples (and I am not sure it works even assuming the roles of x and y are reversed, e.g. float x = 0x1.0p62, int64_t y = 0x3fffffffffffffff). – Cheltenham 27/9, 2015 at 19:21

Whoops, that bit on the bottom was the answer to a different question. – Cockchafer 27/9, 2015 at 19:29

Well, this is a bit off-topic. I want to convert a uint64_t to float and compare it to the original uint64_t to check whether I can pack a number using 32 bits instead of 64 without losing the original numeric value. – Laurasia 27/9, 2015 at 19:40

@Pascal Cuoc Thanks for the sample code! I’m not sure how portable it is, but I know long double isn't guaranteed to work for this everywhere. – Cockchafer 27/9, 2015 at 19:42

@Alexander Shishenko: Comparing to the result of a round-trip conversion works for that. – Cockchafer 27/9, 2015 at 19:45

@Lorehead I see, the example using feclearexcept/fetestexcept is very interesting. I will test it in my code soon. – Laurasia 27/9, 2015 at 19:47

And added the round-trip strategy. – Cockchafer 27/9, 2015 at 19:50

@Lorehead In what you call “the round-trip strategy”, (uint64_t)z can cause undefined behavior e.g. for x set to 0xffffffffffffffff causing z to be 0x1.0p64. – Cheltenham 27/9, 2015 at 19:52

You can still trigger UB in the round-trip solution by choosing float y = 0x1.0p64; and uint64_t x = 0xffffffffffffffff; – Cheltenham 27/9, 2015 at 20:12

@Pascal Cuoc: You are correct. I’d need to set the library to truncate on conversion. – Cockchafer 27/9, 2015 at 20:16
