What float values could not be converted to int without undefined behavior [c++]?

I just read this from the C++14 standard (my emphasis):

4.9 Floating-integral conversions [conv.fpint]

1 A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type. [...]

Which got me thinking

  1. Which, if any, float values could not be represented as int after truncation? (Does that depend on the implementation?)
  2. If there are any, does this mean that auto x = static_cast<int>(float) is unsafe?
  3. what is the proper/safe way of converting float to int then (assuming you want truncation)?
Palacio answered 31/1, 2018 at 17:58 Comment(10)
Consider the maximum magnitude of floats vs ints. See Range of values in: en.cppreference.com/w/cpp/language/types (3) see: en.cppreference.com/w/cpp/types/climitsEgidio
Also, I'd bet on NaN. NaN is always the weird one.Latanya
auto x = static_cast<int>(f); is just an obscure way of writing int x = f;. Neither is more or less safe than the other, regardless of how you define "safe".Theocritus
@PeteBecker I wouldn't call static_cast obscure. It's actually quite explicit that the cast is intended, whereas in the case of int x = f; it's anyone's guess whether it's an intended implicit cast or a mistake.Stomacher
@FrançoisAndrieux -- I didn't say that static_cast is obscure. I said that the statement is obscure; there's more in it than the static_cast.Theocritus
See this blog postInfighting
@PeteBecker static_cast is the correct way to do it; most compilers support and many (most?) companies use warnings that will flag int x = f.Thalia
@NirFriedman -- my company doesn't. Labelling "the way I prefer to do it" as "the correct way to do it" is inappropriate.Theocritus
@PeteBecker Congrats, some companies don't use -Werror either. Point? There are recognized good practices. Flagging narrowing conversions is widely (but not universally) recognized as a good practice.Thalia
Sounds like you mean "to an integer-valued float". Because there's no "proper way" to make an int whose value cannot be represented by int.Surgeon
H
4

It shouldn't be surprising at all that float has values outside of int range. Floating-point values were invented to represent very large (and also very small) values adequately.

  1. INT_MAX + 1 (usually equal to 2147483648) cannot be represented by int, but can be represented by float.
  2. Yes, static_cast<int>(float) is as unsafe as undefined behavior can be. However, something as simple as x + y for sufficiently large integers x and y is also UB, so no big surprise here either.
  3. The proper way to do stuff depends on the application, as always in C++. Boost has numeric_cast that throws an exception on overflow; this might be good for you. To do saturation (clamp out-of-range values to INT_MIN or INT_MAX), write some code like this:

    float f;
    int i;
    ...
    if (static_cast<double>(INT_MIN) <= f && f < static_cast<double>(INT_MAX))
        i = static_cast<int>(f);
    else if (f < 0)
        i = INT_MIN;
    else
        i = INT_MAX;
    

    However, this is not ideal. Does your system have double type that can represent the maximal value of int? If yes, it will work. Also, how exactly do you want to round values that are close to minimum or maximum of int? If you don't want to consider such questions, use boost::numeric_cast, as described here.
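If Boost is not an option, the "throw on overflow" idea can be sketched in plain C++. This is only an illustration, not Boost's implementation: the name checked_trunc_to_int and the choice of std::range_error are my own assumptions, and the code assumes a binary float whose range exceeds that of a two's complement int.

```cpp
#include <climits>
#include <cmath>
#include <stdexcept>

// Hypothetical helper (not part of Boost): truncate toward zero, or throw
// when the truncated value cannot be represented in int.
inline int checked_trunc_to_int(float f)
{
    // INT_MAX/2 + 1 is a power of two, so the conversion to float and the
    // multiplication by 2 are both exact: upper == INT_MAX + 1.
    const float upper = static_cast<float>(INT_MAX / 2 + 1) * 2.0f;
    // INT_MIN is a negated power of two on two's complement, so this is exact.
    const float lower = static_cast<float>(INT_MIN);

    if (std::isnan(f))
        throw std::range_error("NaN cannot be converted to int");
    // f - lower > -1.0f also accepts values strictly between INT_MIN - 1 and
    // INT_MIN, which truncate to INT_MIN. Infinities fail one of the two tests.
    if (!(f < upper && f - lower > -1.0f))
        throw std::range_error("float value out of int range");
    return static_cast<int>(f);
}
```

With a 32-bit int, checked_trunc_to_int(3.9f) returns 3, while passing 2147483648.0f or a NaN throws instead of invoking undefined behavior.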

Huddle answered 31/1, 2018 at 18:45 Comment(2)
I was perhaps not clear enough with "proper/safe what" in point 3. I should have added something like "assuming you want to keep to the truncation" (see edit).Palacio
Yes this works for common implementations, but fails when int and float are wide, like 64-bit. True that OP's alludes to that limitation.Crambo
P
8

We hit this a while back and I manually made some tables that have the exact bit patterns of floats at the edges of various conversions to various sizes of integers. Note this assumes IEEE 754 4-byte floats and 8-byte doubles and 2's complement signed integers (int32_t of 4 bytes and int64_t of 8 bytes).

If you need to convert the bit patterns to floats or doubles you'll need to either type pun them (technically UB) or memcpy them.

And to answer your question anything which is too big to fit in the target integer is UB on conversion, and the only time when the truncating to zero matters is double -> int32_t. So using the following values you can compare the float against the relevant min/max and only cast if they're in range.

Note that using INT_MIN/INT_MAX (or their modern limit counterparts) to cross convert and then compare doesn't always work, as the precision of floats at values of that size is very low.

Inf/NaN are also UB on conversion.

// float->int64 edgecases
static const uint32_t FloatbitsMaxFitInt64 = 0x5effffff; // [9223371487098961920] Largest float which still fits into a signed int64
static const uint32_t FloatbitsMinNofitInt64 = 0x5f000000; // [9223372036854775808] Smallest float which is too big for a signed int64
static const uint32_t FloatbitsMinFitInt64 = 0xdf000000; // [-9223372036854775808] Smallest float which still fits into a signed int64
static const uint32_t FloatbitsMaxNotfitInt64 = 0xdf000001; // [-9223373136366403584] Largest float which is too small for a signed int64

// float->int32 edgecases
static const uint32_t FloatbitsMaxFitInt32 = 0x4effffff; // [2147483520] the bit pattern of the largest float which still fits into a signed int32
static const uint32_t FloatbitsMinNofitInt32 = 0x4f000000; // [2147483648] the bit pattern of the smallest float which is too big for a signed int32
static const uint32_t FloatbitsMinFitInt32 = 0xcf000000; // [-2147483648] the bit pattern of the smallest float which still fits into a signed int32
static const uint32_t FloatbitsMaxNotfitInt32 = 0xcf000001; // [-2147483904] the bit pattern of the largest float which is too small for a signed int32

// double->int64 edgecases
static const uint64_t DoubleBitsMaxFitInt64 = 0x43dfffffffffffff; // [9223372036854774784] Largest double which fits into an int64
static const uint64_t DoubleBitsMinNofitInt64 = 0x43e0000000000000; // [9223372036854775808] Smallest double which is too big for an int64
static const uint64_t DoubleBitsMinFitInt64 = 0xc3e0000000000000; // [-9223372036854775808] Smallest double which fits into an int64
static const uint64_t DoubleBitsMaxNotfitInt64 = 0xc3e0000000000001; // [-9223372036854777856] largest double which is too small to fit into an int64

// double->int32 edgecases[when truncating(round towards zero)]
static const uint64_t DoubleBitsMaxTruncFitInt32 = 0x41dfffffffffffff; // [~2147483647.9999998] Largest double that when truncated will fit into an int32
static const uint64_t DoubleBitsMinTruncNofitInt32 = 0x41e0000000000000; // [2147483648.0000000] Smallest double that when truncated won't fit into an int32
static const uint64_t DoubleBitsMinTruncFitInt32 = 0xc1e00000001fffff; // [~-2147483648.9999995] Smallest double that when truncated will fit into an int32
static const uint64_t DoubleBitsMaxTruncNofitInt32 = 0xc1e0000000200000; // [-2147483649.0000000] Largest double that when truncated won't fit into an int32

// double->int32 edgecases [when rounding via bankers method(round to nearest, round to even on half)]
static const uint64_t DoubleBitsMaxRoundFitInt32 = 0x41dfffffffdfffff; // [~2147483647.4999998] Largest double that when rounded will fit into an int32
static const uint64_t DoubleBitsMinRoundNofitInt32 = 0x41dfffffffe00000; // [2147483647.5000000] Smallest double that when rounded won't fit into an int32 (the tie rounds to even, i.e. up to 2147483648)
static const uint64_t DoubleBitsMinRoundFitInt32 = 0xc1e0000000100000; // [-2147483648.5000000] Smallest double that when rounded will fit into an int32
static const uint64_t DoubleBitsMaxRoundNofitInt32 = 0xc1e0000000100001; // [~-2147483648.5000005] Largest double that when rounded won't fit into an int32

So for your example you want:

if( f >= B2F(FloatbitsMinFitInt32) && f <= B2F(FloatbitsMaxFitInt32))
    // cast is valid.

Where B2F is something like:

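#include <stdint.h> // uint32_t
#include <string.h> // memcpy

float B2F(uint32_t bits)
{
    static_assert(sizeof(float) == sizeof(uint32_t), "Weird arch");
    float f;
    memcpy(&f, &bits, sizeof(float)); // copy the bits; avoids type-punning UB
    return f;
}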

Note that this range check rejects NaN/Inf correctly (NaN fails every comparison, and the infinities lie outside the range) unless you're using a non-IEEE 754 mode of your compiler (e.g. -ffast-math on gcc or /fp:fast on msvc).
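Putting the pieces of this answer together, a minimal self-contained sketch for the float -> int32 case might look like the following. The helper names FitsInt32 and ToInt32Or are made up for illustration, and the same IEEE 754 binary32 assumption applies.

```cpp
#include <cstdint>
#include <cstring>

// Bit patterns taken from the float->int32 table (IEEE 754 binary32 assumed).
static const std::uint32_t FloatbitsMaxFitInt32 = 0x4effffff; // 2147483520
static const std::uint32_t FloatbitsMinFitInt32 = 0xcf000000; // -2147483648

// Reinterpret a 32-bit pattern as a float without type punning.
inline float B2F(std::uint32_t bits)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical helper: true when static_cast<std::int32_t>(f) is well defined.
// NaN fails both comparisons; +/-Inf fall outside the closed range.
inline bool FitsInt32(float f)
{
    return f >= B2F(FloatbitsMinFitInt32) && f <= B2F(FloatbitsMaxFitInt32);
}

// Hypothetical helper: convert when in range, otherwise return the fallback.
inline std::int32_t ToInt32Or(float f, std::int32_t fallback)
{
    return FitsInt32(f) ? static_cast<std::int32_t>(f) : fallback;
}
```

For example, ToInt32Or(-1.5f, 0) yields -1, while a NaN or 2147483648.0f yields the fallback value without ever executing the cast.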

Padding answered 31/1, 2018 at 18:37 Comment(7)
Thanks for the answer. Could you clarify what you meant with the bit about "when the truncating to zero matters" and the part about comparing to INT_MIN/INT_MAX?Palacio
The spacing of floats close to INT_MAX is something like 1 float every 150 numbers. So float hasn't got a fractional part for numbers of that magnitude. (But doubles with values close to them do.) INT_MAX cannot be represented by a float, so casting it to a float is lossy and therefore any calculation you do with it will have an error. And you must be exact to not give UB.Padding
OK, and what do you mean by "the only time when the truncating to zero matters is float -> int64_t"?Palacio
EDIT: I meant double -> int32!. For INT_MAX (~2147483647) the 2 closest floats are 2147483520 below it and 2147483648 above it. There's simply no floating point representation of 2147483647.xxx where truncation matters. For double -> int32 there are double values with things like 2147483647.99 and so as the standard says to truncate this means that 2147483647.99 WILL fit into an int32.Padding
OK, great. Very insightful answer, even if it assumes particular representationsPalacio
@Palacio I thought a while about this and it seemed really hard to get better than this. The language really doesn't make it easy to do this in a portable way.Padding
Right, and so your answer is very valuable, but I suppose I would rather offload that to boost::numeric_cast as suggested in another answer.Palacio
C
0
  1. Which, if any, float values could not be represented as int after truncation?

After the float value is truncated, the whole number value must be in the INT range [INT_MIN ... INT_MAX]. If outside this range, or not-a-number, conversion is UB.

  2. If there are any, does this mean that auto x = static_cast<int>(float) is unsafe?

Yes, for many float values.

  3. what is the proper/safe way of converting float to int then (assuming you want truncation)?

To test if float to int succeeds, test the limits with carefully constructed float values that are exact and incur no FP rounding in their derivation. No need for wider types like double.

Take advantage of the fact that INT_MIN is a negated power of 2 and INT_MAX is one less than a power of 2. Form the 2 limits exactly: INT_MIN_FLT and INT_MAXP1_FLT (INT_MAX plus 1).

With the common 32-bit int, the conversion is well specified for float in the -2,147,483,648.999... to +2,147,483,647.999... range, not just -2,147,483,648.0 to +2,147,483,647.0.

C-like answer, yet should be realizable in C++.

#include <climits>

// One more than INT_MAX, formed exactly: INT_MAX/2 + 1 is a power of 2
#define INT_MAXP1_FLT (static_cast<float>(INT_MAX/2 + 1) * 2.0f)
#define INT_MIN_FLT   (static_cast<float>(INT_MIN))
float f;
int i;

// Avoid this as INT_MIN_FLT - 1.0f may be inexact
// if (f < INT_MAXP1_FLT && f > INT_MIN_FLT - 1.0f) {

// f - INT_MIN_FLT is exact whenever f is near INT_MIN
if (f < INT_MAXP1_FLT && f - INT_MIN_FLT > -1.0f) {
  i = static_cast<int>(f);
} else if (f < 0) {
  i = INT_MIN;
} else if (f > 0) {
  i = INT_MAX;
} else {
  i = 0; // NaN case - best to do an isnan(f) test up front.
}

This approach works as long as the integer type's maximum (the relevant xxx_MAX) is less than FLT_MAX. E.g. we are not dealing with some 128-bit integer type like uint128_t.


This approach extends well to double, long double and the various integer types, both signed and unsigned.
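As a rough sketch of that extension, the same two exact limits can be formed generically. The name trunc_fits is hypothetical, and the code assumes a binary floating-point type whose range exceeds that of the integer type, plus two's complement for signed types.

```cpp
#include <limits>
#include <type_traits>

// Hypothetical helper: true when truncating f toward zero gives a value
// representable in Int, i.e. static_cast<Int>(f) is well defined.
template <typename Int, typename Flt>
bool trunc_fits(Flt f)
{
    static_assert(std::is_integral<Int>::value, "Int must be an integer type");
    static_assert(std::is_floating_point<Flt>::value, "Flt must be a floating-point type");

    // max/2 + 1 is a power of two, so the conversion and the doubling are
    // exact: max_p1 == max + 1.
    const Flt max_p1 =
        static_cast<Flt>(std::numeric_limits<Int>::max() / 2 + 1) * Flt(2);
    // min is 0 (unsigned) or a negated power of two (signed), so it is exact.
    const Flt min = static_cast<Flt>(std::numeric_limits<Int>::min());

    // Valid inputs lie strictly between min - 1 and max + 1.
    // Both comparisons are false for NaN, so NaN is rejected as well.
    return f < max_p1 && f - min > Flt(-1);
}
```

For instance, with a 32-bit int, trunc_fits<int>(2147483647.99) is true for a double argument, while trunc_fits<int>(2147483648.0) and trunc_fits<unsigned>(-1.0) are false.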

Crambo answered 29/9, 2022 at 7:42 Comment(0)
