How do I convert an arbitrary double to an integer while avoiding undefined behavior?

Let's say I've got a function that accepts a 64-bit integer, and I want to call it with a double with arbitrary numeric value (i.e. it may be very large in magnitude, or even infinite):

void DoSomething(int64_t x);

double d = [...];
DoSomething(d);

Paragraph 1 of [conv.fpint] in the C++11 standard says this:

A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type.

Therefore there are many values of d above that will cause undefined behavior. I would like conversion to saturate, so that values greater than std::numeric_limits<int64_t>::max() (called kint64max below), including infinity, become that value, and similarly with the minimum representable value. This seems the natural approach:

double clamped = std::min(d, static_cast<double>(kint64max));
clamped = std::max(clamped, static_cast<double>(kint64min));
DoSomething(clamped);

But, the next paragraph in the standard says this:

A prvalue of an integer type or of an unscoped enumeration type can be converted to a prvalue of a floating point type. The result is exact if possible. If the value being converted is in the range of values that can be represented but the value cannot be represented exactly, it is an implementation-defined choice of either the next lower or higher representable value.

So clamped may still wind up being kint64max + 1, and behavior may still be undefined.

What is the simplest portable way to do what I'm looking for? Bonus points if it also gracefully handles NaNs.

Update: To be more precise, I would like the following to all be true of an int64_t SafeCast(double) function that solves this problem:

  1. For any double d, calling SafeCast(d) does not perform undefined behavior according to the standard, nor does it throw an exception or otherwise abort.

  2. For any double d in the range [-2^63, 2^63), SafeCast(d) == static_cast<int64_t>(d). That is, SafeCast agrees with C++'s conversion rules wherever the latter is defined.

  3. For any double d >= 2^63, SafeCast(d) == kint64max.

  4. For any double d < -2^63, SafeCast(d) == kint64min.

I suspect the true difficulty here is figuring out whether d is in the range [-2^63, 2^63). As discussed in the question and in comments on other answers, I think using a cast of kint64max to double as the upper-bound test is a non-starter: the converted bound may equal 2^63, which lets through values whose conversion back to int64_t is undefined. It may be more promising to use std::pow(2, 63), but I don't know whether that is guaranteed to be exactly 2^63.
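
For concreteness, here is a small sketch of that concern (assuming IEEE-754 binary64 doubles and that the implementation-defined conversion rounds kint64max up; the printed results may differ elsewhere):

#include <cstdint>
#include <cstdio>
#include <limits>

int main() {
  constexpr int64_t kmax = std::numeric_limits<int64_t>::max();  // 2^63 - 1

  // 2^63 - 1 is not representable in binary64; the conversion typically
  // rounds up to exactly 2^63 (which way it rounds is implementation-defined).
  const double bound = static_cast<double>(kmax);
  std::printf("bound == 2^63? %d\n", bound == 9223372036854775808.0);  // usually 1

  // A double equal to exactly 2^63 then passes a "d <= bound" range check,
  // even though converting it to int64_t is undefined behavior.
  const double d = 9223372036854775808.0;  // exactly 2^63
  std::printf("d <= bound?    %d\n", d <= bound);  // usually 1
}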

Feudal answered 15/9, 2014 at 22:25 Comment(8)
static_cast kint64max + 1ULL (or (uint64_t) 1), which should be exactly representable, and then use std::nextafter to get the previous representable value, and clamp down to that. – Contribution
What @Contribution said. The standard doesn't guarantee it, but integer powers of 2 can be stored without loss up to the limits of the floating point format, in every floating point system I'm aware of. – Jerri
What about NaN and Infinity? – Jelene
@ArnonZilca min and max will work for Infinity but not NaN. You need separate testing for that case, and it's unclear what should be returned. – Jerri
What value would you want in the case of NaN? – Corticate
@T.C.: I don't quite grok your comment. Could you expand it to a detailed answer? – Feudal
@OliverCharlesworth: Zero would be fine. Whatever is convenient. – Feudal
@OliverCharlesworth It's not about the value, but about knowing the cast failed to set the int. – Jelene

It turns out this is simpler to do than I thought. Thanks to Michael O'Reilly for the basic idea of this solution.

The heart of the matter is figuring out whether the truncated double will be representable as an int64_t. You can do this easily using std::frexp:

#include <cmath>
#include <cstdint>
#include <limits>

static constexpr int64_t kint64min = std::numeric_limits<int64_t>::min();
static constexpr int64_t kint64max = std::numeric_limits<int64_t>::max();

int64_t SafeCast(double d) {
  // We must special-case NaN, for which the logic below doesn't work.
  if (std::isnan(d)) {
    return 0;
  }

  // Find the exponent exp such that
  //     d == x * 2^exp
  // for some x with abs(x) in [0.5, 1.0). Note that this implies that the
  // magnitude of d is strictly less than 2^exp.
  //
  // If d is infinite, the call to std::frexp is legal but the contents of exp
  // are unspecified.
  int exp;
  std::frexp(d, &exp);

  // If the magnitude of d is strictly less than 2^63, the truncated version
  // of d is guaranteed to be representable. The only representable integer
  // for which this is not the case is kint64min, but it is covered by the
  // logic below.
  if (std::isfinite(d) && exp <= 63) {
    return d;
  }

  // Handle infinities and finite numbers with magnitude >= 2^63.
  return std::signbit(d) ? kint64min : kint64max;
}
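
For illustration, here is a small usage sketch, appended after the definitions above (the test values are mine, and the commented results assume an IEEE-754 double):

#include <cstdio>

int main() {
  // In-range values agree with the built-in truncating conversion.
  std::printf("%lld\n", static_cast<long long>(SafeCast(123.9)));   // 123
  std::printf("%lld\n", static_cast<long long>(SafeCast(-123.9)));  // -123

  // Values at or beyond the boundaries saturate.
  std::printf("%lld\n", static_cast<long long>(SafeCast(9.3e18)));   // kint64max
  std::printf("%lld\n", static_cast<long long>(SafeCast(-9.3e18)));  // kint64min
  std::printf("%lld\n", static_cast<long long>(
      SafeCast(std::numeric_limits<double>::infinity())));           // kint64max

  // NaN is special-cased to 0.
  std::printf("%lld\n", static_cast<long long>(SafeCast(std::nan(""))));  // 0
}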
Feudal answered 27/10, 2014 at 9:9 Comment(0)

Here's a solution that doesn't satisfy all the criteria, along with an analysis of why not. See the accepted answer for a better solution.

#include <cmath>
#include <cstdint>
#include <limits>

// Define constants from the question.
static constexpr int64_t kint64min = std::numeric_limits<int64_t>::min();
static constexpr int64_t kint64max = std::numeric_limits<int64_t>::max();

int64_t SafeCast(double d) {
  // Handle NaN specially.
  if (std::isnan(d)) return 0;

  // Handle out of range below.
  if (d <= kint64min) return kint64min;

  // Handle out of range above.
  if (d >= kint64max) return kint64max;

  // At this point we know that d is in range.
  return d;
}

I believe this avoids undefined behavior. There is nothing to be wary of with casting integers to doubles in the range checks. Assuming sanity in the way that non-representable integers are converted (in particular that the mapping is monotonic), by the time the range checks are past, we can be sure that d is in [-2^63, 2^63), as required for the implicit cast at the end of the function.

I'm also confident that this clamps out of range values correctly.

The issue is criterion #2 from the update to my question. Consider an implementation where kint64max is not representable as a double, but kint64max - 1 is. Further, assume that on this implementation casting kint64max to a double yields the next lower representable value, i.e. kint64max - 1. Let d be 2^63 - 2 (i.e. kint64max - 1). Then SafeCast(d) is kint64max, because the range check converts kint64max to a double, yielding a value equal to d. But static_cast<int64_t>(d) is kint64max - 1.

Try as I might, I can't find a way to resolve this. Nor can I even write a unit test that checks my criteria, without the unit test executing undefined behavior. I feel like there is a deeper lesson to be learned here—something about the impossibility of detecting whether an action in a system will cause undefined behavior from inside the system itself, without causing undefined behavior.

Feudal answered 29/9, 2014 at 9:52 Comment(3)
If an implementation uses anything resembling IEEE-754 math, INT64_MAX won't be precisely representable but INT64_MIN will be; would there be any problem with checking whether d >= 0 and -d >= INT64_MIN and, if so, casting -d to an int64_t and then inspecting the result? That would avoid any need to extract an exponent. – Lanyard
INT64_MAX is required to be 0x7fffffffffffffff, and with IEEE-754 double the conversion result will round up by 1, to 0x1p63. (exploringbinary.com/floating-point-converter). A static_assert( kint64max <= (int64_t)(double)kint64max ); may be sufficient for this check, or maybe static_assert( kint64max <= (int64_t)(double)(kint64max - 1) ); to make sure even max-1 rounds up. (Although that's not the case for 80-bit extended precision, which can exactly represent every int64_t.) – Packer
Do you have a citation from a standard? I don't think 2^63-1 is guaranteed to round up; in fact I'm pretty sure I saw it round down while working on this problem. Anyway, please see the accepted answer for a better solution. – Feudal

Here's a solution without std::frexp. It exploits uint64_t for the tricky case.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>

int64_t SafeCast(double d) {
    if (std::isnan(d)) {
        return 0;
    }

    if (d < 0) {
        // Easy case to clamp, because std::numeric_limits<int64_t>::min() is
        // exactly representable as IEEE double.
        return static_cast<int64_t>(std::max<double>(d, std::numeric_limits<int64_t>::min()));
    }

    // Convert to uint64_t, clamping to range [0..2^63].
    uint64_t u = static_cast<uint64_t>(std::min<double>(d, static_cast<uint64_t>(1) << 63));

    // Clamp to int64_t.
    return std::min(u, (static_cast<uint64_t>(1) << 63) - 1U);
}
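
As a sanity check of the representability assumptions this relies on (my own addition, not part of the original technique), one could assert at compile time that -2^63 and 2^63 round-trip through double exactly; on an exotic implementation the first check could instead fail to compile inside the constant expression, which still surfaces the problem:

#include <cstdint>
#include <limits>

// -2^63 is a power of two; on IEEE-754 implementations it converts to double
// exactly, so converting that double back must recover the original value.
static_assert(static_cast<int64_t>(static_cast<double>(
                  std::numeric_limits<int64_t>::min())) ==
                  std::numeric_limits<int64_t>::min(),
              "int64_t min must round-trip through double exactly");

// Likewise 2^63, the clamp limit used above, should convert to double exactly.
static_assert(static_cast<uint64_t>(static_cast<double>(
                  static_cast<uint64_t>(1) << 63)) ==
                  (static_cast<uint64_t>(1) << 63),
              "2^63 must round-trip through double exactly");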
Yuji answered 1/2, 2023 at 17:42 Comment(2)
Actually, I wonder if int64min is guaranteed to be representable by the standard? I know I'm talking about theoretical problems here, but just in a language-lawyer sense. – Feudal
@Feudal - You are correct. If the target machine's double has a severely limited exponent range (e.g. IEEE FP16), or a non-power-of-two base (e.g. base 10), then int64min might not be exactly representable. IBM has decimal floating-point on some of their machines as an option, but under a different type name. See ibm.com/docs/en/zos/… . – Yuji

How about:

constexpr uint64_t weird_high_limit = (double)kint64max == (double)(kint64max-1);
int64_t clamped = (d >= weird_high_limit + kint64max)? kint64max: (d <= kint64min)? kint64min: int64_t(d);

I think this takes care of all the edge cases. If d < (double)kint64max, then (exact)d <= (exact)kint64max. The proof proceeds by contradiction, using the fact that (double)kint64max is either the next higher or the next lower representable value.

Tacnode answered 15/9, 2014 at 22:56 Comment(6)
If (double)kint64max goes to the next lower representable value, and d == (double)kint64max, then you get kint64max even though you should get int64_t(d) (which is smaller than kint64max). – Contribution
I think the standard allows for kint64max - 1 to be exactly representable as a double, but kint64max to not be. Then kint64max may convert to the double kint64max - 1 in the first comparison. So if d is kint64max - 1, it will incorrectly convert to kint64max. – Feudal
@T.C.: Well, that isn't pretty, but it can be explicitly tested for. – Tacnode
@Contribution Yeah, I was just adding that unsigned long long. – Tacnode
@jacobsa: But then the error introduced is no more than 0.5 ULP. Can't see how that is a problem. Tempted to revert back to the original, for that reason. – Tacnode
Well, sure, the error introduced is small, but of course it can be a problem if you assume a value is <= some other value, but it is not. – Feudal

boost::numeric_cast, that's how.

http://www.boost.org/doc/libs/1_56_0/libs/numeric/conversion/doc/html/boost_numericconversion/improved_numeric_cast__.html
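
Roughly, it can be wrapped to get the saturating behavior the question asks for. This is a sketch only: the wrapper name is mine, numeric_cast throws boost::numeric::positive_overflow / negative_overflow on out-of-range input by default, and I screen out NaN up front since I would not rely on the range check handling it. (Note that the comments below question whether numeric_cast's own range check is safe right at the 2^63 boundary.)

#include <boost/numeric/conversion/cast.hpp>
#include <cmath>
#include <cstdint>
#include <limits>

int64_t SafeCastViaBoost(double d) {
  // numeric_cast's range check does not define useful behavior for NaN,
  // so handle it explicitly.
  if (std::isnan(d)) return 0;
  try {
    return boost::numeric_cast<int64_t>(d);
  } catch (const boost::numeric::positive_overflow&) {
    return std::numeric_limits<int64_t>::max();
  } catch (const boost::numeric::negative_overflow&) {
    return std::numeric_limits<int64_t>::min();
  }
}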

Dang answered 28/9, 2014 at 16:16 Comment(4)
This was already an answer that was withdrawn. The source I found Googling [numeric_cast source] contains this expression for determining whether arg is in range: !(arg > result_traits::max()). I believe this may cause undefined behavior. If result_traits::max() becomes the double representing 2^63, then this will say that a value of 2^63 for arg is in range. But converting 2^63 to an int64_t is undefined behavior. – Feudal
@Feudal Ah, I don't have visibility into the withdrawn answer to see how that affected the flow of this discussion. – Dang
Yep, sorry, that's not really relevant. I guess what I'm saying is: it's good to know that boost::numeric_cast exists, but what I'm interested in here is what technique it uses. The source code I'm looking at does not look correct (or at least not portable), but I may be looking at the wrong source code. – Feudal
Perhaps there are some clues in boost.org/doc/libs/1_56_0/libs/numeric/conversion/doc/html/…. They mention an OverflowHandler policy and a Float2IntRounder policy. How the rounding happens is shown in boost.org/doc/libs/1_56_0/libs/numeric/conversion/doc/html/…. It is not immediately apparent if/where the documentation addresses the precision issues with range checking. – Dang
