Is a float guaranteed to be preserved when transported through a double in C/C++?

Asked 8/2, 2013 at 13:0 Answered 11/2, 2013 at 23:12

Solved c++ c floating-point double ieee-754

Assuming IEEE-754 conformance, is a float guaranteed to be preserved when transported through a double?

In other words, will the following assert always be satisfied?

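#include <assert.h>

float some_random_float(void); /* placeholder: returns an arbitrary float */

int main(void)
{
    float f = some_random_float();
    assert(f == (float)(double)f);
    return 0;
}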

Assume that f could acquire any of the special values defined by IEEE, such as NaN and Infinity.

According to IEEE, is there a case where the assert will be satisfied, but the exact bit-level representation is not preserved after the transportation through double?

The code snippet is valid in both C and C++.

Usurpation answered 8/2, 2013 at 13:0 Comment(13)

IEEE-754 doesn't specify what happens when languages cast. – Hutchens 8/2, 2013 at 13:2

@DavidHeffernan: Ok, but if we assume IEEE and C++03/C++11 conformance, can anything be said about it then? – Usurpation 8/2, 2013 at 13:5

As I understand it, all floats are exactly representable as doubles and I think you can draw the conclusion from there. – Hutchens 8/2, 2013 at 13:6

@DavidHeffernan: I think so too, but I was hoping for confirmation (or the opposite). – Usurpation 8/2, 2013 at 13:7

Note that the assert will fail whenever f is a NaN, regardless of what the conversions do. – Lawyer 8/2, 2013 at 13:42

@LightnessRacesinOrbit, agreed, I don't think the code snippet is valid C. You cannot do double(f) in C; it is a syntax error. – Woodward 8/2, 2013 at 14:58

@Josh: Yes, but more importantly, even just the rules of casting can differ wildly between the two languages. "The question is about either C or C++" should never be the default approach. – Hachure 8/2, 2013 at 15:2

"According to IEEE, is there a case where the assert will be satisfied, but the exact bit-level representation is not preserved" - Well, in fact NaNs and 0s don't need to be bit exact to be classified as NaNs and 0s, with the corresponding implications for == (NaN != NaN, 0 == -0). – Mimicry 8/2, 2013 at 15:3

@LightnessRacesinOrbit, to be pedantic (BTW, I agree the question is flawed) the user is constructing a new double and a new float, not casting (AFAIK, in the strict sense of the word for the language of C++)? So what happens depends on the constructor, not the cast? – Woodward 8/2, 2013 at 15:6

@JoshPetitt Well, in the strict sense of the standard he is indeed casting and not constructing, since floats and doubles don't have constructors. float(x) is a "function-style cast expression" and not a constructor call. – Mimicry 8/2, 2013 at 15:11

By popular request I have modified the code snippet such that it is valid in both C and C++. The original meaning is intended to be preserved 100%.

I don't remember any such request. The request is to choose one language. – Hachure 8/2, 2013 at 15:27

@LightnessRacesinOrbit: Well, in the event that there is a different answer for C and C++, I am very interested in the details of that difference. That's why I want to preserve it as a question targeting both languages. – Usurpation 8/2, 2013 at 15:29

Remember that checking for equality in doubles/floats is unstable. Check this #17833 – Grabble 12/2, 2013 at 19:46

You don't even need to assume IEEE. C89 says in 3.1.2.5:

The set of values of the type float is a subset of the set of values of the type double

And every other C and C++ standard says equivalent things. As far as I know, NaNs and infinities are "values of the type float", albeit values with some special-case rules when used as operands.

The fact that the float -> double -> float conversion restores the original value of the float follows (in general) from the fact that numeric conversions all preserve the value if it's representable in the destination type.
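The round trip described above can be spot-checked with a small sketch. The names survives_round_trip and check_round_trips are made up for illustration. NaN is tested with std::isnan because NaN never compares equal to itself, and the sketch assumes the default floating-point mode (no flush-to-zero).

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: true if f comes back unchanged from float -> double -> float.
// NaN is handled separately, since NaN == NaN is always false.
bool survives_round_trip(float f)
{
    const double widened = static_cast<double>(f);
    const float narrowed = static_cast<float>(widened);
    if (std::isnan(f))
        return std::isnan(narrowed);
    return f == narrowed;
}

// Spot-check a few ordinary and special values.
void check_round_trips()
{
    typedef std::numeric_limits<float> lim;
    assert(survives_round_trip(0.1f));
    assert(survives_round_trip(lim::max()));
    assert(survives_round_trip(lim::min()));
    assert(survives_round_trip(lim::infinity()));
    assert(survives_round_trip(-lim::infinity()));
    assert(survives_round_trip(lim::quiet_NaN()));
}
```

Calling check_round_trips() from main is enough to exercise it. A spot check like this illustrates the guarantee but does not prove it; the guarantee itself comes from the wording of the standard.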

Bit-level representations are a slightly different matter. Imagine that there's a value of float that has two distinct bitwise representations. Then nothing in the C standard prevents the float -> double -> float conversion from switching one to the other. In IEEE that won't happen for "actual values" unless there are padding bits, but I don't know whether IEEE rules out a single NaN having distinct bitwise representations. NaNs don't compare equal to themselves anyway, so there's also no standard way to tell whether two NaNs are "the same NaN" or "different NaNs" other than maybe converting them to strings. The issue may be moot.
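To look at the bit-level question separately from ==, the object representation can be copied out with memcpy. This is a minimal sketch, assuming float is a 32-bit type with no padding bits; float_bits and same_bits_after_round_trip are illustrative names, not standard functions.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: the raw object representation of a float.
// Assumes float occupies exactly 32 bits.
std::uint32_t float_bits(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Hypothetical helper: true if the bit pattern is identical after
// float -> double -> float, which is a stronger condition than ==.
bool same_bits_after_round_trip(float f)
{
    const float back = static_cast<float>(static_cast<double>(f));
    return float_bits(f) == float_bits(back);
}
```

On an IEEE-754 implementation, 0.0f == -0.0f holds while float_bits(0.0f) and float_bits(-0.0f) differ, so == alone cannot tell you whether the bits were preserved. For NaNs the result of same_bits_after_round_trip is implementation-specific, so no outcome is claimed here.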

One thing to watch out for is non-conforming modes of compilers, in which they keep super-precise values "under the covers", for example intermediate results left in floating-point registers and reused without rounding. I don't think that would cause your example code to fail, but as soon as you're doing floating-point == it's the kind of thing you start worrying about.

Essence answered 8/2, 2013 at 13:20 Comment(7)

Cool, that was exactly what I was hoping for. Could you add the section number from the standard? – Usurpation 8/2, 2013 at 13:21

@KristianSpangsege: done for C89. If you want it for the other 4 standards you're warmly welcome to look them all up yourself ;-). In each case I reckon it will be in the section that introduces and lists the floating-point types. – Essence 8/2, 2013 at 13:24

FWIW, NaNs usually do have multiple representations. On Intel architectures, infinities and NaNs have all 1's in their exponent. Infinities have a fraction of 0, and all non-0 fractions are NaNs. If the second-highest bit of the fraction is 0 it's a signalling NaN; if the second-highest bit is 1 it's a quiet NaN. (The highest bit in the fraction is funky because for floats and doubles it's not stored, but deduced from context [for subnormals it's 0, for normals it's 1]). – Affra 8/2, 2013 at 14:33
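The encoding described in this comment can be sketched as a classifier over the raw 32 bits of a float. It assumes the IEEE-754 binary32 layout (1 sign bit, 8 exponent bits, 23 stored fraction bits) and the Intel convention for the quiet bit. The comment counts the implicit leading bit as the highest, so its "second-highest" bit is the most significant stored fraction bit (bit 22) here. FloatKind and classify_bits are hypothetical names.

```cpp
#include <cstdint>

enum class FloatKind { Finite, Inf, QuietNaN, SignalingNaN };

// Hypothetical helper: classify a binary32 bit pattern by exponent and fraction.
FloatKind classify_bits(std::uint32_t bits)
{
    const std::uint32_t exponent = (bits >> 23) & 0xFFu;   // 8 exponent bits
    const std::uint32_t fraction = bits & 0x7FFFFFu;       // 23 stored fraction bits

    if (exponent != 0xFFu)
        return FloatKind::Finite;        // zero, subnormal, or normal
    if (fraction == 0)
        return FloatKind::Inf;           // all-ones exponent, zero fraction
    // All-ones exponent with a non-zero fraction is a NaN; bit 22 marks it quiet.
    return (fraction & 0x400000u) ? FloatKind::QuietNaN : FloatKind::SignalingNaN;
}
```

With 23 fraction bits and a sign bit, this leaves many distinct bit patterns that are all NaNs, which is the point the comment makes.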

@PeteBecker: that's why I'm vaguely wondering whether those are two values, each with many different representations, or lots of different values all of which are NaNs and each of which has only one representation. I don't think the C standard ventures an opinion, except in that it gives implementations flexibility how to represent NaNs as strings, and I sort of presume that two NaNs that yield different strings are different values. – Essence 8/2, 2013 at 14:49

"In IEEE that won't happen for "actual values"" - I'd regard 0 (and likewise -0) as an "actual value" with two distinct bit representations, no? Though they don't have an impact on the behaviour of == anyway. – Mimicry 8/2, 2013 at 15:5

@SteveJessop - as you say, the C standard doesn't impose much in the way of requirements on the internal representation. It's the external semantics that matter. When you divide 0 by 0 you get a NaN, NaN does not compare equal to any floating-point value, etc. It's not particularly important, unless you write (inherently non-portable) code that relies on particular values of NaNs. Short of that, the math processor handles the gory details and produces the "right" semantics. After all, IEEE-754 was largely shaped by the behavior of the Intel 8087 math co-processor and its progeny. – Affra 8/2, 2013 at 15:7

@ChristianRau: I'm not sure, but I think it's two different IEEE values because the results of using them with certain operations are defined to be different. For the purposes of this question, the significant issue is whether under the C standard and IEEE it's permissible for a float negative zero to convert to a double positive zero or vice-versa. – Essence 8/2, 2013 at 15:21

From C99:

6.3.1.5 Real floating types
1 When a float is promoted to double or long double, or a double is promoted to long double, its value is unchanged.
2 When a double is demoted to float, a long double is demoted to double or float, or a value being represented in greater precision and range than required by its semantic type (see 6.3.1.8) is explicitly converted to its semantic type, if the value being converted can be represented exactly in the new type, it is unchanged...

I think this guarantees that a float -> double -> float conversion preserves the original float value.

The standard also defines the macros INFINITY and NAN in 7.12 Mathematics <math.h>:

4 The macro INFINITY expands to a constant expression of type float representing positive or unsigned infinity, if available; else to a positive constant of type float that overflows at translation time.
5 The macro NAN is defined if and only if the implementation supports quiet NaNs for the float type. It expands to a constant expression of type float representing a quiet NaN.

So there is provision for such special values, and the conversions may just work for them as well (including minus infinity and negative zero).
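A quick way to try this for the special values is to use the INFINITY and NAN macros together with the classification functions from <cmath>. This is only a sketch, assuming an implementation that defines NAN and supports signed zeros; through_double and check_special_values are illustrative names.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: send a float through double and back.
float through_double(float f)
{
    return static_cast<float>(static_cast<double>(f));
}

// Check that class and sign survive for the special values.
void check_special_values()
{
    const float pos_inf = through_double(INFINITY);
    const float neg_inf = through_double(-INFINITY);
    const float neg_zero = through_double(-0.0f);
    const float quiet_nan = through_double(NAN);

    assert(std::isinf(pos_inf) && !std::signbit(pos_inf));
    assert(std::isinf(neg_inf) && std::signbit(neg_inf));
    assert(neg_zero == 0.0f && std::signbit(neg_zero));
    assert(std::isnan(quiet_nan));
}
```

std::signbit is used for negative zero because -0.0f == 0.0f is true, so == cannot distinguish the two zeros.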

Liva answered 8/2, 2013 at 13:8 Comment(2)

Nice, but do you interpret this to also mean that NaN and Inf, etc. are preserved? – Usurpation 8/2, 2013 at 13:10

I'd assume so, at least in the context of IEEE-754, which defines both infinities and a NaN for float and double. But a more thorough analysis of the standard wouldn't hurt. – Liva 8/2, 2013 at 13:14

The assertion will fail in flush-to-zero and/or denormals-are-zero mode (e.g. code compiled with -mfpmath=sse, -ffast-math, etc., and also by default on many compilers and architectures, such as Intel's C++ compiler) if f is denormalized.

You cannot produce a denormalized float while in that mode, but the scenario can still arise:

a) Denormalized float comes from external source.

b) Some libraries change the FPU modes and forget (or deliberately decline) to restore them before returning, so the caller can end up running in a different denormal-handling mode than it expects.

A practical example, which prints the following:

f = 5.87747e-39
f2 = 5.87747e-39

f = 5.87747e-39
f2 = 0
error, f != f2!

The example works with both VC2010 and GCC 4.3, but it assumes that VC uses SSE for math by default and that GCC uses the FPU for math by default. Otherwise the example may fail to illustrate the problem.

#include <limits>
#include <iostream>
#include <cmath>

#ifdef _MSC_VER
#include <xmmintrin.h>
#endif

template <class T>
bool normal(T t)
{
    // True for zero and for values at or above the smallest normalized magnitude.
    return (t == 0 || std::fabs(t) >= std::numeric_limits<T>::min());
}

void csr_flush_to_zero()
{
#ifdef _MSC_VER
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
#else
    unsigned csr = __builtin_ia32_stmxcsr();
    csr |= (1 << 15);
    __builtin_ia32_ldmxcsr(csr);
#endif
}

void test_cast(float f) 
{
    std::cout << "f = " << f << "\n";
    double d = double(f);
    float f2 = float(d);
    std::cout << "f2 = " << f2 << "\n";

    if(f != f2)
        std::cout << "error, f != f2!\n";

    std::cout << "\n";
}

int main()
{
    float f = std::numeric_limits<float>::min() / 2.0;

    test_cast(f);
    csr_flush_to_zero();
    test_cast(f);
}
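For scenario b), one defensive option is to save and restore the flush-to-zero setting around code that changes it. The sketch below assumes an x86 target where <xmmintrin.h> is available; ScopedFlushToZero is a hypothetical class name, while _MM_GET_FLUSH_ZERO_MODE and _MM_SET_FLUSH_ZERO_MODE are the SSE control-register helpers from that header. It only covers the flush-to-zero bit, not denormals-are-zero.

```cpp
#include <xmmintrin.h>

// Hypothetical RAII guard: sets a flush-to-zero mode for the current scope
// and restores the previous mode when the scope ends.
class ScopedFlushToZero
{
public:
    explicit ScopedFlushToZero(unsigned int mode)
        : saved_(_MM_GET_FLUSH_ZERO_MODE())
    {
        _MM_SET_FLUSH_ZERO_MODE(mode);
    }

    ~ScopedFlushToZero()
    {
        _MM_SET_FLUSH_ZERO_MODE(saved_);
    }

    ScopedFlushToZero(const ScopedFlushToZero&) = delete;
    ScopedFlushToZero& operator=(const ScopedFlushToZero&) = delete;

private:
    unsigned int saved_;  // mode that was active before the guard was created
};
```

Declaring ScopedFlushToZero guard(_MM_FLUSH_ZERO_ON); at the top of a function keeps the mode change from leaking to the caller. The control register is per thread, so the guard only affects the thread that creates it.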

Populate answered 11/2, 2013 at 23:12 Comment(4)

Interesting find, however it looks like 'flush to zero' mode is not compliant with IEEE. See, for example, infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473c/… – Usurpation 12/2, 2013 at 0:37

It also seems that enabling 'flush to zero' mode would be incompatible with the Standard C++ library from GNU, since it fixes std::numeric_limits<float>::has_denorm to std::denorm_present. It would be interesting to see whether MSVC does the same, or whether it chooses std::denorm_indeterminate – Usurpation 12/2, 2013 at 0:49

All explanations of has_denorm I've found (various compilers/architectures) say it's a compile time constant. And it says denorm_present on VC too. So I guess it says what the hardware supports and not what the current active runtime mode is. – Populate 12/2, 2013 at 2:12

I meant gcc, not gvv. Also, flush-to-zero and denormal-as-zero are not IEEE, but they're still widely used default C++ compiler settings and default hardware modes (they're also named FTZ and DAZ). – Populate 12/2, 2013 at 2:15
