How is float variable auto-promoted to double type?

I know in C and Java, float's underlying representation is IEEE754-32, double is IEEE754-64.

In expressions, float will be auto-promoted to double. So how? Take 3.7f for example. Is the process like this?

  1. 3.7f will be represented in memory using IEEE754. It fits in 4 bytes.
  2. During calculation, it may be loaded into a 64-bit register (or some other 64-bit location), turning the 3.7f into its IEEE754-64 representation.
Markowitz answered 25/8, 2012 at 2:47 Comment(1)
"I know in C/Java, float point number's underlying represent is IEEE754-32, double point's is IEEE754-64." There is nothing called "float point" or "double point". float and double are two floating-point types in many languages, and they typically map to single precision (a.k.a. binary32) and double precision (a.k.a. binary64) in IEEE-754. There is no such thing as IEEE754-32 or IEEE754-64 either. - Lueluebke
D
5

It is very implementation-dependent.

For example, on the x86 platform the set of FPU commands includes commands for loading and storing data in IEEE754 float and double formats (as well as many other formats). The data is loaded into the internal FPU registers, which are 80 bits wide. So in reality, on x86 all floating-point calculations are performed with 80-bit floating-point precision, i.e. all floating-point data is actually promoted to 80-bit precision. How the data is represented inside those registers is completely irrelevant, since you cannot observe them directly anyway.

This means that on x86 platform there's no such thing as a single-step float-to-double conversion. Whenever a need for such conversion arises, it is actually implemented as two-step conversion: float-to-internal-fpu and internal-fpu-to-double.

This, by the way, creates a significant semantic difference between the x86 FPU computation model and the C/C++ computation models. In order to fully match the language model, the processor has to forcibly reduce the precision of intermediate floating-point results, which hurts performance. Many compilers provide options that control the FPU computation model, allowing the user to opt for strict C/C++ conformance, better performance, or something in between.
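One way to see which evaluation model a given compiler and set of options has selected is the C99 macro FLT_EVAL_METHOD from <float.h>. The following is only a minimal sketch: the helper names describe_eval_method and current_eval_method are made up for illustration, and the descriptions paraphrase the meanings C99 assigns to the values -1, 0, 1 and 2.

```c
#include <float.h>

/* Hypothetical helper: maps a FLT_EVAL_METHOD value to a short description.
   The meanings of -1, 0, 1 and 2 are the ones defined by C99. */
static const char *describe_eval_method(int method)
{
    switch (method) {
    case 0:  return "each operation is evaluated in the precision of its own type";
    case 1:  return "float and double operations are evaluated as double";
    case 2:  return "all operations are evaluated as long double";
    case -1: return "indeterminable";
    default: return "implementation-defined";
    }
}

/* Reports the model chosen by the current compiler and options. */
static const char *current_eval_method(void)
{
    return describe_eval_method(FLT_EVAL_METHOD);
}
```

Calling puts(current_eval_method()) from main (with <stdio.h> included) prints the result. An x86-64 build that uses SSE typically reports method 0, while a 32-bit build that uses the x87 FPU typically reports method 2.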

Not so many years ago the FPU was an optional component of the x86 platform. Floating-point computations on FPU-less platforms were performed in software, either by emulating the FPU or by generating code without any FPU instructions at all. Such implementations could work differently, for example by performing a software conversion from IEEE754 float to IEEE754 double directly.
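To illustrate what such a direct software conversion could look like, here is a rough sketch that widens the bit pattern of a single-precision value into the bit pattern of the equal double-precision value. It assumes that float is IEEE 754 binary32, that double is binary64, and that floating-point and integer types share the same byte order. The function names widen_bits and soft_float_to_double are invented for this example and are not taken from any real emulation library.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: converts binary32 bits to binary64 bits.
   Layouts: binary32 = 1 sign, 8 exponent (bias 127), 23 fraction bits;
            binary64 = 1 sign, 11 exponent (bias 1023), 52 fraction bits. */
static uint64_t widen_bits(uint32_t f)
{
    uint64_t sign = (uint64_t)(f >> 31) << 63;
    uint32_t exp  = (f >> 23) & 0xFF;
    uint64_t frac = f & 0x7FFFFF;

    if (exp == 0xFF)                       /* infinity or NaN */
        return sign | (0x7FFULL << 52) | (frac << 29);

    if (exp == 0) {
        if (frac == 0)                     /* signed zero */
            return sign;
        /* Subnormal float: shift until the leading 1 reaches bit 23,
           adjusting the exponent; the result is a normal double. */
        int e = -126;
        while (!(frac & 0x800000)) {
            frac <<= 1;
            e--;
        }
        frac &= 0x7FFFFF;                  /* drop the now-implicit leading 1 */
        return sign | ((uint64_t)(e + 1023) << 52) | (frac << 29);
    }

    /* Normal number: rebias the exponent and pad the fraction with zeros. */
    return sign | ((uint64_t)(exp - 127 + 1023) << 52) | (frac << 29);
}

/* Wrapper that goes through memcpy to avoid aliasing problems. */
static double soft_float_to_double(float x)
{
    uint32_t in;
    uint64_t out;
    double d;
    memcpy(&in, &x, sizeof in);
    out = widen_bits(in);
    memcpy(&d, &out, sizeof d);
    return d;
}
```

Because every binary32 value is exactly representable in binary64, no rounding is needed: the fraction is simply padded with 29 zero bits. This is also why (double)3.7f is not equal to the double constant 3.7; the widened value keeps the rounding error that 3.7f already had.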

Domineering answered 25/8, 2012 at 3:0 Comment(4)
So where and when does the IEEE754 format conversion take place, since you said the FPU uses an 80-bit representation, not IEEE754? - Markowitz
@larmbr: I'm not sure I understand your question. On modern x86 the conversions are implemented inside the CPU/FPU. The FPU commands can read IEEE data from memory into 80-bit registers and store it back to memory. Whatever conversion-related steps are necessary for this are implemented inside the CPU/FPU as hardware and/or microcode. - Domineering
"This means that on x86 platform there's no such thing as a single-step float-to-double conversion." That depends on which version of x86: modern x86 chips have two separate sets of floating-point instructions, the old-school x87 instructions, which behave as you describe, and the SSE instructions, which work with single- and double-precision floating point directly. - Pedrick
plugwash is correct that both x87 and SSE implementations are possible outcomes. In addition, under (almost?) all versions of Windows the x87 is set up in such a way that it behaves as though the internal register were only 64 bits wide (53-bit significand, as I recall). - Monied
L
1

I know in C and Java, float's underlying representation is IEEE754-32, double is IEEE754-64.

Wrong! The C standard has never specified a fixed format or specific size limits for the integer and floating-point types, although it does guarantee the relation between types:

1 == sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long)
sizeof(float) <= sizeof(double) <= sizeof(long double)

C implementations are allowed to use any floating-point format, although most nowadays use IEEE-754. Likewise, they can freely use any of the 3 allowed integer representations, including 1's complement and sign-magnitude (C23 has since restricted this to two's complement).
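Since the format is up to the implementation, a program that depends on it can check at build time or run time. Here is a small sketch using only the standard <float.h> macros; the function names are made up for this example, and matching the binary32/binary64 parameters does not by itself prove full IEEE-754 conformance.

```c
#include <float.h>
#include <stdbool.h>

/* Hypothetical check: true when float and double have the radix, precision
   and maximum exponent of IEEE 754 binary32 and binary64. */
static bool looks_like_binary32_and_binary64(void)
{
    return FLT_RADIX == 2
        && FLT_MANT_DIG == 24 && FLT_MAX_EXP == 128
        && DBL_MANT_DIG == 53 && DBL_MAX_EXP == 1024;
}

/* The size relation quoted above, written out as a run-time check. */
static bool sizes_are_ordered(void)
{
    return sizeof(float) <= sizeof(double)
        && sizeof(double) <= sizeof(long double);
}
```

On mainstream desktop and server compilers both functions are expected to return true, but that is a property of those implementations rather than a promise of the language.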


How is float variable auto-promoted to double type?

It's never promoted. Pre-standard versions of C promoted floats in expressions to double, but in C89/90 the rule was changed and float * float yields a float result:

If either operand has type long double, the other operand is converted to long double.
Otherwise, if either operand is double, the other operand is converted to double.
Otherwise, if either operand is float, the other operand is converted to float.

https://mcmap.net/q/125452/-implicit-type-conversion-rules-in-c-operators
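These rules can be checked with the compiler itself. The sketch below uses C11 _Generic to name the type of an arithmetic expression; the TYPE_NAME macro and the three functions are hypothetical names used only for this illustration. Note that this reports the type of the expression, which stays float even on implementations that evaluate it with extra precision internally.

```c
/* Hypothetical macro: yields the name of the type of an expression (C11).
   The operand of _Generic is not evaluated, only its type is inspected. */
#define TYPE_NAME(expr) _Generic((expr), \
    float: "float",                      \
    double: "double",                    \
    long double: "long double",          \
    default: "other")

static const char *float_times_float(void)
{
    float a = 1.5f, b = 2.0f;
    return TYPE_NAME(a * b);      /* both operands float: result is float */
}

static const char *float_times_double(void)
{
    float a = 1.5f;
    double b = 2.0;
    return TYPE_NAME(a * b);      /* a is converted to double */
}

static const char *float_times_long_double(void)
{
    float a = 1.5f;
    long double b = 2.0L;
    return TYPE_NAME(a * b);      /* a is converted to long double */
}
```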

It would be true in Java or C# though, since they run bytecode in a virtual machine, and the VM's types are consistent across platforms.

Lueluebke answered 19/12, 2013 at 14:42 Comment(0)
M
1

Excellent answer by phuclv. I'd like to add a few things.

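In C, the conversions are applied operator by operator, following the grouping determined by precedence and associativity, not simply left to right across the whole expression. If you cast one operand to double, the other operand of that same operator is converted to double before the operation is evaluated.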

So for instance:

float a,b,c,d;

// In the following, a is first cast to a double, forcing b to be converted to a double
// This is true for all arithmetic operators (* / + -); % does not accept floating-point operands
c=(double)a * b;

// However, in the following the order of operations can produce surprising results
d=(double)a + b * c;
// There are implied parentheses around the multiplication, so an equivalent expression is
d=(double)a + (b * c);
// b * c is computed in float; only its result is converted to double for the addition
// You may need to cast twice to get your desired results
d=(double)a + ((double)b * c);
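
The difference is observable, not just theoretical. The following sketch picks an input whose exact square does not fit in a float, assuming IEEE 754 binary32/binary64 and the default round-to-nearest mode. The function names are made up for this example, and the volatile temporary is there only to make sure the float product really is rounded to float, even on implementations that keep extra precision in registers.

```c
/* Hypothetical helper: multiplies in float, then widens the rounded result. */
static double product_in_float(float b, float c)
{
    volatile float p = b * c;   /* rounded to float precision here */
    return (double)p;
}

/* Hypothetical helper: widens first, so the multiplication is done in double. */
static double product_in_double(float b, float c)
{
    return (double)b * c;       /* c is converted to double as well */
}

/* 1 + 2^-12 is exactly representable as a float. Its exact square is
   1 + 2^-11 + 2^-24; the last term fits in a double but is lost when
   the product is rounded to the 24-bit significand of a float. */
static const float example_input = 1.000244140625f;
```

With this input the two functions return values that differ by 2^-24, which is the part of the product that a single cast placed on the wrong operand silently throws away.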

And to add some details to the definitions of float, double and long double: in all implementations that I know of, float is indeed 32 bits, double is 64 bits, and both are IEEE754 formats. 'long double', however, is not only implementation-dependent but might also be hardware-dependent.

Under Visual Studio / MSC, a long double might be either 80 or 64 bits. If the x87 math coprocessor is being used, then 80 bits are used to store the x87 internal register value. If SSE registers are being used, Microsoft took a shortcut and made 'double' and 'long double' identical for all practical purposes. Where phuclv says 'sizeof(double) <= sizeof(long double)', this is a case where the two are the same size, and that is completely legal. And to further confuse things, under Windows the x87 is configured in such a way that only 64 bits are used rather than the full 80 bits available, so while the long double might specify 80 bits, 16 of them will usually be meaningless.

Using the GNU compiler it is possible to use a 128-bit floating-point type, but even there it is not what 'long double' means on x86, where long double is the 80-bit x87 extended format. The 128-bit type is a separate, special type (__float128).
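
Rather than guessing what long double is on a given toolchain, you can ask <float.h>. This is a minimal sketch, and the function name is invented for illustration:

```c
#include <float.h>

/* Hypothetical helper: how many more significand bits long double carries
   compared with double on this implementation. */
static int long_double_extra_bits(void)
{
    return LDBL_MANT_DIG - DBL_MANT_DIG;
}
```

A result of 0 means long double has the same precision as double (as with current Microsoft compilers), 11 corresponds to the 80-bit x87 extended format with its 64-bit significand, and 60 corresponds to an IEEE 754 binary128 long double.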

Monied answered 6/9, 2023 at 0:4 Comment(0)
