Understanding casts from integer to float

Asked 13/5, 2018 at 18:48 Answered 13/5, 2018 at 21:7

Solved c floating-point int precision floating-point-conversion

Could someone explain this weird looking output on a 32 bit machine?

#include <stdio.h>

int main() {
  printf("16777217 as float is %.1f\n",(float)16777217);
  printf("16777219 as float is %.1f\n",(float)16777219);

  return 0;
}

Output

16777217 as float is 16777216.0
16777219 as float is 16777220.0

The weird thing is that 16777217 casts to a lower value and 16777219 casts to a higher value...

Onomatopoeia answered 13/5, 2018 at 18:48 Comment(15)

I assume since you've picked very specific numbers that you know a float only has 24 bits used to store an exact number. Beyond that you're limited to what can be stored as exact binary using the exponent + fraction bits. If you need to store a larger number you need to accept the precision loss. – Jetblack 13/5, 2018 at 18:54

@Yunnosch not really IMO. – Frankenstein 13/5, 2018 at 18:59

@WeatherVane Your link is admittedly much better. My point is that knowing the mechanisms of a float type makes the question unneeded; i.e. one cast up and one cast down is not surprising at all. – Tympanist 13/5, 2018 at 19:0

@Tympanist it's not about a certain set of discrete fractional values that can be exactly stored, in this case. – Frankenstein 13/5, 2018 at 19:1

possible duplicate #588504 – Blackpool 13/5, 2018 at 19:11

@Weather Vane I understand that it won't be exactly represented, but why 16777217 is being cast to 16777216 and not to 16777218 ? – Onomatopoeia 13/5, 2018 at 19:14

There is a very good answer which explains that. – Frankenstein 13/5, 2018 at 19:15

@JacekCz not that doesn't work for every floating point question. – Boony 13/5, 2018 at 19:17

possible duplicate stackoverflow.com/questions/23420783 – Hatcher 13/5, 2018 at 19:26

@Hatcher good find. I'm not sure I want to hammer that one, though. – Boony 13/5, 2018 at 19:29

@Jean-FrançoisFabre Yup, I could hammer it myself, but I'd rather allow others to vote on the matter. – Hatcher 13/5, 2018 at 19:31

It's definitely related, but doesn't directly answer OP question. – Boony 13/5, 2018 at 19:32

@Hatcher JFF is correct, that also has undefined behaviour in it. – Malka 13/5, 2018 at 19:33

ok, so if 5 people don't use their hammer, it means that the question is good & new, then :) – Boony 13/5, 2018 at 19:35

@Jean-FrançoisFabre it doesn't mean that. It is just that no one has presented a viable dupe. – Malka 13/5, 2018 at 19:50

In the IEEE-754 basic 32-bit binary floating-point format, all integers from −16,777,216 to +16,777,216 are representable. From 16,777,216 to 33,554,432, only even integers are representable. Then, from 33,554,432 to 67,108,864, only multiples of four are representable. (Since the question does not necessitate discussion of which numbers are representable, I will omit explanation and just take this for granted.)

The most common default rounding mode is to round the exact mathematical result to the nearest representable value and, in case of a tie, to round to the representable value which has zero in the low bit of its significand.

16,777,217 is equidistant between the two representable values 16,777,216 and 16,777,218. These values are represented as 100000000000000000000000₂•2¹ and 100000000000000000000001₂•2¹. The former has 0 in the low bit of its significand, so it is chosen as the result.

16,777,219 is equidistant between the two representable values 16,777,218 and 16,777,220. These values are represented as 100000000000000000000001₂•2¹ and 100000000000000000000010₂•2¹. The latter has 0 in the low bit of its significand, so it is chosen as the result.

Zak answered 13/5, 2018 at 19:7 Comment(0)

You may have heard of the concept of "precision", as in "this fractional representation has 3 digits of precision".

This is very easy to think about in a fixed-point representation. If I have, say, three digits of precision past the decimal, then I can exactly represent 1/2 = 0.5, and I can exactly represent 1/4 = 0.25, and I can exactly represent 1/8 = 0.125, but if I try to represent 1/16, I can not get 0.0625; I will either have to settle for 0.062 or 0.063.

But that's for fixed-point. The computer you're using uses floating-point, which is a lot like scientific notation. You get a certain number of significant digits total, not just digits to the right of the decimal point. For example, if you have 3 decimal digits worth of precision in a floating-point format, you can represent 0.123 but not 0.1234, and you can represent 0.0123 and 0.00123, but not 0.01234 or 0.001234. And if you have digits to the left of the decimal point, those take away from the number you can use to the right of the decimal point. You can use 1.23 but not 1.234, and 12.3 but not 12.34, and 123.0 but not 123.4 or 123.anythingelse.

And -- you can probably see the pattern by now -- if you're using a floating-point format with only three significant digits, you can't represent every number greater than 999 exactly, even though they don't have a fractional part. You can represent 1230 but not 1234, and 12300 but not 12340.

So that's decimal floating-point formats. Your computer, on the other hand, uses a binary floating-point format, which ends up being somewhat trickier to think about. We don't have an exact number of decimal digits' worth of precision, and the numbers that can't be exactly represented don't end up being nice even multiples of 10 or 100.

In particular, type float on most machines has 24 binary bits worth of precision, which works out to 6-7 decimal digits' worth of precision. That's obviously not enough for numbers like 16777217.

So where did the numbers 16777216 and 16777220 come from? As Eric Postpischil has already explained, it ends up being because they're multiples of 2. If we look at the binary representations of nearby numbers, the pattern becomes clear:

16777208     111111111111111111111000
16777209     111111111111111111111001
16777210     111111111111111111111010
16777211     111111111111111111111011
16777212     111111111111111111111100
16777213     111111111111111111111101
16777214     111111111111111111111110
16777215     111111111111111111111111
16777216    1000000000000000000000000
16777218    1000000000000000000000010
16777220    1000000000000000000000100

16777215 is the biggest number that can be represented exactly in 24 bits. After that, you can represent only even numbers, because the low-order bit is the 25th, and essentially has to be 0.

Sparkman answered 13/5, 2018 at 19:13 Comment(0)

Type float cannot hold that much significance. The significand can only hold 24 bits. Of those 23 are stored and the 24th is 1 and not stored, because the significand is normalised.

Please read this which says "Integers in [ − 16777216 , 16777216 ] can be exactly represented", but yours are out of that range.

Frankenstein answered 13/5, 2018 at 18:57 Comment(1)

This does not explain why the results are 16,777,216 and 16,777,220 rather than 16,777,218 and 16,777,220 or any other numbers. – Zak 13/5, 2018 at 19:2

Floating-point representation follows a method similar to one we use in everyday life, which we call exponential notation. A number is written using as many digits as we decide are sufficient to represent the value realistically (we call this the mantissa, or significand), multiplied by a base, or radix, raised to a power that we call the exponent. In plain words:

num*base^exp

We generally use 10 as the base, because we have 10 fingers on our hands, so we are used to numbers like 1e2, which is 100=1*10^2.

Of course we are reluctant to use exponential notation for such small numbers, but we prefer it when dealing with very large numbers or, better, when a limited number of digits is enough to represent the quantity we are describing.

The right number of digits could be how many we can handle in our head, or how many an engineering application requires. Once we have decided how many digits we need, we no longer care how closely the numeric representation follows the real value. For example, for a number like 123456.789e5 it is understood that if we add 99 units we can tolerate the rounded representation and still consider it acceptable; if not, we should switch to a representation with an appropriate number of digits, such as 12345678900.

On a computer, when you have to handle very large numbers that don't fit in a standard integer, or when you have to represent a real number (with a fractional part), the right choice is a floating-point or double-precision floating-point representation. It uses the same layout we discussed above, but the base is 2 instead of 10, because a computer has only 2 fingers, the states 0 and 1. So the formula we used before, to represent 100, becomes:

1100100*2^0

That still isn't the real floating-point representation, but it gives the idea. Now consider that in a computer the floating-point format is standardized, and a standard float, as per IEEE-754, uses the following memory layout (we will see later why 1 more bit is assumed for the mantissa): 23 bits for the mantissa, 1 bit for the sign and 8 bits for the exponent, stored with a bias of 127 (that simply means the actual exponent ranges between -126 and +127 without needing its own sign bit, with the stored values 0x00 and 0xff reserved for special meanings).

Now consider using 0 as the exponent: the value 2^exponent=2^0=1 multiplied by the mantissa then behaves just like a 23-bit integer. This implies that when incrementing a counter as in:

float f = 0;
while(1)
{
    f +=1;
    printf ("%f\n", f);
}

You will see the printed value increase linearly by one until the mantissa bits are saturated and the exponent has to start growing.

If the base, or radix, of our floating-point number were 10, we would see an increase every 10 loops for the first 100 (10^2) values, then an increase of 100 for the next 1000 (10^3) values, and so on. You can see that this corresponds to the truncation we have to make because of the limited number of available digits.

The same phenomenon is observed with the binary base, except that the changes happen at power-of-2 intervals.

What we have discussed so far is called the denormalized form of a floating-point number; what is normally used is its normalized counterpart. The latter simply means that there is a 24th bit, not stored, that is always 1. In plain words, we would not use an exponent of 0 for a number less than 2^23; instead we shift it left (multiply by 2) until its most significant 1 bit reaches the 24th bit position, then adjust the exponent to a negative value that makes the conversion shift the number back to its original value.

Remember the reserved values of the exponent we mentioned above? An exponent of 0x00 means that we have a denormalized number (or zero). An exponent of 0xff indicates +/-infinity if the mantissa is 0, or a NaN (not-a-number) otherwise.

It should be clear now that when the number we express goes beyond the 24 bits of the significand (mantissa), we should expect an approximation of the real value that depends on how far beyond 2^24 we are.

Now the numbers you are using are just on the edge of 2^24=16,777,216:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|0|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1| = 16,277,215
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 s\______ _______/\_____________________ _______________________/
 i       v                              v
 g   exponent                        mantissa
 n

Now increasing by 1 we have:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0| = 16,277,216
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 s\__ exponent __/\_________________ mantissa __________________/

Note that the carry has set the next bit to 1, but from now on we are above the 24-bit representation, and each further representable value comes in steps of 2^1=2. We simply advance by 2, so only even numbers (multiples of 2^1=2) can be represented. For example, setting the least significant bit to 1 we have:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1| = 16,277,218
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 s\__ exponent __/\_________________ mantissa __________________/

Increasing again:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0| = 16,277,220
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 s\__ exponent __/\_________________ mantissa __________________/

As you can see, we cannot exactly represent 16,777,219. In your code:

// This prints 16777216.0: 16777217 lies exactly halfway between the
// representable neighbours 16777216 and 16777218 (the step here is 2^1),
// and the tie goes to the one whose significand ends in 0
printf("16777217 as float is %.1f\n",(float)16777217);
// This prints 16777220.0: 16777219 lies exactly halfway between
// 16777218 and 16777220, and again the tie goes to the one whose
// significand ends in 0
printf("16777219 as float is %.1f\n",(float)16777219);

As said above, the choice of numeric format must be appropriate for its use: a floating-point value is only an approximate representation of a real number, and it is definitely our duty to choose the right type carefully.

If we need more precision we could use a double, or an integer type such as long long int.

Just for the sake of completeness I would add a few words on the approximate representation of fractional numbers such as 0.1. These numbers cannot be written as a finite sum of powers of 2, so their representation in float format is never exact, and they need to be rounded to the intended value when converted back to decimal.


Doddered answered 13/5, 2018 at 21:7 Comment(0)
