Why does the bit-width of the mantissa of a floating point represent twice as many numbers compared to an int?

The format of the double (64 bits) is as follows:

1 bit: sign
11 bits: exponent
52 bits: mantissa

We only need to look at the positive integers we can represent with the mantissa, since the sign bit takes care of the negative integers for us.

Naively, with 52 bits, we can store unsigned integers from 0 to 2^52 - 1. With a sign bit, this lets us store from -2^52 - 1 to 2^52 - 1.

However, we have a little trick we can use. We say that the first digit of our integer is always a 1, which gives us an extra bit to work with.

To see why this works, let's dig a little deeper.

Every positive integer will have at least one 1 in its binary representation. So, we shift the mantissa left or right until we get a 1 at the start, using the exponent. An example might help here:

9, represented as an unsigned int: 000...0001001 (dots representing more 0s).

Written another way: 1.001 * 2^3. (1.001 being in binary, not decimal.)

And, we'll agree to never use a 0 as the first bit. So even though we could write 9 as 0.1001 * 2^4 or 0.01001 * 2^5, we won't. We'll agree that when we write the numbers out in this format, we'll always make sure we use the exponent to "shift" bits over until we start with a 1.

So, the information we need to store to get 9 is as follows:

e: 3
i: 1.001

But if i always starts with 1, why bother writing it out every time? Let's just keep the following instead:

e: 3
i: 001

Using precisely this information, we can reconstruct the number as: 1.i * 2^e == 9.

When we get up to larger numbers, our "i" will be bigger, maybe up to 52 bits used, but we're actually storing 53 bits of information because of the leading 1 we always have.

Final Note: This is not quite what is stored in the exponent and mantissa of a double, I've simplified things to help explain, but hopefully this helps people understand where the missing bit is from. Also, this does not cover 0, which gets a special representation (the trick used above will not work for 0, since the usual representation of 0 doesn't have any 1s in it).

Recommended topics

Hot tags