Why does this Java float addition example behave like the mantissa is 24 bits long?

Intro:

With Java floats, I noticed that when you add 1.0 to any of a certain range of tiny negative numbers, the result equals 1.0. I decided to investigate this and learned a lot about how floats work along the way. But I ran into a weird wall. I've found that the bit representations of floats make the math clearest, so I'll be using those.

tl;dr, it seems like the mantissa has 24 bits of precision (not including the leading implicit 1) when adding/subtracting instead of the expected 23. Or so it seems given the math and the code outputs.

When you take 0b1_01100110_00000000000000000000000 (-1×2^-25×1.0) and add 0b0_01111111_00000000000000000000000 (1×2^0×1.0, or the float bits for 1.0), the answer ends up being 1.0. The former is the most negative float I found that gives this strange answer; the next float down from it (Math.nextDown(), which is 0b1_01100110_00000000000000000000001 btw) doesn't. Every number from that one all the way up to -0.0f behaves like this.

The math:

For this special number 0b1_01100110_00000000000000000000000, the exponent -25 is the smaller one, so add 25 to match 1.0's 0, and also shift the mantissa to the right by 25 places. We end up with an exponent of 01111111 and a mantissa of 0.00000000000000000000000[01]. The implicit 1 has moved to the right 25 times so I'm showing that it's now 0. Since it can only be 23 digits long, the portion in the []'s is lost. So the new mantissa is truncated to 0.00000000000000000000000. Now, when we do 1.00000000000000000000000 (1.0's mantissa) minus 0.00000000000000000000000 (the new mantissa of the special number), you simply get 1.0's mantissa (1-0=1). So put it all together and you get 0b0_01111111_00000000000000000000000, which is just 1.0.

This loss of information from precision limitations explains why this special number is treated like nothing. But something strange happens when we try another number.

Enter 0b1_01100111_00000000000000000000000 (-1×2^-24×1.0). This is a very similar number except the exponent is -24 now. Same process when we add 1.0. Add 24 to the exponent so it matches 1.0's and we end up with 01111111. Also shift the mantissa to the right by 24, and we end up with 0.00000000000000000000000[1]. Here, I would expect the 1 at the end to be dropped since there are already 23 0's, but when you actually run the code, it doesn't seem like it is.

If we continue the math without truncating the mantissa, 1.00000000000000000000000[0] - 0.00000000000000000000000[1] = 0.11111111111111111111111[1]. And since the implicit part has to be 1, we shift everything over to the left by 1, giving us 1.11111111111111111111111[0]. We also subtract 1 from the exponent, giving 01111110 or -1. In other words, it's normalized. The result is 0b0_01111110_11111111111111111111111, which is exactly what the code gives.

The question:

Why then does the code behave as if the mantissa is 24 bits long, when it's normally represented with 23?

Some helpful code to visualize things:

// to visualize the float bits
private String floatToBinaryString(float value) {
    String binaryString = String.format("%32s", Integer.toBinaryString(Float.floatToIntBits(value))).replace(' ', '0');
    return "0b" + binaryString.charAt(0) + "_" + binaryString.substring(1, 9) + "_" + binaryString.substring(9);
}

// The 2^-25 number + 1.0 outputs 0b0_01111111_00000000000000000000000 or 1.0
System.out.println(floatToBinaryString(
    Float.intBitsToFloat(0b1_01100110_00000000000000000000000)
        + Float.intBitsToFloat(0b0_01111111_00000000000000000000000)));

// The 2^-24 number + 1.0 outputs 0b0_01111110_11111111111111111111111 or 0.99999994
System.out.println(floatToBinaryString(
    Float.intBitsToFloat(0b1_01100111_00000000000000000000000)
        + Float.intBitsToFloat(0b0_01111111_00000000000000000000000)));
Hagler answered 12/8, 2024 at 18:25

Preliminaries

The preferred term for the fraction portion of a floating-point representation is “significand,” not “mantissa.” “Mantissa” is an old word for the fraction portion of a logarithm. It was adopted for use with floating-point numbers early in their history, but the term used in the IEEE-754 floating-point standard is “significand.” A significand is linear: When you add to the significand it adds to the value represented (as scaled by the exponent). A mantissa is logarithmic: When you add to a mantissa, it multiplies the value represented.

In the IEEE-754 binary32 format, the significand is 24 bits. In the bit string used to represent a floating-point datum, there is a field that is 23 bits. That field is not the significand (or the mantissa). It provides a large part of the significand, but the full significand is provided by combining that 23-bit field with one bit derived from the exponent field. (That bit is 0 if the exponent field is all zeros and 1 otherwise, disregarding NaNs and infinities.) The binary32 format always behaves as if normal numbers have 24 bits. The 23 bits is an artifact of encoding for storage, not of the actual properties of the numbers.
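As a rough illustration of that point, here is a minimal sketch that rebuilds the 24-bit significand from the stored fields. The class and method names (SignificandDemo, significand) are made up for this example, and it assumes a finite input (NaNs and infinities are not handled).

```java
public class SignificandDemo {
    // Returns the full 24-bit significand as an integer, leading bit included.
    // Assumes a finite input; NaN and infinity are not handled in this sketch.
    static int significand(float value) {
        int bits = Float.floatToIntBits(value);
        int exponentField = (bits >>> 23) & 0xFF; // 8-bit biased exponent field
        int fractionField = bits & 0x7FFFFF;      // the 23 stored bits
        // The leading bit comes from the exponent field:
        // 0 when the field is all zeros (zero/subnormal), 1 otherwise.
        int leadingBit = (exponentField == 0) ? 0 : 1;
        return (leadingBit << 23) | fractionField;
    }

    public static void main(String[] args) {
        // 1.0f: a 1 followed by 23 zeros (24 bits in total)
        System.out.println(Integer.toBinaryString(significand(1.0f)));
        // The float just below 1.0f: 24 ones
        System.out.println(Integer.toBinaryString(significand(Math.nextDown(1.0f))));
    }
}
```

Both printed strings are 24 characters long, which is the extra bit the question is seeing.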

Arithmetic in the IEEE-754 format is not performed by using solely 24-bit significands. This means the loss of bits you posit in shifting numbers for addition is incorrect. IEEE-754 specifies arithmetic operations to behave as if the exact real-number-arithmetic result were computed and then that result were rounded to the binary32 format. There is no loss of bits or accuracy inside the computation, only at the production of the final result.
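One way to see the "compute exactly, then round once" model is to do the sum in double and narrow the result to float. This is only a sketch: the helper name addViaDouble is hypothetical, and it relies on the double sum being exact, which is true for the two inputs from the question because 1 − 2^-25 and 1 − 2^-24 both fit easily in double's 53-bit significand.

```java
public class ExactThenRound {
    // Adds two floats by computing the sum in double and rounding once to float.
    // Assumption: the double sum is exact, which holds for the inputs used below.
    static float addViaDouble(float a, float b) {
        double exact = (double) a + (double) b;
        return (float) exact; // narrowing conversion rounds to nearest
    }

    public static void main(String[] args) {
        float one = 1.0f;
        float minusTwoToMinus25 = -0x1p-25f; // -(2^-25)
        float minusTwoToMinus24 = -0x1p-24f; // -(2^-24)

        // The float additions agree with "exact sum, then one rounding".
        System.out.println(one + minusTwoToMinus25 == addViaDouble(one, minusTwoToMinus25)); // true
        System.out.println(one + minusTwoToMinus24 == addViaDouble(one, minusTwoToMinus24)); // true
    }
}
```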

Cases

In the binary32 format, the next representable value below 1 (+1.00000000000000000000000₂ • 2^0) is 1 − 2^-24 (+1.11111111111111111111111₂ • 2^-1). Below, we will need the low bit in the significand of each of these numbers, so observe the lowest bit of 1.00000000000000000000000₂ is 0 and the lowest bit of 1.11111111111111111111111₂ is 1.

When we add 1 and −(2^-25), the real-number result is of course 1 − 2^-25. This is not representable in binary32. The two nearest representable values are those mentioned above, 1 and 1 − 2^-24. IEEE-754 specifies several methods by which the real-number result may be rounded to a representable value. Java uses round-to-nearest, ties-to-even. In this case, 1 and 1 − 2^-24 are equidistant from 1 − 2^-25, so the rule for ties is used. That rule is that the candidate value with the even low bit in its significand is used. 1 has an even low bit, and 1 − 2^-24 does not, so 1 is produced. Thus calculating the sum of 1 and −(2^-25) in binary32 with round-to-nearest, ties-to-even, produces 1.
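The tie can be checked directly. The sketch below (class and helper names are invented for the example) measures the distance from the exact result to each neighbor in double, looks at the low bit of each candidate, and then tries the next float down from −(2^-25), which is no longer a tie.

```java
public class TieToEvenDemo {
    // True when the lowest stored significand bit of the float is 0.
    static boolean hasEvenLowBit(float value) {
        return (Float.floatToIntBits(value) & 1) == 0;
    }

    public static void main(String[] args) {
        float upper = 1.0f;
        float lower = Math.nextDown(1.0f); // 1 - 2^-24
        double exact = 1.0 - 0x1p-25;      // 1 - 2^-25, exact in double

        // Both distances are 2^-25, so the exact result is a tie.
        System.out.println((upper - exact) == (exact - lower)); // true
        System.out.println(hasEvenLowBit(upper)); // true
        System.out.println(hasEvenLowBit(lower)); // false

        // Tie: the candidate with the even low bit, 1.0, is chosen.
        System.out.println(1.0f + -0x1p-25f); // 1.0
        // Slightly larger magnitude: not a tie, so the nearer value 1 - 2^-24 is chosen.
        System.out.println(1.0f + Math.nextDown(-0x1p-25f)); // 0.99999994
    }
}
```

The last line uses the same bit pattern the question mentions, 0b1_01100110_00000000000000000000001, which is why that value stops producing 1.0.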

When a floating-point implementation is performing this addition, it does not lose bits. It will retain whatever extra bits are necessary to produce the result required by the rules above, or it will be otherwise designed to produce the required result.

When we add 1 and −(2^-24), the real-number result is 1 − 2^-24. This is representable in binary32, so it is the produced result, no rounding necessary.
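A quick way to confirm that no rounding happens in this case is to compare the float sum against the same sum done in double. This is a sketch with a hypothetical helper name (sumIsExact), and it assumes the double sum is itself exact, which holds for these inputs.

```java
public class ExactCaseDemo {
    // Reports whether the float sum equals the sum computed in double.
    // Assumption: the double sum is exact, which is true for the inputs below.
    static boolean sumIsExact(float a, float b) {
        return (double) (a + b) == (double) a + (double) b;
    }

    public static void main(String[] args) {
        System.out.println(sumIsExact(1.0f, -0x1p-24f)); // true: 1 - 2^-24 is representable
        System.out.println(sumIsExact(1.0f, -0x1p-25f)); // false: the result had to be rounded
        System.out.println(1.0f + -0x1p-24f == Math.nextDown(1.0f)); // true
    }
}
```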

Okubo answered 12/8, 2024 at 19:17
