Decimal accuracy of binary floating point numbers

I've found this problem in many interview exams, but don't see how to work out the proper solution myself. The problem is:

How many digits of accuracy can be represented by a floating point number represented by two 16-bit words?

The solution is apparently approximately 6 digits.

Where does this come from, and how would you work it out?

Jubal answered 24/6, 2014 at 1:33 Comment(6)
en.wikipedia.org/wiki/IEEE_754-1985 – Mean
Proposal: Use one sign bit, two exponent bits, and 29 significand bits. This gets you about nine digits of precision. (It also makes 1+1 equal infinity, but you didn't ask for something that wasn't daft!) – Equilibrant
possible duplicate of Why Are Floating Point Numbers Inaccurate? – Chaperone
@Mark: no, not a duplicate. The question asks how to calculate the number of significant digits, which is a valid question, IMO. That does not mean the person asking does not know the reasons why floating point values are inaccurate. On the contrary. – Porte
@tmyklebu: Good one. ;-) I guess most here (me included) only thought of the usual IEEE types. – Porte
While the generic question does not explicitly specify IEEE format, in general one would expect single-precision (32-bit) floating point to have on the order of 6 decimal digits of precision. IBM 360 float, e.g., had 24 bits of fraction, giving you slightly over 7 digits. Of course, as @Equilibrant demonstrates, you can design an "unbalanced" FP format that gives you many more or many fewer, but normal designs will find a "balance" with about 1/4 exponent and 3/4 fraction. – Ostiary

An IEEE-754 floating point number consists of a sign bit, some number of bits (e) for the exponent, and some number of bits (m) for the mantissa, the number that's multiplied by 2 raised to the exponent. The resulting number is of the form

± m × 2^e
eg, 0b1.01 × 2^(-0b0100) = 1.25 × 2^(-4) = 0.078125

which is directly analogous to a real number in (decimal) scientific notation,

± m × 10^e
eg 7.8125 × 10^(-2).

Just as the number of significant digits in decimal is independent of the exponent part, so too in binary floating point the precision is set entirely by the number of bits in the mantissa. The longer the mantissa, the higher the precision the number can represent. For a 32-bit floating point number, the IEEE-754 standard sets the mantissa to 23 bits (plus 1 sign bit and 8 exponent bits); for a 64-bit floating point number, it is 52 bits (plus 1 sign bit and 11 exponent bits).
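
If you want to query these parameters programmatically, a minimal sketch using std::numeric_limits (assuming the platform's float and double are IEEE-754 binary32 and binary64, as they are on essentially all modern systems) looks like this:

    #include <iostream>
    #include <limits>

    int main() {
        // Mantissa width in bits, counting the implicit leading 1: 24 for float, 53 for double
        std::cout << "float mantissa bits:   " << std::numeric_limits<float>::digits   << '\n';
        std::cout << "double mantissa bits:  " << std::numeric_limits<double>::digits  << '\n';

        // Decimal digits guaranteed to survive a round trip through the type: 6 and 15
        std::cout << "float decimal digits:  " << std::numeric_limits<float>::digits10  << '\n';
        std::cout << "double decimal digits: " << std::numeric_limits<double>::digits10 << '\n';
    }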

Scientific notation has another convention: the mantissa must be between 1 (10^0) and 10 (10^1). That greatly simplifies comparisons by making the representation in scientific notation unique - there is only one way of writing a number in scientific notation. That is, 200 is expressed as 2×10^2, not 20×10^1 or 0.2×10^3. Thus one can very quickly compare numbers - any number of the form +x.yz×10^2 is necessarily less than one of the form +a.bc×10^3.

Similarly in binary, the mantissa must be between 1 and 2 (2^0 ... 2^1). Thus the first bit of the mantissa must be 1; since it is known what its value must be, it doesn't need to be explicitly stored, and so you effectively have 24 and 53 bits of mantissa. (For very small numbers - denormalized numbers - the bit is implicitly a 0, not a 1, but the result is the same).
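
You can see those fields directly by reinterpreting a float's bit pattern; here is a rough sketch (assuming a 32-bit IEEE-754 float) that copies the bits into an integer and masks out the sign, the exponent, and the 23 stored mantissa bits:

    #include <cstdint>
    #include <cstring>
    #include <iostream>

    int main() {
        float f = 1.5f;
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);          // reinterpret the float's bit pattern

        std::uint32_t sign     = bits >> 31;          // 1 bit
        std::uint32_t exponent = (bits >> 23) & 0xFF; // 8 bits, stored with a bias of 127
        std::uint32_t mantissa = bits & 0x7FFFFF;     // 23 stored bits; the leading 1 is implicit

        std::cout << "sign=" << sign
                  << " exponent=" << static_cast<int>(exponent) - 127
                  << " mantissa=0x" << std::hex << mantissa << '\n';
        // For 1.5f this prints: sign=0 exponent=0 mantissa=0x400000
        // i.e. 1.5 = +0b1.1 × 2^0, with the leading 1 not stored.
    }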

So as a result, the 24-bit mantissa in a 32-bit floating point number can range from

0b1.00000000000000000000001 ~ 1.00000012
     to
0b1.11111111111111111111111 ~ 1.99999988

that is, the smallest increment you can represent above 1 (or below 2) shows up in the 7th decimal place. Another way of looking at it is to consider numbers near the middle of that range and look at what the spacing is:

0b1.01111111111111111111110 ~ 1.49999976
0b1.01111111111111111111111 ~ 1.49999988
0b1.10000000000000000000000 ~ 1.5
0b1.10000000000000000000001 ~ 1.50000012

so you get spacings of about 1.2×10^(-7) - that is, differences in the seventh decimal place - so you get something less than seven but more than six digits of precision. That spacing varies somewhat through the range of the numbers; also, one is rarely doing just one floating point operation, and these rounding errors propagate, so people usually talk about 6 digits of precision. Note too that while the precision depends only on the size of the mantissa, how that "converts" to errors in decimal digits depends somewhat on the exponent as well; you can see that by taking some of those values and multiplying them by powers of two. But six digits of precision is a good rule of thumb.
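
You can observe that spacing directly with std::nextafter or the machine epsilon; a small sketch (again assuming an IEEE-754 float):

    #include <cmath>
    #include <iomanip>
    #include <iostream>
    #include <limits>

    int main() {
        std::cout << std::setprecision(10);

        // Gap between 1.0 and the next representable float: 2^-23 ~ 1.19e-7
        std::cout << "epsilon:        " << std::numeric_limits<float>::epsilon() << '\n';

        // Gap just above 1.5: also 2^-23, i.e. about 1.2 in the seventh decimal place
        std::cout << "gap above 1.5:  " << std::nextafter(1.5f, 2.0f) - 1.5f << '\n';

        // The absolute gap grows with the exponent: just above 1000 it is 2^(9-23) ~ 6.1e-5
        std::cout << "gap above 1000: " << std::nextafter(1000.0f, 2000.0f) - 1000.0f << '\n';
    }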

The Wikipedia page does a good job of giving an overview of floating point numbers, and the comprehensive reference is Goldberg's What Every Computer Scientist Should Know About Floating-Point Arithmetic.

Chercherbourg answered 24/6, 2014 at 13:4 Comment(0)

It's quite simple: a 32-bit IEEE-754 float has 23+1 bits for the mantissa (AKA significand, in IEEE-speak). The size of the mantissa more or less determines the number of representable digits.

To get the number of significant digits, simply calculate log10(2^24), which is approx. 7.22 (or, if you think that only 23 bits count, as the top bit is fixed anyway, you get log10(2^23), which is approx. 6.92). So in effect, you have about 6-7 significant digits, for normalized values.

The same can be done for 64-bit floating point values (doubles). They have 52 (or 53) bits to store the mantissa, so calculate log10(2^52), which is approx. 15.6 (or 15.9 for 53 bits), which gives you about 15 significant digits.
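
If you'd rather compute those figures than take them on faith, the arithmetic is just bits × log10(2); a trivial sketch:

    #include <cmath>
    #include <iostream>

    int main() {
        // Decimal digits ~ mantissa bits * log10(2)
        std::cout << "float,  23 bits: " << 23 * std::log10(2.0) << '\n';  // ~6.92
        std::cout << "float,  24 bits: " << 24 * std::log10(2.0) << '\n';  // ~7.22
        std::cout << "double, 52 bits: " << 52 * std::log10(2.0) << '\n';  // ~15.65
        std::cout << "double, 53 bits: " << 53 * std::log10(2.0) << '\n';  // ~15.95
    }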

Porte answered 24/6, 2014 at 20:41 Comment(1)
+1 Paradoxically, if you look at C++ std::numeric_limits cplusplus.com/reference/limits/numeric_limits, you'll see that though doubles have less than 16 decimal digits of accuracy, up to 17 decimal digits are required to distinguish their printed decimal representations - max_digits10: "Number of digits (in decimal base) required to ensure that values that differ are always differentiated." – Deponent
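
To illustrate the commenter's point, here is a small sketch (assuming IEEE-754 doubles) that prints digits10 and max_digits10 and shows that printing with max_digits10 significant digits lets a value survive a text round trip:

    #include <iomanip>
    #include <iostream>
    #include <limits>
    #include <sstream>
    #include <string>

    int main() {
        std::cout << "digits10:     " << std::numeric_limits<double>::digits10     << '\n'; // 15
        std::cout << "max_digits10: " << std::numeric_limits<double>::max_digits10 << '\n'; // 17

        // Print 0.1 with 17 significant digits, parse it back, and compare bit-for-bit
        double x = 0.1;
        std::ostringstream os;
        os << std::setprecision(std::numeric_limits<double>::max_digits10) << x;
        double y = std::stod(os.str());
        std::cout << os.str() << (x == y ? " round-trips exactly\n" : " does not round-trip\n");
    }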
