Why are floating point numbers inaccurate?

Asked 20/2, 2014 at 0:39 Answered 20/2, 2014 at 0:39

floating-point language-agnostic precision

260

Why do some numbers lose accuracy when stored as floating point numbers?

For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:

32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875

How can such an apparently simple number be "too big" to express in 64 bits of memory?

Hogle answered 20/2, 2014 at 0:39 Comment(3)

Discussion of this post on Meta – Obellia 12/6, 2014 at 18:36

Refer to is floating math broken – Zippy 23/5, 2017 at 9:12

9.2 is not "too big" to express in 64 bits of memory, it's just simply not exactly equal to any of the 1022*2^52 = 4602678819172646912 predefined values that a binary64 value is allowed to be, so it gets rounded to the nearest one. – Linebreeding 14/9, 2022 at 2:17

320

In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:

5179139571476070 * 2^-49

Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.

9.2 may be simply 92/10, but 10 cannot be expressed as 2ⁿ if n is limited to integer values.

Seeing the Data

First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):

def float_to_bin_parts(number, bits=64):
    if bits == 32:          # single precision
        int_pack      = 'I'
        float_pack    = 'f'
        exponent_bits = 8
        mantissa_bits = 23
        exponent_bias = 127
    elif bits == 64:        # double precision. all python floats are this
        int_pack      = 'Q'
        float_pack    = 'd'
        exponent_bits = 11
        mantissa_bits = 52
        exponent_bias = 1023
    else:
        raise ValueError, 'bits argument must be 32 or 64'
    bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
    return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]

There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.

Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.

When we call that function with our example, 9.2, here's what we get:

>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']

Interpreting the Data

You'll see I've split the return value into three components. These components are:

Sign
Exponent
Mantissa (also called Significand, or Fraction)

Sign

The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.

Exponent

The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2^{(# of bits) - 1} - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).

Mantissa

The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:

6.0221413x10²³

The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:

1.0010011001100110011001100110011001100110011001100110

This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.

When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 2⁵² to move it 52 places to the right:

0.0010011001100110011001100110011001100110011001100110

In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in decimal, but only approximately in binary; for more detail, see: 675539944105574 / 4503599627370496.)

Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.

Recapping the Components

Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2^{(# of bits) - 1} - 1 to get the true exponent
Mantissa (last component): Divide by 2^{(# of bits)} and add 1 to get the true mantissa

Calculating the Number

Putting all three parts together, we're given this binary number:

1.0010011001100110011001100110011001100110011001100110 x 10¹¹

Which we can then convert from binary to decimal:

1.1499999999999999 x 2³ (inexact!)

And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:

9.1999999999999993

Representing as a Fraction

9.2

Now that we've built the number, it's possible to reconstruct it into a simple fraction:

1.0010011001100110011001100110011001100110011001100110 x 10¹¹

Shift mantissa to a whole number:

10010011001100110011001100110011001100110011001100110 x 10^11-110100

Convert to decimal:

5179139571476070 x 2^3-52

Subtract the exponent:

5179139571476070 x 2^-49

Turn negative exponent into division:

5179139571476070 / 2⁴⁹

Multiply exponent:

5179139571476070 / 562949953421312

Which equals:

9.1999999999999993

9.5

>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']

Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.

Assemble the binary scientific notation:

1.0011 x 10¹¹

Shift the decimal point:

10011 x 10^11-100

Subtract the exponent:

10011 x 10^-1

Binary to decimal:

19 x 2^-1

Negative exponent to division:

19 / 2¹

Multiply exponent:

19 / 2

Equals:

9.5

The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary

Hogle answered 20/2, 2014 at 0:39 Comment(7)

There is also a nice tutorial that shows how to go the other way - given a decimal representation of a number, how do you construct the floating point equivalent. The "long division" approach shows very clearly how you end up with a "remainder" after trying to represent the number. Should be added if you want to be truly "canonical" with your answer. – Jadotville 20/2, 2014 at 0:53

If you're talking about Python and floating-point, I'd suggest at least including the Python tutorial in your links: docs.python.org/3.4/tutorial/floatingpoint.html That's supposed to be the one-stop go-to resource for floating-point issues for Python programmers. If it's lacking in some way (and it almost surely is), please do open an issue on the Python bug tracker for updates or changes. – Autosome 20/2, 2014 at 8:27

@Hogle If this gets turned into community wiki, feel free to incorporate my answer into yours. – Alleluia 20/2, 2014 at 19:8

This answer should definitely also link to floating-point-gui.de, as it's probably the best introduction for beginners. IMO, it should even go above "What every computer scientist should know..." -- these days, people who can reasonably comprehend Goldberg's paper usually are already well aware of it. – Sheikh 13/6, 2014 at 18:39

One more further reading: #588504 – Primula 14/3, 2015 at 12:28

"This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal". This is not true. All of these 'number over a power of two' ratios are exact in decimal. Any approximation is only to shorten the decimal number -- for convenience. – Alarm 22/12, 2015 at 14:19

Every number you can represent in binary using a finite amount of digits can also be represented in decimal using a finite number of digits. (comment limits too short for me to make a proof). "This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal" this has to be false. Maybe you mean the other way around? – Colliery 26/6, 2023 at 13:4

This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.

Consider the fraction 2/3

In good-ol' base 10, we typically write it out as something like

0.666...
0.666
0.667

When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)

Number bases

So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then

(2/3)₁₀ = 0.2₃

In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.

To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.

(1/2)₁₀ = 0.5₁₀ = 0.1₂ = 0.1111...₃

Why are floating point numbers inaccurate?

Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.

Alleluia answered 20/2, 2014 at 0:39 Comment(7)

So in other words, base-3 would be perfect for 1/3 just as base-10 is perfect for 1/10. Neither fraction works in base-2 – Hogle 20/2, 2014 at 1:19

@Hogle Yes. And in general, base-N is perfect for any fraction whose denominator is N or a multiple thereof. – Alleluia 20/2, 2014 at 1:20

And this is one reason why some numerical tool boxes keep track of "what was divided by what", and in the process can keep "infinite accuracy" for all rational numbers. Just like physicists like to keep their equations symbolic until the last possible moment, in case factors of π etc cancel out. – Jadotville 20/2, 2014 at 1:39

@Jadotville I've also seen cases where an algorithm that only performs basic arithmetic (ie, preserves rationality of input), determine if the input was (likely) rational, perform the math using normal floating point arithmetic, then re-estimate a rational approximation at the end to fix any rounding errors. In particular Matlab's reduced row echelon form algorithm does this, and it help numerical stability tremendously. – Alleluia 20/2, 2014 at 1:54

@SchighSchagh - interesting, I didn't know that. I do know that numerical stability is something that is not taught sufficiently in these days of double double precision. Which means that many miss learning about the elegance of many beautiful algorithms. I really like algorithms that compute and correct their own errors. – Jadotville 20/2, 2014 at 2:4

Nitpick: integers are rational (with denominator 1) and can be represented exactly in any base. But yes, interesting point about non-integer rational numbers. – Extensive 19/3, 2020 at 0:26

Note: base N is perfect not just for any fraction whose denominator is N or a multiple thereof, but for any fraction whose denominator's prime factors are all prime factors of N. – Cruickshank 13/4, 2021 at 21:36

While all of the other answers are good there is still one thing missing:

It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!

And that actually is why they are called irrational. No amount of bit storage in the world would be enough to hold even one of them. Only symbolic arithmetic is able to preserve their precision.

Although if you would limit your math needs to rational numbers only the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions just like in highschool math (e.g. a/b * c/d = ac/bd).

But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.

TL;DR

For hardware accelerated arithmetic only a limited amount of rational numbers can be represented. Every not-representable number is approximated. Some numbers (i.e. irrational) can never be represented no matter the system.

Cribbing answered 20/2, 2014 at 0:39 Comment(4)

Interestingly, irrational bases do exist. Phinary, for example. – Superman 12/6, 2014 at 16:18

irrational numbers can be (only) represented in their base. For example pi is 10 in base pi – Primula 24/6, 2014 at 14:21

Point remains valid: Some numbers can never be represented no matter the system. You don't gain anything by changing your base because then some other numbers can not be represented anymore. – Surgical 27/6, 2014 at 4:10

All constructible real numbers* can be represented exactly given an appropriate base; the choice of base is in fact infinite for any particular number. Eg, pi is 10 in base-pi, and it is 100 in base-sqrt(pi). In general, x is 10 in base-x, and it is 100 in base-x^(1/2), 1000 in base-x^(1/3), etc. *Non-constructible reals, if you allow for them via your choice of axioms, uhhh yeah shit gets real weird and nobody cares about digits anymore anyway. Regardless of all this, these esoteric bases are not really useful; and there are always irrational numbers regardless of your choice of base. – Alleluia 5/8, 2021 at 21:8

There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).

The floating-point representation is a finite one (like anything in a computer) so unavoidably many many many numbers are impossible to represent. In particular, 64 bits only allow you to distinguish among only 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can are of the form m.2^e for some integers m and e.

You might come up with a different numeration system, 10 based for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.

Also note that double-precision floating-points numbers are extremely accurate. They can represent any number in a very wide range with as much as 15 exact digits. For daily life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.

Powwow answered 20/2, 2014 at 0:39 Comment(0)

Why can we not represent 9.2 in binary floating point?

Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.

A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in it's lowest terms) are factors of the base.

The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2^b5^c).

On the other hand the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2^b)

Why do computers use this representation?

Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.

It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.

However the big problem is that such a format is a pain to do calculations on. For two reasons.

If you want to have exactly one representation of each number then after each calculation you need to reduce the fraction to it's lowest terms. That means that for every operation you basically need to do a greatest common divisor calculation.
If after your calculation you end up with an unrepresentable result because the numerator or denominator you need to find the closest representable result. This is non-trivil.

Some Languages do offer fraction types, but usually they do it in combination with arbitary precision, this avoids needing to worry about approximating fractions but it creates it's own problem, when a number passes through a large number of calculation steps the size of the denominator and hence the storage needed for the fraction can explode.

Some languages also offer decimal floating point types, these are mainly used in scenarios where it is imporant that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). These are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.

Cowpea answered 20/2, 2014 at 0:39 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Seeing the Data

Interpreting the Data

Sign

Exponent

Mantissa

Recapping the Components

Calculating the Number

Representing as a Fraction

9.2

9.5

Further reading

Consider the fraction 2/3

Number bases

Why are floating point numbers inaccurate?

Recommended topics

Hot tags