Printing double without losing precision
Asked Answered
R

8

36

How do you print a double to a stream so that when it is read in you don't lose precision?

I tried:

std::stringstream ss;

double v = 0.1 * 0.1;
ss << std::setprecision(std::numeric_limits<T>::digits10) << v << " ";

double u;
ss >> u;
std::cout << "precision " << ((u == v) ? "retained" : "lost") << std::endl;

This did not work as I expected.

But I can increase precision (which surprised me as I thought that digits10 was the maximum required).

ss << std::setprecision(std::numeric_limits<T>::digits10 + 2) << v << " ";
                                                 //    ^^^^^^ +2

It has to do with the number of significant digits and the first two don't count in (0.01).

So has anybody looked at representing floating point numbers exactly? What is the exact magical incantation on the stream I need to do?

After some experimentation:

The trouble was with my original version. There were non-significant digits in the string after the decimal point that affected the accuracy.

So to compensate for this we can use scientific notation to compensate:

ss << std::scientific
   << std::setprecision(std::numeric_limits<double>::digits10 + 1)
   << v;

This still does not explain the need for the +1 though.

Also if I print out the number with more precision I get more precision printed out!

std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits10) << v << "\n";
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits10 + 1) << v << "\n";
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits) << v << "\n";

It results in:

1.000000000000000e-02
1.0000000000000002e-02
1.00000000000000019428902930940239457413554200000000000e-02

Based on @Stephen Canon answer below:

We can print out exactly by using the printf() formatter, "%a" or "%A". To achieve this in C++ we need to use the fixed and scientific manipulators (see n3225: 22.4.2.2.2p5 Table 88)

std::cout.flags(std::ios_base::fixed | std::ios_base::scientific);
std::cout << v;

For now I have defined:

template<typename T>
std::ostream& precise(std::ostream& stream)
{
    std::cout.flags(std::ios_base::fixed | std::ios_base::scientific);
    return stream;
}

std::ostream& preciselngd(std::ostream& stream){ return precise<long double>(stream);}
std::ostream& precisedbl(std::ostream& stream) { return precise<double>(stream);}
std::ostream& preciseflt(std::ostream& stream) { return precise<float>(stream);}

Next: How do we handle NaN/Inf?

Restate answered 19/1, 2011 at 17:49 Comment(6)
Why are you including a space after v when outputting to ss?Adrastus
@chrisaycock: No reason. Cut and paste bug.Restate
There will always be a loss of precision, except those values that have a binary denomenator. The question should be How much precision must be maintained?Verada
@Thomas Matthews: I have yet to be convinced that is true. I see no reason that you not precisely store a double and retrieve it. If you have some reference that shows this I would love to read it. Its not as if a floating point number is a fuzzy value it has an exact value. Maybe I need to print a binary version (not sure yet) but I still think it can be done in base 10.Restate
@Martin York: According to many of the floating point issues on SO, there is a loss of precision when converting from a non-binary fraction to internal representation. This is what I am referring to. Knowing there will be a loss, the objective should be to minimize the loss.Verada
@Rüdiger Stevens: the reason for the differences in results due to calculations in fpu (x86) and sse is not due to different standards - they both conform to iEEE754 - but because the former stores intermediate results in a 10-byte format with 64 significant digits in the mantissa and the latter in the standard double precision (sometimes single precision) with 1+52 (1+22) digits. Results with many significant digits will be handled better by fpus than by sses.Mechelle
C
14

Don't print floating-point values in decimal if you don't want to lose precision. Even if you print enough digits to represent the number exactly, not all implementations have correctly-rounded conversions to/from decimal strings over the entire floating-point range, so you may still lose precision.

Use hexadecimal floating point instead. In C:

printf("%a\n", yourNumber);

C++0x provides the hexfloat manipulator for iostreams that does the same thing (on some platforms, using the std::hex modifier has the same result, but this is not a portable assumption).

Using hex floating point is preferred for several reasons.

First, the printed value is always exact. No rounding occurs in writing or reading a value formatted in this way. Beyond the accuracy benefits, this means that reading and writing such values can be faster with a well tuned I/O library. They also require fewer digits to represent values exactly.

Circus answered 19/1, 2011 at 19:58 Comment(4)
It this type specifier present in all runtimes? I have in in Visual C++, but some references don't have it. cplusplus.com/reference/clibrary/cstdio/printfChampaigne
The %a specifier has been in the C standard for 11 years; any platform that still doesn't support it can't really claim to be "C". hexfloat was added in C++0x (I believe -- I'm not a C++ guy), so its use may be somewhat less portable.Circus
You can get the %a formatter by specifying fixed and scientific formatting.Restate
in Visual Studio 2013 you can print it but you can't scan it: #55931226Doura
M
17

It's not correct to say "floating point is inaccurate", although I admit that's a useful simplification. If we used base 8 or 16 in real life then people around here would be saying "base 10 decimal fraction packages are inaccurate, why did anyone ever cook those up?".

The problem is that integral values translate exactly from one base into another, but fractional values do not, because they represent fractions of the integral step and only a few of them are used.

Floating point arithmetic is technically perfectly accurate. Every calculation has one and only one possible result. There is a problem, and it is that most decimal fractions have base-2 representations that repeat. In fact, in the sequence 0.01, 0.02, ... 0.99, only a mere 3 values have exact binary representations. (0.25, 0.50, and 0.75.) There are 96 values that repeat and therefore are obviously not represented exactly.

Now, there are a number of ways to write and read back floating point numbers without losing a single bit. The idea is to avoid trying to express the binary number with a base 10 fraction.

  • Write them as binary. These days, everyone implements the IEEE-754 format so as long as you choose a byte order and write or read only that byte order, then the numbers will be portable.
  • Write them as 64-bit integer values. Here you can use the usual base 10. (Because you are representing the 64-bit aliased integer, not the 52-bit fraction.)

You can also just write more decimal fraction digits. Whether this is bit-for-bit accurate will depend on the quality of the conversion libraries and I'm not sure I would count on perfect accuracy (from the software) here. But any errors will be exceedingly small and your original data certainly has no information in the low bits. (None of the constants of physics and chemistry are known to 52 bits, nor has any distance on earth ever been measured to 52 bits of precision.) But for a backup or restore where bit-for-bit accuracy might be compared automatically, this obviously isn't ideal.

Moravian answered 19/1, 2011 at 18:30 Comment(2)
@MSN: No, "correct" is, well, correct. Floating point is often inaccurate, but so are many integer algorithms. It is entirely possible to write exact algorithms in floating point (in fact, that's a large part of what I get paid to do).Circus
@Stephen, well, it's also irrelevant to the discussion. I guess it would be more "correct" to say the context defines accuracy, not the representation.Tombaugh
C
14

Don't print floating-point values in decimal if you don't want to lose precision. Even if you print enough digits to represent the number exactly, not all implementations have correctly-rounded conversions to/from decimal strings over the entire floating-point range, so you may still lose precision.

Use hexadecimal floating point instead. In C:

printf("%a\n", yourNumber);

C++0x provides the hexfloat manipulator for iostreams that does the same thing (on some platforms, using the std::hex modifier has the same result, but this is not a portable assumption).

Using hex floating point is preferred for several reasons.

First, the printed value is always exact. No rounding occurs in writing or reading a value formatted in this way. Beyond the accuracy benefits, this means that reading and writing such values can be faster with a well tuned I/O library. They also require fewer digits to represent values exactly.

Circus answered 19/1, 2011 at 19:58 Comment(4)
It this type specifier present in all runtimes? I have in in Visual C++, but some references don't have it. cplusplus.com/reference/clibrary/cstdio/printfChampaigne
The %a specifier has been in the C standard for 11 years; any platform that still doesn't support it can't really claim to be "C". hexfloat was added in C++0x (I believe -- I'm not a C++ guy), so its use may be somewhat less portable.Circus
You can get the %a formatter by specifying fixed and scientific formatting.Restate
in Visual Studio 2013 you can print it but you can't scan it: #55931226Doura
C
11

I got interested in this question because I'm trying to (de)serialize my data to & from JSON.

I think I have a clearer explanation (with less hand waiving) for why 17 decimal digits are sufficient to reconstruct the original number losslessly:

enter image description here

Imagine 3 number lines:
1. for the original base 2 number
2. for the rounded base 10 representation
3. for the reconstructed number (same as #1 because both in base 2)

When you convert to base 10, graphically, you choose the tic on the 2nd number line closest to the tic on the 1st. Likewise when you reconstruct the original from the rounded base 10 value.

The critical observation I had was that in order to allow exact reconstruction, the base 10 step size (quantum) has to be < the base 2 quantum. Otherwise, you inevitably get the bad reconstruction shown in red.

Take the specific case of when the exponent is 0 for the base2 representation. Then the base2 quantum will be 2^-52 ~= 2.22 * 10^-16. The closest base 10 quantum that's less than this is 10^-16. Now that we know the required base 10 quantum, how many digits will be needed to encode all possible values? Given that we're only considering the case of exponent = 0, the dynamic range of values we need to represent is [1.0, 2.0). Therefore, 17 digits would be required (16 digits for fraction and 1 digit for integer part).

For exponents other than 0, we can use the same logic:

    exponent    base2 quant.   base10 quant.  dynamic range   digits needed
    ---------------------------------------------------------------------
    1              2^-51         10^-16         [2, 4)           17
    2              2^-50         10^-16         [4, 8)           17
    3              2^-49         10^-15         [8, 16)          17
    ...
    32             2^-20         10^-7        [2^32, 2^33)       17
    1022          9.98e291      1.0e291    [4.49e307,8.99e307)   17

While not exhaustive, the table shows the trend that 17 digits are sufficient.

Hope you like my explanation.

Casilde answered 24/11, 2012 at 9:38 Comment(1)
Thanks for the up votes. I did get some further insight. The base 10 quantum has to be <= the base 2 quantum because that's the only way it guarantees that for every point on the base 2 number line, the closest base 10 tick is within 1/2 a step! That ensures an exact conversion.Casilde
P
9

In C++20 you'll be able to use std::format to do this:

std::stringstream ss;
double v = 0.1 * 0.1;
ss << std::format("{}", v);
double u;
ss >> u;
assert(v == u);

The default floating-point format is the shortest decimal representation with a round-trip guarantee. The advantage of this method compared to using the precision of max_digits10 (not digits10 which is not suitable for round trip through decimal) from std::numeric_limits is that it doesn't print unnecessary digits.

In the meantime you can use the {fmt} library, std::format is based on. For example (godbolt):

fmt::print("{}", 0.1 * 0.1);

Output (assuming IEEE754 double):

0.010000000000000002

{fmt} uses the Dragonbox algorithm for fast binary floating point to decimal conversion. In addition to giving the shortest representation it is 20-30x faster than common standard library implementations of printf and iostreams.

Disclaimer: I'm the author of {fmt} and C++20 std::format.

Projection answered 17/12, 2020 at 23:6 Comment(3)
"it doesn't print unnecessary digits" -- then why would the output not be "0.01" ?Doura
@PatrickParker, because it would loose precision which is not what the OP wants. g uses the default precision of 6 as printf which is why it's not suitable here.Projection
Thanks. I was overlooking something obvious in that the arithmetic operation had introduced that small deviation.Doura
C
7

A double has the precision of 52 binary digits or 15.95 decimal digits. See http://en.wikipedia.org/wiki/IEEE_754-2008. You need at least 16 decimal digits to record the full precision of a double in all cases. [But see fourth edit, below].

By the way, this means significant digits.

Answer to OP edits:

Your floating point to decimal string runtime is outputing way more digits than are significant. A double can only hold 52 bits of significand (actually, 53, if you count a "hidden" 1 that is not stored). That means the the resolution is not more than 2 ^ -53 = 1.11e-16.

For example: 1 + 2 ^ -52 = 1.0000000000000002220446049250313 . . . .

Those decimal digits, .0000000000000002220446049250313 . . . . are the smallest binary "step" in a double when converted to decimal.

The "step" inside the double is:

.0000000000000000000000000000000000000000000000000001 in binary.

Note that the binary step is exact, while the decimal step is inexact.

Hence the decimal representation above,

1.0000000000000002220446049250313 . . .

is an inexact representation of the exact binary number:

1.0000000000000000000000000000000000000000000000000001.

Third Edit:

The next possible value for a double, which in exact binary is:

1.0000000000000000000000000000000000000000000000000010

converts inexactly in decimal to

1.0000000000000004440892098500626 . . . .

So all of those extra digits in the decimal are not really significant, they are just base conversion artifacts.

Fourth Edit:

Though a double stores at most 16 significant decimal digits, sometimes 17 decimal digits are necessary to represent the number. The reason has to do with digit slicing.

As I mentioned above, there are 52 + 1 binary digits in the double. The "+ 1" is an assumed leading 1, and is neither stored nor significant. In the case of an integer, those 52 binary digits form a number between 0 and 2^53 - 1. How many decimal digits are necessary to store such a number? Well, log_10 (2^53 - 1) is about 15.95. So at most 16 decimal digits are necessary. Let's label these d_0 to d_15.

Now consider that IEEE floating point numbers also have an binary exponent. What happens when we increment the exponet by, say, 2? We have multiplied our 52-bit number, whatever it was, by 4. Now, instead of our 52 binary digits aligning perfectly with our decimal digits d_0 to d_15, we have some significant binary digits represented in d_16. However, since we multiplied by something less than 10, we still have significant binary digits represented in d_0. So our 15.95 decimal digits now occuply d_1 to d_15, plus some upper bits of d_0 and some lower bits of d_16. This is why 17 decimal digits are sometimes needed to represent a IEEE double.

Fifth Edit

Fixed numerical errors

Champaigne answered 19/1, 2011 at 18:3 Comment(1)
When I use scientific notation and a precision this works exactly as you describe. (numeric_limits<double>::digits10 + 1) == 16. And in my original code this indicates no precision lost. But when I print out with 53 digits it indicates that there is more precision than I was using (see edit above). I don't understand the discrepancy.Restate
L
4

The easiest way (for IEEE 754 double) to guarantee a round-trip conversion is to always use 17 significant digits. But that has the disadvantage of sometimes including unnecessary noise digits (0.1 → "0.10000000000000001").

An approach that's worked for me is to sprintf the number with 15 digits of precision, then check if atof gives you back the original value. If it doesn't, try 16 digits. If that doesn't work, use 17.

You might want to try David Gay's algorithm (used in Python 3.1 to implement float.__repr__).

Lusatia answered 20/1, 2011 at 1:22 Comment(1)
There is an interesting anomaly in the "try 15 then 16 then 17" process that may skip a round-tripping 16-digit string -- see my article exploringbinary.com/…Assessment
M
3

Thanks to ThomasMcLeod for pointing out the error in my table computation

To guarantee round-trip conversion using 15 or 16 or 17 digits is only possible for a comparatively few cases. The number 15.95 comes from taking 2^53 (1 implicit bit + 52 bits in the significand/"mantissa") which comes out to an integer in the range 10^15 to 10^16 (closer to 10^16).

Consider a double precision value x with an exponent of 0, i.e. it falls into the floating point range range 1.0 <= x < 2.0. The implicit bit will mark the 2^0 component (part) of x. The highest explicit bit of the significand will denote the next lower exponent (from 0) <=> -1 => 2^-1 or the 0.5 component.

The next bit 0.25, the ones after 0.125, 0.0625, 0.03125, 0.015625 and so on (see table below). The value 1.5 will thus be represented by two components added together: the implicit bit denoting 1.0 and the highest explicit significand bit denoting 0.5.

This illustrates that from the implicit bit downward you have 52 additional, explicit bits to represent possible components where the smallest is 0 (exponent) - 52 (explicit bits in significand) = -52 => 2^-52 which according to the table below is ... well you can see for yourselves that it comes out to quite a bit more than 15.95 significant digits (37 to be exact). To put it another way the smallest number in the 2^0 range that is != 1.0 itself is 2^0+2^-52 which is 1.0 + the number next to 2^-52 (below) = (exactly) 1.0000000000000002220446049250313080847263336181640625, a value which I count as being 53 significant digits long. With 17 digit formatting "precision" the number will display as 1.0000000000000002 and this would depend on the library converting correctly.

So maybe "round-trip conversion in 17 digits" is not really a concept that is valid (enough).

2^ -1 = 0.5000000000000000000000000000000000000000000000000000
2^ -2 = 0.2500000000000000000000000000000000000000000000000000
2^ -3 = 0.1250000000000000000000000000000000000000000000000000
2^ -4 = 0.0625000000000000000000000000000000000000000000000000
2^ -5 = 0.0312500000000000000000000000000000000000000000000000
2^ -6 = 0.0156250000000000000000000000000000000000000000000000
2^ -7 = 0.0078125000000000000000000000000000000000000000000000
2^ -8 = 0.0039062500000000000000000000000000000000000000000000
2^ -9 = 0.0019531250000000000000000000000000000000000000000000
2^-10 = 0.0009765625000000000000000000000000000000000000000000
2^-11 = 0.0004882812500000000000000000000000000000000000000000
2^-12 = 0.0002441406250000000000000000000000000000000000000000
2^-13 = 0.0001220703125000000000000000000000000000000000000000
2^-14 = 0.0000610351562500000000000000000000000000000000000000
2^-15 = 0.0000305175781250000000000000000000000000000000000000
2^-16 = 0.0000152587890625000000000000000000000000000000000000
2^-17 = 0.0000076293945312500000000000000000000000000000000000
2^-18 = 0.0000038146972656250000000000000000000000000000000000
2^-19 = 0.0000019073486328125000000000000000000000000000000000
2^-20 = 0.0000009536743164062500000000000000000000000000000000
2^-21 = 0.0000004768371582031250000000000000000000000000000000
2^-22 = 0.0000002384185791015625000000000000000000000000000000
2^-23 = 0.0000001192092895507812500000000000000000000000000000
2^-24 = 0.0000000596046447753906250000000000000000000000000000
2^-25 = 0.0000000298023223876953125000000000000000000000000000
2^-26 = 0.0000000149011611938476562500000000000000000000000000
2^-27 = 0.0000000074505805969238281250000000000000000000000000
2^-28 = 0.0000000037252902984619140625000000000000000000000000
2^-29 = 0.0000000018626451492309570312500000000000000000000000
2^-30 = 0.0000000009313225746154785156250000000000000000000000
2^-31 = 0.0000000004656612873077392578125000000000000000000000
2^-32 = 0.0000000002328306436538696289062500000000000000000000
2^-33 = 0.0000000001164153218269348144531250000000000000000000
2^-34 = 0.0000000000582076609134674072265625000000000000000000
2^-35 = 0.0000000000291038304567337036132812500000000000000000
2^-36 = 0.0000000000145519152283668518066406250000000000000000
2^-37 = 0.0000000000072759576141834259033203125000000000000000
2^-38 = 0.0000000000036379788070917129516601562500000000000000
2^-39 = 0.0000000000018189894035458564758300781250000000000000
2^-40 = 0.0000000000009094947017729282379150390625000000000000
2^-41 = 0.0000000000004547473508864641189575195312500000000000
2^-42 = 0.0000000000002273736754432320594787597656250000000000
2^-43 = 0.0000000000001136868377216160297393798828125000000000
2^-44 = 0.0000000000000568434188608080148696899414062500000000
2^-45 = 0.0000000000000284217094304040074348449707031250000000
2^-46 = 0.0000000000000142108547152020037174224853515625000000
2^-47 = 0.0000000000000071054273576010018587112426757812500000
2^-48 = 0.0000000000000035527136788005009293556213378906250000
2^-49 = 0.0000000000000017763568394002504646778106689453125000
2^-50 = 0.0000000000000008881784197001252323389053344726562500
2^-51 = 0.0000000000000004440892098500626161694526672363281250
2^-52 = 0.0000000000000002220446049250313080847263336181640625
Mechelle answered 31/1, 2011 at 13:57 Comment(7)
First, the conversion math is not correct. For example, 2^-7 is 0.0078125, not 0.0070125 as you have posted. Second, even if the digits on the last line were correct, they are not significant. They are base conversion artifacts. See my post above.Champaigne
@ThomasMcLeod: thank you for pointing out the errors. As to your statement "they are not significant" I beg to differ. In an overwhelming majority of cases they won't be significant but in a few they will. My post attempted to point out the complexities of rounding and conversion by showing the number of digits actually involved.Mechelle
@Olof, how do you define significant? If we divide 1 by 3, we have 0.3333333333333333 ... , but that does not mean that we have infinite significant digits. Basic rule: the result of a mathematical operation can never have more significant digits than the number of significant digits of any numerical input to that operation.Champaigne
@ThomasMcLeod: basic rule or basic assumption? How many significant digits in pi or e or for GPS? My impression is that a significant (haha!) fraction of questions related to FP have to do with differing not-very-significant digits at the end of a calculation and an ambition to achieve precise or exact results leading to a frustration as to why they weren't achieved. Without a relatively deep dive into basics of FP we'll be stuck in situations where someone asks "how many digits," someone answers "16" and then "the error is now smaller ... but not gone - will 17 work?"Mechelle
@Olof, pi and e are irrational and by definition have infinite significant digits in any rational base. However, the appoximate representation of pi and e in a double is rational, as is every number with a finite significand. The point here is that, when you convert from one base to another, you do not and cannot increase the significant digits (fp precision) of a number. To do so would require an increse to the information carrying capacity of fixed-length data, which is impossible. I challenge you to find a scenario where more than 17 digits are required for round-trip conversion.Champaigne
@ThomasMcLeod: my point is that quite a few developers wonder why they can't do a round-trip conversion and end up with the original value because they are not aware enough of how floating point works. It may well be that their scenario does not require 17 digits of precision but they will eventually reach it in search of the "accurate conversion." Voyager space travel? South Korean National accounting? Indonesian National accounting?Mechelle
@ThomasMcLeod: If one measured a resistance as 3.00 ohms, the amount of current that would flow through it with exactly one volt applied would be 0.333 amps (not infinite precision, since we can't tell with certainty that the value is greater than 0.3332 or less than .3334) but the fraction one-third is 0.333333... with infinite precision, since no matter how many digits one writes, one can tell that making the next digit a 4 would yield a value that was erroneously larger than 1/3, and making it a 2 would yield a value that was erroneously smaller than 1/3.Mort
I
0

@ThomasMcLeod: I think the significant digit rule comes from my field, physics, and means something more subtle:

If you have a measurement that gets you the value 1.52 and you cannot read any more detail off the scale, and say you are supposed to add another number (for example of another measurement because this one's scale was too small) to it, say 2, then the result (obviously) has only 2 decimal places, i.e. 3.52. But likewise, if you add 1.1111111111 to the value 1.52, you get the value 2.63 (and nothing more!).

The reason for the rule is to prevent you from kidding yourself into thinking you got more information out of a calculation than you put in by the measurement (which is impossible, but would seem that way by filling it with garbage, see above).

That said, this specific rule is for addition only (for addition: the error of the result is the sum of the two errors - so if you measure just one badly, though luck, there goes your precision...).

How to get the other rules: Let's say a is the measured number and δa the error. Let's say your original formula was: f:=m a Let's say you also measure m with error δm (let that be the positive side). Then the actual limit is: f_up=(m+δm) (a+δa) and f_down=(m-δm) (a-δa) So, f_up =m a+δm δa+(δm a+m δa) f_down=m a+δm δa-(δm a+m δa) Hence, now the significant digits are even less: f_up ~m a+(δm a+m δa) f_down~m a-(δm a+m δa) and so δf=δm a+m δa If you look at the relative error, you get: δf/f=δm/m+δa/a

For division it is δf/f=δm/m-δa/a

Hope that gets the gist across and hope I didn't make too many mistakes, it's late here :-)

tl,dr: Significant digits mean how many of the digits in the output actually come from the digits in your input (in the real world, not the distorted picture that floating point numbers have). If your measurements were 1 with "no" error and 3 with "no" error and the function is supposed to be 1/3, then yes, all infinite digits are actual significant digits. Otherwise, the inverse operation would not work, so obviously they have to be.

If significant digit rule means something completely different in another field, carry on :-)

Implied answered 26/4, 2011 at 22:50 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.