## Precision of floating-point computations
C++11 incorporates the definition of `FLT_EVAL_METHOD` from C99 in `<cfloat>`.
`FLT_EVAL_METHOD` — possible values:

- `-1`: undetermined
- `0`: evaluate just to the range and precision of the type
- `1`: evaluate `float` and `double` as `double`, and `long double` as `long double`
- `2`: evaluate all as `long double`
If your compiler defines `FLT_EVAL_METHOD` as 2, then the computations of `r1` and `r2`, and of `s1` and `s2` below, are respectively equivalent:
```c
double var3 = …;
double var4 = …;
double r1 = var3 * var4;
double r2 = (long double)var3 * (long double)var4;
long double s1 = var3 * var4;
long double s2 = (long double)var3 * (long double)var4;
```
If your compiler defines `FLT_EVAL_METHOD` as 2, then in all four computations above the multiplication is done at the precision of the `long double` type.
However, if the compiler defines `FLT_EVAL_METHOD` as 0 or 1, then `r1` and `r2`, and respectively `s1` and `s2`, aren't always the same. The multiplications when computing `r1` and `s1` are done at the precision of `double`, whereas the multiplications when computing `r2` and `s2` are done at the precision of `long double`.
## Getting wide results from narrow arguments
If you are computing results destined to be stored in a wider result type than the type of the operands, as `result1` and `result2` are in your question, you should always convert the arguments to a type at least as wide as the target, as you do here:

```c
result2 = (long double)var3 * (long double)var4;
```
Without this conversion (if you write `var3 * var4`), and if the compiler's definition of `FLT_EVAL_METHOD` is 0 or 1, the product will be computed at the precision of `double`, which is a shame since it is destined to be stored in a `long double`.
If the compiler defines `FLT_EVAL_METHOD` as 2, then the conversions in `(long double)var3 * (long double)var4` are unnecessary, but they do not hurt either: the expression means exactly the same thing with and without them.
## Digression: if the destination format is as narrow as the arguments, when is extended precision for intermediate results better?
Paradoxically, for a single operation, rounding only once to the target precision is best. The only effect of computing a single multiplication in extended precision is that the result will be rounded to extended precision and then to `double` precision, which makes it less accurate. In other words, with `FLT_EVAL_METHOD` 0 or 1, the result `r2` above is sometimes less accurate than `r1` because of double rounding, and, if the compiler uses IEEE 754 floating-point, never better.
The situation is different for larger expressions that contain several operations. For these, it is usually better to compute intermediate results in extended precision, either through explicit conversions or because the compiler uses `FLT_EVAL_METHOD == 2`. This question and its accepted answer show that, when computing with 80-bit extended-precision intermediate computations for binary64 IEEE 754 arguments and results, the interpolation formula `u2 * (1.0 - u1) + u1 * u3` always yields a result between `u2` and `u3` for `u1` between 0 and 1. This property may not hold for binary64-precision intermediate computations because of their larger rounding errors.