What precision are floating-point arithmetic operations done in?
Consider two very simple multiplications below:

double result1;
long double result2;
float var1=3.1;
float var2=6.789;
double var3=87.45;
double var4=234.987;

result1=var1*var2;
result2=var3*var4;

Are multiplications done by default in a higher precision than the operands? That is, is the first multiplication done in double precision, and, on the x86 architecture, is the second done in 80-bit extended precision? Or should we cast the operands in the expressions to the higher precision ourselves, like below?

result1=(double)var1*(double)var2;
result2=(long double)var3*(long double)var4;

What about other operations (addition, division, and remainder)? For example, when adding more than two positive single-precision values, using the extra significand bits of double precision to hold the intermediate results of the expression can decrease round-off error.
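For concreteness, here is a minimal sketch of the kind of accumulation meant in the last sentence (the function name sum_in_double is made up for illustration):

#include <cstdio>

// Sketch: accumulating floats in a double keeps extra significand bits
// in the intermediate sums; the only rounding to float happens once,
// at the end.
float sum_in_double(const float *v, int n)
{
    double acc = 0.0;               // wider intermediate accumulator
    for (int i = 0; i < n; ++i)
        acc += v[i];                // each addition rounded to double, not float
    return (float)acc;              // single final rounding to float
}

int main()
{
    // 2^24 + 1 + 1: a float accumulator would lose both increments
    // (each sum rounds back to 2^24), while a double accumulator
    // keeps them and prints 16777218.
    float v[3] = {16777216.0f, 1.0f, 1.0f};
    std::printf("%f\n", sum_in_double(v, 3));
    return 0;
}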

Guthrun answered 14/8, 2014 at 7:35 Comment(3)
you should read floating-point-gui.dePrecritical
@BasileStarynkevitch: how does that address the question?Veedis
This greatly depends on your compiler version. The current versions of the big three all use SSE2 so use 64-bit precision. Just look at the generated machine code to know. You'll need better test code, it is done at compile-time for the snippets you posted.Hanafee

Precision of floating-point computations

C++11 incorporates the definition of FLT_EVAL_METHOD from C99 in <cfloat>.

FLT_EVAL_METHOD     

Possible values:
-1 indeterminable
 0 evaluate just to the range and precision of the type
 1 evaluate float and double as double, and long double as long double
 2 evaluate all as long double
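You can check which method your compiler uses; a minimal sketch, assuming a C++11 implementation (the macro comes from C99's float.h):

#include <cfloat>
#include <iostream>

int main()
{
    // Prints the evaluation method the implementation advertises.
    std::cout << "FLT_EVAL_METHOD = " << FLT_EVAL_METHOD << '\n';
    return 0;
}

On x86-64 with SSE2 this typically prints 0; 32-bit builds targeting the 387 FPU typically print 2.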

If your compiler defines FLT_EVAL_METHOD as 2, then the computations of r1 and r2, and of s1 and s2 below are respectively equivalent:

double var3 = …;
double var4 = …;

double r1 = var3 * var4;
double r2 = (long double)var3 * (long double)var4;

long double s1 = var3 * var4;
long double s2 = (long double)var3 * (long double)var4;

If your compiler defines FLT_EVAL_METHOD as 2, then in all four computations above, the multiplication is done at the precision of the long double type.

However, if the compiler defines FLT_EVAL_METHOD as 0 or 1, then r1 and r2 (and likewise s1 and s2) aren't always the same: the multiplications computing r1 and s1 are done at the precision of double, while the multiplications computing r2 and s2 are done at the precision of long double.

Getting wide results from narrow arguments

If you are computing results that are destined to be stored in a wider result type than the type of the operands, as are result1 and result2 in your question, you should always convert the arguments to a type at least as wide as the target, as you do here:

result2=(long double)var3*(long double)var4;

Without this conversion (if you write var3 * var4), if the compiler's definition of FLT_EVAL_METHOD is 0 or 1, the product will be computed in the precision of double, which is a shame, since it is destined to be stored in a long double.

If the compiler defines FLT_EVAL_METHOD as 2, then the conversions in (long double)var3*(long double)var4 are not necessary, but they do not hurt either: the expression means exactly the same thing with and without them.
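On a platform where FLT_EVAL_METHOD is 0 and long double is wider than double (for instance x86-64 Linux), the following sketch, using the values from the question, makes the difference observable:

#include <cstdio>

int main()
{
    double var3 = 87.45, var4 = 234.987;
    long double narrow = var3 * var4;            // multiplied as double, then widened
    long double wide = (long double)var3 * var4; // multiplied as long double
    std::printf("narrow: %.21Lg\nwide:   %.21Lg\n", narrow, wide);
    return 0;
}

Note that a single cast suffices: the usual arithmetic conversions then convert the other operand to long double as well.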

Digression: if the destination format is as narrow as the arguments, when is extended-precision for intermediate results better?

Paradoxically, for a single operation, rounding only once to the target precision is best. The only effect of computing a single multiplication in extended precision is that the result will be rounded to extended precision and then to double precision. This makes it less accurate. In other words, with FLT_EVAL_METHOD 0 or 1, the result r2 above is sometimes less accurate than r1 because of double-rounding, and if the compiler uses IEEE 754 floating-point, never better.
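The following sketch searches for such double-rounding cases by comparing a product rounded once, directly to double, against the same product rounded to long double and then to double. On a platform where FLT_EVAL_METHOD is 0 and long double has a 64-bit significand, it occasionally prints a difference; where long double has the same format as double, it never will:

#include <cstdio>
#include <random>

int main()
{
    std::mt19937_64 gen(42);
    std::uniform_real_distribution<double> dist(1.0, 2.0);
    for (int i = 0; i < 10000000; ++i) {
        double a = dist(gen), b = dist(gen);
        double once = a * b;  // rounded once, directly to double
        // rounded to long double, then again to double:
        double twice = (double)((long double)a * (long double)b);
        if (once != twice)
            std::printf("double rounding: %a * %a -> %a vs %a\n", a, b, once, twice);
    }
    return 0;
}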

The situation is different for larger expressions that contain several operations. For these, it is usually better to compute intermediate results in extended precision, either through explicit conversions or because the compiler uses FLT_EVAL_METHOD == 2. This question and its accepted answer show that when computing with 80-bit extended precision intermediate computations for binary64 IEEE 754 arguments and results, the interpolation formula u2 * (1.0 - u1) + u1 * u3 always yields a result between u2 and u3 for u1 between 0 and 1. This property may not hold for binary64-precision intermediate computations because of the larger rounding errors then.
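As a sketch, the interpolation could force extended-precision intermediates explicitly, independently of FLT_EVAL_METHOD; the helper name lerp_extended is made up for illustration:

// Sketch: interpolate between u2 and u3 with extended-precision
// intermediates. Per the linked answer, with 80-bit intermediates
// (long double wider than double, as on x86) the result stays
// between u2 and u3 for u1 between 0 and 1.
double lerp_extended(double u1, double u2, double u3)
{
    long double t = u1;
    return (double)((long double)u2 * (1.0L - t) + t * (long double)u3);
}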

Hargrave answered 14/8, 2014 at 8:33 Comment(20)
well I think you didn't get it right: the double-rounding issue doesn't apply here at all! It's about intermediate/final results getting rounded more than once, and here in your example no such thing happens: the operands are just converted to a higher precision (by the way, this is not called rounding) and the result gets rounded just one time, to double (you might want to know that internally, in the FP unit, at least double/single multiplication is done in higher precision; that's what Garp mentioned in #1 in his post)Guthrun
@Guthrun I am glad that you found some help in another answer, but in the computation of r2 (and in the computation of r1 when FLT_EVAL_METHOD is 2), the final result does “get rounded more than once”, and that is called double-rounding. The result is computed and rounded to long double precision (you could argue against this use of the word “rounded”, although it is common to explain IEEE 754 basic operations as computing the real result and then rounding to the precision of the operation), and then rounded to double in order to be stored in memory.Hargrave
@Guthrun In the examples in your question, however, you are right that there is no double-rounding. I had not noticed this. I will expand my answer.Hargrave
I don't seem to understand your point: what precision is the result in, that it needs to be rounded to long-double precision first and then to double?Guthrun
@Guthrun When you assign the result to a variable of type double, it has to be converted from the extended format in the FPU to the format of double. C99 is very clear about this (although C++11 is not quite as clear).Hargrave
I know, but where is the second rounding?Guthrun
@Pooria: that is the second rounding. The first rounding is when the “infinitely-precise” result of the multiplication is rounded to working precision.Kcal
@Guthrun Let us say you have an FPU for the 80-bit extended format. What I call the first rounding is because the FPU computes “as if” it calculated the exact result and then rounded to the 80-bit format. The second rounding happens when this result needs to be rounded from the 80-bit format to the 64-bit double format for storing in a variable.Hargrave
so you mean in x86 a multiplication can be done at a higher precision than 80-bit extended? doesn't that require registers with a number of significand bits near twice that of 80-bit extended?Guthrun
@Guthrun The result of the multiplication is “as if” the exact result had been computed and rounded. Depending on the algorithms used internally, this may require representing the exact result with twice the number of significand bits, or tricks can be used to save space (the three “guard bits” that Garp refers to in his answer). My speciality is not the hardware implementation, so I cannot say much. But the result should be “as if” the exact value had been computed and rounded.Hargrave
I don't think "as if exactly computed" applies in case of doing arithmetic operation on extended precision, as quoted here, "It was designed to support a 32-bit "single precision" format and a 64-bit "double precision" format for encoding and interchanging floating point numbers. The double extended format was designed not to store data at higher precision as such, but rather primarily to allow for the computation of double results more reliably and accurately by minimizing overflow and roundoff-errors in intermediate calculations".Guthrun
@Guthrun The justification for the 80-bit extended format is what you quote above (which is also what much of my answer is about). And, separately, the definition of all IEEE 754 basic operations (+, -, *, / and sqrt) in all formats is that they must produce the same result as if the exact result had been computed and rounded to the destination format. This is why people sometimes use the word “rounding” about the result of a basic operation.Hargrave
after reading here, what Garp said (FP multipliers internally use double the width of the operands to generate an intermediate result) doesn't seem to be true; it depends on FLT_EVAL_METHOD, as you saidGuthrun
@Guthrun Garp's answer is about the internal implementation of one operation (either 64-bit wide or 80-bit wide). FLT_EVAL_METHOD is about what operation the compiler decides to use, so that a C++ program can be compiled for the 387 and also for a modern SSE2 processors, and that it is possible to predict what a given program will do. Garp's answer and my answer are talking about different aspects, they are not contradictory, only complementary.Hargrave
well here it says "FLT_EVAL_METHOD == 2 indicates that all internal intermediate computations are performed by default at high precision (long double) where available" and " FLT_EVAL_METHOD == 1 performs all internal intermediate expressions in double precision (unless an operand is long double)" this means to me the operation itself is done at a specific precision not just that result is rounded to that precision, doesn't "computing at a precision" also mean "computing internally in FP at a precision"?Guthrun
@Guthrun What happens inside the FPU is only the FPU designer's problem. On a webpage that discusses C99, “computing at a precision P” means “using the instruction that takes operands of width P and produces a correctly rounded result of width P”, regardless of how this instruction is implemented (if the operation is a multiplication, it is likely implemented with a wider internal result in the processor, as Garp said, but that wider result that temporarily exists in the FPU is not stored).Hargrave
as I said, unfortunately "performing internal intermediate computations/expressions at a specific precision" has a very distinct meaning that rules out the idea of, for example, using long-double-precision registers for computing at double precision; the text could easily have said so if that were what was meantGuthrun
@PascalCuoq OK, looks like you are right: by "computing at a precision P means using the instruction that takes operands of width P and produces a correctly rounded result of width P, regardless of how this instruction is implemented", it seems that internally the FPU uses double the width of the operands' significands (+2 in case there were hidden bits) to produce the product of the significand bits. I had a wrong idea of the internal FPU workings :)Guthrun
I wonder why my 1980s compilers for the 8087 used behavior equivalent to FLT_EVAL_METHOD==2 cleanly and consistently, but over the years that gave way to loosey-goosey semantics? Computing intermediate results as long double is easier for a compiler than trying to have the intermediate result type vary with the operand types, and in most non-contrived situations gives more useful results. While an 80-bit intermediate type won't guarantee that d4=d1+d2+d3; will be perfectly evaluated for all values of d1, d2, and d3, it will yield good results easily when the values are of comparable magnitudes.Spadefish
By comparison, the amount of code required to compute a simple sum like d4=d1+d2+d3; for values of similar magnitudes ends up being much more cumbersome if the intermediate result can't be computed using extended precision.Spadefish
  1. For floating-point multiplication: FP multipliers internally use double the width of the operands to generate an intermediate result, which equals the real result to infinite precision, and then round it to the target precision. Thus you should not worry about multiplication; the result is correctly rounded.
  2. For floating-point addition, the result is also correctly rounded: standard FP adders use 3 extra guard bits, which is sufficient to compute a correctly rounded result.
  3. For division, remainder, and complicated functions such as the transcendentals (sin, log, exp, etc.), it depends mainly on the architecture and the libraries used. I recommend the MPFR library if you seek correctly rounded results for division or any other complicated function; a minimal sketch follows this list.
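A minimal sketch of a correctly rounded division with MPFR (a C library; link with -lmpfr -lgmp); the 113-bit precision is just an example value:

#include <stdio.h>
#include <mpfr.h>

int main(void)
{
    mpfr_t a, b, q;
    mpfr_inits2(113, a, b, q, (mpfr_ptr)0); // 113-bit significands
    mpfr_set_d(a, 87.45, MPFR_RNDN);
    mpfr_set_d(b, 234.987, MPFR_RNDN);
    mpfr_div(q, a, b, MPFR_RNDN);           // correctly rounded division
    mpfr_printf("%.30Rg\n", q);
    mpfr_clears(a, b, q, (mpfr_ptr)0);
    return 0;
}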
Chromaticness answered 14/8, 2014 at 8:18 Comment(2)
interestingly you addressed my main concern in #1, but in the case of x86 and long double (80-bit extended precision) there are no registers to hold double the width, am I right? I mean there are quadruples, but not in x86 :)Guthrun
Thank you, but what I meant by "multipliers use internally double the width of the operands" is that this is executed entirely within the multiplier itself, without your intervention. You will get your correctly rounded result no matter what the precision. For example, if the operands are 80 bits long, i.e. 64 bits of mantissa, the multiplier computes a 128-bit-long intermediate result, then rounds it to a 64-bit-long result, then saves it to your destination register along with the exponent and sign, constituting an 80-bit-long result. TL;DR: you should not worry about FP addition and multiplication.Chromaticness

The usual arithmetic conversions for floating point types are applied before multiplication, division, and modulus:

The usual arithmetic conversions are performed on the operands and determine the type of the result.

§5.6 [expr.mul]

Similarly for addition and subtraction:

The usual arithmetic conversions are performed for operands of arithmetic or enumeration type.

§5.7 [expr.add]

The usual arithmetic conversions for floating point types are laid out in the standard as follows:

Many binary operators that expect operands of arithmetic or enumeration type cause conversions and yield result types in a similar way. The purpose is to yield a common type, which is also the type of the result. This pattern is called the usual arithmetic conversions, which are defined as follows:

[...]

— If either operand is of type long double, the other shall be converted to long double.

— Otherwise, if either operand is double, the other shall be converted to double.

— Otherwise, if either operand is float, the other shall be converted to float.

§5 [expr]
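For instance, under these rules, a float operand is converted before the multiplication; a minimal sketch using the question's variables:

float var1 = 3.1f;   // float operand
double var3 = 87.45; // double operand

// var1 is converted to double by the usual arithmetic conversions;
// the common type, and the type of the result, is double.
double r = var1 * var3;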

The actual form/precision of these floating point types is implementation-defined:

The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.

§3.9.1 [basic.fundamental]
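You can query what your implementation actually provides with std::numeric_limits; a minimal sketch:

#include <iostream>
#include <limits>

int main()
{
    // Significand bits of each floating-point type on this implementation.
    std::cout << "float digits:       " << std::numeric_limits<float>::digits << '\n';
    std::cout << "double digits:      " << std::numeric_limits<double>::digits << '\n';
    std::cout << "long double digits: " << std::numeric_limits<long double>::digits << '\n';
    return 0;
}

On x86-64 Linux this typically prints 24, 53, and 64.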

Read answered 14/8, 2014 at 7:46 Comment(4)
This answer misses the crux of the question; what precision are these calculations performed in behind the scenes?Veedis
That is implementation defined. See §3.9.1 [basic.fundamental].Read
I could only reference std::limits<double> and std::limits<long double> classesAssizes
You mean std::numeric_limits?Read

Not a direct answer to your question, but for constant floating-point values (such as the ones specified in your question), the method that yields the least precision loss is to use the rational representation of each value, an integer numerator divided by an integer denominator, and to perform as many integer multiplications as possible before the final floating-point division.

For the floating-point values specified in your question:

int var1_num = 31;
int var1_den = 10;
int var2_num = 6789;
int var2_den = 1000;
int var3_num = 8745;
int var3_den = 100;
int var4_num = 234987;
int var4_den = 1000;
double result1 = (double)(var1_num*var2_num)/(var1_den*var2_den);
long double result2 = (long double)(var3_num*var4_num)/(var3_den*var4_den);

If any of the integer products are too large to fit in an int, then you can use larger integer types, as sketched after this list:

unsigned int
signed   long
unsigned long
signed   long long
unsigned long long
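For example, a sketch using long long (the numerator product above, 8745 * 234987, only barely fits in a 32-bit int):

long long num = 8745LL * 234987LL;            // exact integer product
long long den = 100LL * 1000LL;
long double result2 = (long double)num / den; // a single rounding, at the division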
Spermatic answered 14/8, 2014 at 13:34 Comment(0)
