Is there any accuracy gain when casting to double and back when doing float division?
What is the difference between two following?

float f1 = some_number;
float f2 = some_near_zero_number;
float result;

result = f1 / f2;

and:

float f1 = some_number;
float f2 = some_near_zero_number;
float result;

result = (double)f1 / (double)f2;

I am especially interested in very small f2 values which may produce +infinity when operating on floats. Is there any accuracy to be gained?

Some practical guidelines for using this kind of cast would be nice as well.

Upholster answered 5/2, 2015 at 12:17 Comment(3)
If you are worrying about rounding errors, why would you be using float in the first place? - Urine
Because I keep huge structures in RAM (several GBs or more) and using doubles is not an option for storage; casting back and forth is an option when doing calculations, though. - Upholster
Noteworthy fact: x86 uses 80 bits for floating-point division, whether the types are 32-bit or 64-bit. - Rothstein
I am going to assume IEEE 754 binary floating point arithmetic, with float 32 bit and double 64 bit.

In general, there is no advantage to doing the calculation in double, and in some cases it may make things worse through doing two rounding steps.

Conversion from float to double is exact. For infinite, NaN, or zero divisor inputs it makes no difference. Given a finite result, the IEEE 754 standard requires the result to be the real-number quotient f1/f2, rounded to the type being used in the division.

If it is done as a float division, that is the closest float to the exact result. If it is done as a double division, it will be the closest double, with an additional rounding step for the assignment to result.

For most inputs, the two will give the same answer. Any overflow or underflow that did not happen on the division, because it was done in double, will happen instead on the conversion back to float.

For a simple conversion, if the answer is very close to halfway between two float values, the two rounding steps may pick the wrong float. I had assumed this could also apply to division results. However, Pascal Cuoq, in a comment on this answer, has called attention to a very interesting paper, "Innocuous Double Rounding of Basic Arithmetic Operations" by Pierre Roux, which claims a proof that double rounding is harmless for several operations, including division, under conditions implied by the assumptions I made at the start of this answer.

Adest answered 5/2, 2015 at 12:27 Comment(2)
Note that / is one of the operations that does not suffer from double-rounding when the significand of the intermediate format is at least twice as wide as the significand of the final format. This is the case when the intermediate format is binary64 and the final format binary32. Figueroa proved this for normal intermediate results and Pierre Roux appears to have decided to verify it formally and for all cases: hal.archives-ouvertes.fr/hal-01091186/document - Migratory
@PascalCuoq Thanks for the information, which I have folded into the answer. - Adest
If the result of an individual floating-point addition, subtraction, multiplication, or division is immediately stored to a float, there will be no accuracy improvement from using double for intermediate values. In cases where operations are chained together, however, accuracy will often be improved by using a higher-precision intermediate type, provided that one is consistent in using it. In Turbo Pascal circa 1986, code like:

Function TriangleArea(A, B, C: Single): Single;
Var
  S: Extended;  (* S stands for semi-perimeter *)
Begin
  S := (A + B + C) * 0.5;
  TriangleArea := Sqrt((S - A) * (S - B) * (S - C) * S)
End;

would extend all operands of floating-point operations to type Extended (the 80-bit float), and then convert them back to single or double precision when storing to variables of those types. Very nice semantics for numerical processing. Turbo C of that era behaved similarly, but rather unhelpfully failed to provide any numeric type capable of holding intermediate results. The failure of languages to provide a variable type that could hold intermediate results led people to criticize, unfairly, the concept of a higher-precision intermediate type, when the real problem was that languages failed to support it properly.

Anyway, if one were to write the above method into a modern language like C#:

    public static float triangleArea(float a, float b, float c)
    {
        double s = (a + b + c) * 0.5;
        return (float)(Math.Sqrt((s - a) * (s - b) * (s - c) * s));
    }

the code would work well if the compiler happens to promote the operands of the addition to double before performing the computation, but that's something it may or may not do. If the compiler performs the calculation as float, precision may be horrid. When using the above formula to compute the area of an isosceles triangle with long sides of 16777215 and a short side of 4, for example, eager promotion will yield a correct result of 3.355443E+7 while performing the math as float will, depending upon the order of the operands, yield 5.033165E+7 [more than 50% too big] or 16777214.0 [more than 50% too small].

Note that although code like the above will work perfectly in some environments yet yield completely bogus results in others, compilers will generally not give any warning about the situation.

Although individual operations on float which are going to be immediately stored to float can be done just as accurately with type float as they could be with type double, eagerly promoting operands will often help considerably when operations are combined. In some cases, rearranging operations may avoid the problems caused by lost promotion. For example, the above formula uses five additions, four multiplications, and a square root; rewriting it as:

Math.Sqrt((a+b+c)*(b-a+c)*(a-b+c)*(a-c+b))*0.25

increases the number of additions to eight, but will work correctly even if the operations are performed at single precision.

Duffey answered 5/2, 2015 at 17:58 Comment(0)
"Accuracy gain when casting to double and back when doing float division?"
The result depends on factors beyond the two posted methods alone.


C allows float operations to be evaluated at different precisions depending on FLT_EVAL_METHOD (see the table below). If the current setting is 1 or 2, the two methods posted by OP will provide the same answer.

Depending on other code and compiler optimization levels, the quotient may be carried at wider precision into subsequent calculations, in either of OP's cases.

Because of this, a float division that overflows, or underflows to 0.0 (a result with total loss of precision), on extreme float values may, when folded into subsequent calculations, not overflow or underflow after all, because the quotient was carried forward as double.

To compel the quotient to become a float for subsequent calculations in the midst of potential optimizations, code often uses volatile:

volatile float result = f1 / f2;

C does not specify the precision of math operations, yet common implementation of standards like IEEE 754 provides that a single operation, like a binary32 divide, will result in the closest representable answer. Should the divide occur at a wider format like double or long double, then the conversion of the wider quotient back to float involves another rounding step that on rare occasions can result in a different answer than the direct float/float division.


FLT_EVAL_METHOD
-1  indeterminable;
 0  evaluate all operations and constants just to the range and precision of the type;
 1  evaluate operations and constants of type float and double to the range and precision of the double type; evaluate long double operations and constants to the range and precision of the long double type;
 2  evaluate all operations and constants to the range and precision of the long double type.

Practical guidelines:
Use float vs. double to conserve space when needed (float is usually narrower than double, rarely the same width). If precision is important, use double (or long double).

Using float rather than double to improve speed may or may not work, as a platform's native operations may all be at double precision. It may be faster, the same, or slower: profile to find out. Much of C was originally designed around double: all floating-point arithmetic was carried out in double, aside from conversions to and from float. C later added functions like sinf() to facilitate faster, direct float operations. So the more modern the compiler/platform, the more likely float will be faster. Again: profile to find out.

Antenna answered 5/2, 2015 at 13:23 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.