Floating point accuracy and order of operations

I'm writing a unit test for a class for 3D vector objects and its algebra (dot product, cross product, etc.) and just observed a behavior I can somehow understand, but not to its full extent.

What I actually do is generate two pseudorandom vectors, b and c, and a pseudorandom scalar, s, and subsequently check the results of different operations on those vectors.

b's components are generated in the range [-1, 1], while c's components are generated in the range [-1e6, 1e6], since in my use case I'll encounter similar situations, which could cause a significant loss of information in the mantissa. s is generated in the range [-1, 1] as well.
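Just to fix ideas, the generation step looks roughly like this (a sketch using numpy's default RNG; it is not my actual test code, which is in C++):

import numpy as np

rng = np.random.default_rng()        # sketch only: any uniform RNG in these ranges will do
b = rng.uniform(-1.0, 1.0, size=3)   # small-magnitude vector
c = rng.uniform(-1e6, 1e6, size=3)   # large-magnitude vector
s = rng.uniform(-1.0, 1.0)           # scalar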

I created a MWE in Python (using numpy) just to illustrate my question better (but I'm actually coding in C++, and the question in itself is language agnostic):

b = np.array([0.4383006177615909, -0.017762134447941058, 0.56005552104818945])
c = np.array([-178151.26386435505, 159388.59511391702, -720098.47337336652])
s = -0.19796489160874975

I then define

d = s*np.cross(b,c)
e = np.cross(b,c)

And finally calculate

In [7]: np.dot(d,c)
Out[7]: -1.9073486328125e-06

In [8]: np.dot(e,c)
Out[8]: 0.0

In [9]: s*np.dot(e,c)
Out[9]: -0.0

Since d and e are both perpendicular to b and c, the scalar products computed above should all give 0 (algebraically).

Now, it's clear to me that in a real computer this can only be achieved within the limits of floating point arithmetic. I would however like to better understand how this error is generated.

What actually surprised me a bit is the poor accuracy of the first of the three results.

I'll try to lay out my thoughts in the following:

  • np.cross(b, c) is basically [b[1]*c[2]-b[2]*c[1], b[2]*c[0]-b[0]*c[2], ...], which involves multiplication of a large and a small number and a subsequent subtraction (see the sketch after this list). e (the cross product b x c) itself keeps relatively large components, i.e. array([-76475.97678585, 215845.00681978, 66695.77300175])
  • So, to get d you still multiply those pretty large components once by a number < 1. This will of course lead to some rounding error.
  • When taking the dot product e . c the result is correct, while d . c is off by almost 2e-6. Can this last multiplication by s lead to such a big difference? A naïve thought would be to say that, given my machine epsilon of 2.22045e-16 and the magnitude of the components of d, the error should be around 4e-11.
  • Is information in the mantissa lost in the subtractions performed in the cross product?
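Here is the sketch referenced in the first bullet: it reuses the MWE values, spells out the cross product component by component, and prints the intermediate products to show their size (purely illustrative).

import numpy as np

b = np.array([0.4383006177615909, -0.017762134447941058, 0.56005552104818945])
c = np.array([-178151.26386435505, 159388.59511391702, -720098.47337336652])

# component-wise expansion of the cross product
e_manual = np.array([b[1]*c[2] - b[2]*c[1],
                     b[2]*c[0] - b[0]*c[2],
                     b[0]*c[1] - b[1]*c[0]])
print(np.allclose(e_manual, np.cross(b, c)))          # True: same expressions
# the products feeding the first component are of order 1e4..1e5, and their
# difference stays of comparable size, so little cancellation happens here
print(b[1]*c[2], b[2]*c[1], b[1]*c[2] - b[2]*c[1])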

To check that last thought, I did:

In [10]: d = np.cross(s*b,c)                                                    

In [11]: np.dot(d,c)                                                            
Out[11]: 0.0

In [12]: d = np.cross(b,s*c)                                                    

In [13]: np.dot(d,c)                                                            
Out[13]: 0.0

And it indeed appears that I lose much more information in the subtraction. Is that correct? How can that be explained in terms of floating point approximation?

Also, does that mean that, regardless of the input (i.e., no matter whether the two vectors are of similar magnitude or completely different), it is better to always perform all the operations involving multiplication (and division?) first, and only then those involving addition/subtraction?

Emotionalism answered 28/8, 2020 at 10:4 Comment(0)

The big loss of information most likely happens in the dot product and not in the cross product. In the cross product, the results that you get are still close to the order of magnitude of the entries in c. That means that you may have lost around one digit of precision, but the relative error should still be around 10^-15 (the relative error of the subtraction a - b is approximately 2*eps*(|a| + |b|) / |a - b|, where eps is the machine epsilon).

The dot product is the only operation that involves a subtraction of two numbers that are very close to each other. This leads to an enormous increase in the relative error, because the exact result is ~0: the absolute error accumulated so far gets divided by ~0 when you form the relative error.

Now, on to your example: the error that you get (~10^-6) is actually what you would expect considering the quantities that you have: c, e and d have magnitudes of ~10^5, which means that the absolute error in their components is around 10^-11 at best. I ignore s because its magnitude is of order 1.

The absolute error when you multiply a*b is approximately |a|*|err_b| + |b|*|err_a| (worst case, where the errors don't cancel out). Now, in the dot product you multiply two quantities with magnitude ~10^5, so each term's error should be in the range of 10^5*10^-11 + 10^5*10^-11 = 2*10^-6 (and multiply by 3, because you do this three times, once for each component).
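To make this concrete, here is a small sketch in the same numpy setting as the question; fractions.Fraction converts each float to its exact rational value, so the reference sum below is computed exactly:

from fractions import Fraction
import numpy as np

b = np.array([0.4383006177615909, -0.017762134447941058, 0.56005552104818945])
c = np.array([-178151.26386435505, 159388.59511391702, -720098.47337336652])
e = np.cross(b, c)

# the three partial products e[i]*c[i] are of order 1e10..1e11 ...
print([float(e[i]) * float(c[i]) for i in range(3)])

# ... yet they cancel almost completely: this is the exact dot product of the
# stored (already rounded) e and c
exact = sum(Fraction(float(e[i])) * Fraction(float(c[i])) for i in range(3))
print(float(exact))   # tiny compared with the individual 1e10-sized terms
print(np.dot(e, c))   # the rounded result differs from `exact` at the ulp
                      # level of those terms, i.e. by something around 1e-6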

Then, if 10^-6 is the expected error, how can I explain your results? Well, you were lucky. Using these values (I changed b[0] and c[0]):

b = np.array([0.4231830061776159, -0.017762134447941058, 0.56005552104818945])
c = np.array([-178151.28386435505, 159388.59511391702, -720098.47337336652])
s = -0.19796489160874975

I got (in order)

-1.9073486328125e-06
7.62939453125e-06
-1.5103522614192943e-06

-1.9073486328125e-06
-1.9073486328125e-06

Also, when you look at the relative error, it's doing a pretty good job:

In [10]: np.dot(d,c)
Out[11]: -1.9073486328125e-06

In [11]: np.dot(d,c) / (np.linalg.norm(e)*np.linalg.norm(c))
Out[11]: -1.1025045691772927e-17

Regarding the order of operations, I don't think it matters much, as long as you are not subtracting two very close numbers. If you do need to subtract two very close numbers, I guess it would be better to do it at the end (so the cancellation doesn't contaminate everything that follows), but don't quote me on that.
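If you want to check this empirically, something along these lines (just a sketch, sampling inputs the way the question describes) compares the three placements of s; none of them is systematically worse than the others:

import numpy as np

rng = np.random.default_rng(0)   # fixed seed just so the sketch is repeatable
worst = np.zeros(3)
for _ in range(10000):
    b = rng.uniform(-1.0, 1.0, 3)
    c = rng.uniform(-1e6, 1e6, 3)
    s = rng.uniform(-1.0, 1.0)
    scale = abs(s) * np.linalg.norm(np.cross(b, c)) * np.linalg.norm(c)
    errs = np.array([abs(np.dot(s * np.cross(b, c), c)),    # scale afterwards
                     abs(np.dot(np.cross(s * b, c), c)),    # scale b first
                     abs(np.dot(np.cross(b, s * c), c))])   # scale c first
    worst = np.maximum(worst, errs / scale)
print(worst)   # the placement of s makes no systematic difference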

Gertrudgertruda answered 28/8, 2020 at 18:27 Comment(2)
Thanks for your answer, which I accepted. For the interested reader, I also strongly suggest reading @njuffa's answer, which brings some insight into possible alternatives in C++.Emotionalism
Also, one might be interested in reading this discussion, too: https://mcmap.net/q/456553/-what-is-the-worst-case-accuracy-for-dot-product/5380294Emotionalism

Miguel's answer is spot on. Just as an addendum, and since the OP works with C++, I coded up the computation in the most accurate way I am aware of, taking advantage of fused multiply-add operations as much as possible. In addition, I tried a compensated dot product. One can think of this as the idea of Kahan summation extended to the accumulation of a dot product. It makes no significant difference here.

The output of my code below, when compiled with the strictest IEEE-754 compliance that compilers make available (for my Intel compiler, that is /fp:strict), should look similar to this:

Using FMA-based dot product:
dot(d,c)   = -1.0326118360251935e-006
dot(e,c)   =  4.3370577648224470e-006
s*dot(e,c) = -8.5858517031396220e-007
Using FMA-based compensated dot product:
dot(d,c)   = -1.1393800219802703e-006
dot(e,c)   =  3.0970281801622503e-006
s*dot(e,c) = -6.1310284799506335e-007

#include <cstdio>
#include <cstdlib>
#include <cmath>

typedef struct {
    double x;
    double y;
} double2;

typedef struct {
    double x;
    double y;
    double z;
} double3;

/*
  diff_of_prod() computes a*b-c*d with a maximum error < 1.5 ulp

  Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller, 
  "Further Analysis of Kahan's Algorithm for the Accurate Computation 
  of 2x2 Determinants". Mathematics of Computation, Vol. 82, No. 284, 
  Oct. 2013, pp. 2245-2264
*/
double diff_of_prod (double a, double b, double c, double d)
{
    double w = d * c;
    double e = fma (-d, c, w);
    double f = fma (a, b, -w);
    return f + e;
}

double3 scale (double3 a, double s)
{
    double3 r;
    r.x = s * a.x;
    r.y = s * a.y;
    r.z = s * a.z;
    return r;
} 

double dot (double3 a, double3 b)
{
    return fma (a.x, b.x, fma (a.y, b.y, a.z * b.z));
}

double3 cross (double3 a, double3 b)
{
    double3 r;
    r.x = diff_of_prod (a.y, b.z, a.z, b.y);
    r.y = diff_of_prod (a.z, b.x, a.x, b.z);
    r.z = diff_of_prod (a.x, b.y, a.y, b.x);
    return r;
}

/* returns the product of a and b as a double-double */
double2 TwoProdFMA (double a, double b)
{
    double2 r;
    r.x = a * b;
    r.y = fma (a, b, -r.x);
    return r;
}

/* returns the sum of a and b as a double-double. Knuth TAOCP */
double2 TwoSum (double a, double b)
{
    double2 res;
    double s, r, t;
    s = a + b;
    t = s - a;
    r = (a - (s - t)) + (b - t);
    res.x = s;
    res.y = r;
    return res;
}

/*
  S. Graillat, Ph. Langlois and N. Louvet, "Accurate dot products with FMA",
  In: RNC-7, Real Numbers and Computer Conference, Nancy, France, July 2006,
  pp. 141-142
*/
double compensated_dot (double3 x, double3 y)
{
    double2 t1, t2, t3;
    double sb, cb, pb, pi, sg;

    t1 = TwoProdFMA (x.x, y.x);
    sb = t1.x;
    cb = t1.y;

    t2 = TwoProdFMA (x.y, y.y);
    pb = t2.x;
    pi = t2.y;
    t3 = TwoSum (sb, pb);
    sb = t3.x;
    sg = t3.y;
    cb = (pi + sg) + cb;

    t2 = TwoProdFMA (x.z, y.z);
    pb = t2.x;
    pi = t2.y;
    t3 = TwoSum (sb, pb);
    sb = t3.x;
    sg = t3.y;
    cb = (pi + sg) + cb;

    return sb + cb;
}

int main (void)
{
    double3 b = {0.4383006177615909, -0.017762134447941058, 0.56005552104818945};
    double3 c = {-178151.26386435505, 159388.59511391702, -720098.47337336652};
    double s = -0.19796489160874975;
    double3 d = scale (cross (b, c), s);
    double3 e = cross (b, c);

    printf ("Using FMA-based dot product:\n");
    printf ("dot(d,c)   = % 23.16e\n", dot (d, c));
    printf ("dot(e,c)   = % 23.16e\n", dot (e, c));
    printf ("s*dot(e,c) = % 23.16e\n", s * dot (e, c));

    printf ("Using FMA-based compensated dot product:\n");
    printf ("dot(d,c)   = % 23.16e\n", compensated_dot (d, c));
    printf ("dot(e,c)   = % 23.16e\n", compensated_dot (e, c));
    printf ("s*dot(e,c) = % 23.16e\n", s * compensated_dot (e, c));

    return EXIT_SUCCESS;
}
Mweru answered 29/8, 2020 at 0:33 Comment(3)
Thanks for your very interesting answer! I added a comment on @Miguel's answer but did not get a reply. Could you maybe comment on why there is such a difference in the error when scaling the cross product vs. scaling one of the vectors before taking the cross product, even though the scaling factor is almost 1? That was indeed quite a pattern when sampling 1e4 random vectors and repeating the same tests.Emotionalism
@David: I am not sure I understand your question. The fact that you happened to get a result of 0 in one instance seems purely coincidental to me. In general, you would want to be always looking at relative, not absolute error, when dealing with floating point computation.Mweru
I did some testing again and it was indeed coincidental. Sorry about that.Emotionalism
