Is there a numerically optimal order of matrix multiplication?

TL;DR: The question is about multiplication ACCURACY

I have to multiply matrices A (100x8000), B (8000x27) and C (27x1).

Since matrices B and C are constant and A is variable, I prefer to calculate the product as ABC = np.dot(A, np.dot(B, C)). However, I wonder whether this may be numerically worse (in terms of accuracy) than np.dot(np.dot(A, B), C).

What may be important: matrices A and B contain 8000 samples of (respectively) 100 and 27 correlated features.

Is there a numerically optimal (in terms of accuracy) order of the multiplication? If yes - how may I determine it?

Special Case

It may be assumed that both matrices A and B are nonnegative. Moreover:

C = np.linalg.solve(cov(B, k), X)

where X is a 27x1 matrix of 27 (possibly correlated) random variables of unknown distribution, cov = lambda X, k: np.dot(X.T, X) + k * np.eye(X.shape[1]), and k is a nonnegative constant minimizing the expression:

sum((X[i, 0] - np.dot(np.dot(B[:, [i]].T, drop(B, i)),
                      np.linalg.solve(cov(drop(B, i), k),
                                      np.delete(X, i, axis=0)))) ** 2
    for i in range(27))

The drop() function is defined as lambda X, i: np.delete(X, i, axis=1).
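
A runnable sketch of selecting k this way is below; the use of scipy.optimize.minimize_scalar and the search interval (0, 1000) are my assumptions, and any one-dimensional minimizer over nonnegative k would do:

import numpy as np
from scipy.optimize import minimize_scalar

cov = lambda X, k: np.dot(X.T, X) + k * np.eye(X.shape[1])
drop = lambda X, i: np.delete(X, i, axis=1)

def loo_error(k, B, X):
    # The leave-one-out expression above, viewed as a function of k.
    return sum(
        (X[i, 0] - np.dot(np.dot(B[:, [i]].T, drop(B, i)),
                          np.linalg.solve(cov(drop(B, i), k),
                                          np.delete(X, i, axis=0))).item()) ** 2
        for i in range(B.shape[1]))

# k is a nonnegative constant, hence the bounded search.
res = minimize_scalar(lambda k: loo_error(k, B, X),
                      bounds=(0.0, 1e3), method="bounded")
C = np.linalg.solve(cov(B, res.x), X)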

Even More Special Case

It may be assumed that np.cov(B.T) (the 27x27 sample covariance of B's columns) is the covariance matrix of X, which follows a multivariate Gaussian distribution.
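
For illustration only, that assumption could be set up like this (the zero mean and the RNG seed are my choices; B as in the question):

import numpy as np

Sigma = np.cov(B.T)   # 27x27 sample covariance of B's columns
X = np.random.default_rng(0).multivariate_normal(
        np.zeros(27), Sigma).reshape(-1, 1)   # a single 27x1 draw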

Hardening answered 8/7, 2019 at 11:33 Comment(11)
When B and C are constant, just save the product once and reuse it for all following calculations. I guess that will always be the fastest way: evaluating the expression for n different A matrices then takes only n+1 matrix multiplications (otherwise you need something like 2n).Pea
It might help to specify why you are worried about accuracy. The answer might be different depending on whether the matrices contain floats or large integers, for instance.Blackpoll
@Blackpoll they are all floats. I am worried about the accuracy, as I am going to change the current implementation (AB)C (which is considered accurate) for performance purposes. Unfortunately, I do not remember any recommendations on matrix multiplication order from my numerical analysis class. For sure I am going to run tests with different precision, but I am looking for more solid, theoretical background here. So far my only idea is to write naive expressions for elements of ABC and estimate the error.Hardening
This question (I think) cannot be answered in general without specifying the matrices. The most "accurate" result will depend on numerical roundoff which depends on the relative sizes of the values in the matrices.Isoline
Are the features also correlated between A and B? Because, if all the 100x27 pairwise correlations are significantly different from zero, I think you can't go too wrong multiplying B and C first.Appearance
@PaulPanzer Yes, they are supposed to be correlated, though it is not guaranteed that every pair of features is correlated.Hardening
It would be enough if each feature in A were correlated with the linear combination of features given by BC, in other words ABC should have no entries that are very small as a result of cancellation. As far as I can tell cancellation is the only potential source of substantial loss of accuracy here, and if it's unavoidable the rule of thumb is to get it over with early on because any roundoff error that happens before will be greatly amplified. So in this scenario order of multiplication may be important.Appearance
@PaulPanzer I have provided more information on the matrices.Hardening
@Isoline dittoHardening
k doesn't seem to occur in the definition of covAppearance
@PaulPanzer fixed, thanks!Hardening

At the moment the best idea I have (for a particular set of matrices) is to perform the following numerical experiment:

  1. Calculate a reference matrix as an average of the products calculated with high precision (e.g. np.float128).
  2. Calculate test products with lower precision (np.float64, np.float32, even np.float16).
  3. Analyse the errors, calculated as the difference between the test products and the reference matrix. The errors are expected to shrink as the precision increases. A sketch of this experiment is given below.
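
A minimal sketch of that experiment, assuming random data in the question's shapes; np.longdouble stands in for np.float128, which is not available on every platform:

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 8000))
B = rng.random((8000, 27))
C = rng.random((27, 1))

# 1. Reference product in extended precision.
L = np.longdouble
ref = (A.astype(L) @ B.astype(L)) @ C.astype(L)

# 2./3. Both orders at decreasing precision, compared against the reference.
for dt in (np.float64, np.float32, np.float16):
    a, b, c = A.astype(dt), B.astype(dt), C.astype(dt)
    err_ab_c = np.max(np.abs((a @ b) @ c - ref))
    err_a_bc = np.max(np.abs(a @ (b @ c) - ref))
    print(f"{np.dtype(dt).name}: max |(AB)C - ref| = {err_ab_c:.3e}, "
          f"max |A(BC) - ref| = {err_a_bc:.3e}")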
Hardening answered 11/7, 2019 at 9:10 Comment(3)
I do not see that this error analysis holds. In sum(AB[i,k] * C[k,0]), there are both multiplications and additions. Both operations are subject to rounding errors, unless np.dot guarantees it will use fused multiply-add. In any case, the rounding errors will generally vary in each operation; they will not all be a single value E. Additionally, the result of adding N terms with an error of E in each addition does not have an error (1+E)**(N-1).Aikens
For example, with three terms A, B, and C, and assuming each addition does introduce a rounding error of E, the result would be ((A+B)*(1+E)+C)*(1+E), which differs from (A+B+C)*(1+E)**2.Aikens
@EricPostpischil You are absolutely right, thanks. Numeric experiment then. ;)Hardening

I suggest using the formula that results in the least floating-point accumulation error.

  • (AB)C : each element of (AB) is a dot product of a row of A with a column of B. For each column of B, find the ratio of the largest element to the smallest, then take the maximum of those per-column ratios. Call this B_ratio.
  • A(BC) : let T = BC, an 8000x1 matrix. Each element of A(BC) is then a dot product of a row of A with the single column of T. For that one column, find the ratio of the largest element to the smallest. Call this T_ratio.

Assuming a uniform distribution of values in A, whichever option has the smaller absolute ratio is the more numerically stable; i.e. if fabs(B_ratio) < fabs(T_ratio), expect (AB)C to be better. This check is sketched below.
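
A minimal sketch of this check, assuming B and C are the matrices from the question (and contain no exact zeros, so the ratios are well defined):

import numpy as np

def max_column_spread(M):
    # Largest-to-smallest absolute element ratio per column,
    # then the worst (largest) of those per-column ratios.
    M = np.abs(M)
    return np.max(M.max(axis=0) / M.min(axis=0))

B_ratio = max_column_spread(B)      # spread of B's columns, relevant to (AB)C
T_ratio = max_column_spread(B @ C)  # spread of T = BC, relevant to A(BC)

order = "(AB)C" if B_ratio < T_ratio else "A(BC)"
print(f"B_ratio={B_ratio:.3g}, T_ratio={T_ratio:.3g} -> prefer {order}")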

Rationale: Error accumulates when adding large and small numbers -- the smaller numbers' lower bits get "lost in the noise". By keeping the factors within a smaller absolute spread, the chance of losing individual small terms' contributions decreases.

Floating Point Details

When you add IEEE 754 floating point numbers z = x + y, one of these cases can occur:

  1. z.exponent == max(x.exponent, y.exponent) && z.mantissa != max(x, y).mantissa, which means the mantissas summed with no carry-out. No error is introduced.
  2. z.exponent == max(x.exponent, y.exponent) + 1, which means the mantissas summed with a carry-out. The lowest bit of precision is lost, introducing an error of up to 2^(z.exponent - MANTISSA_BITS).
  3. z == max(x, y), meaning either x or y was larger by a factor of at least 2^MANTISSA_BITS, resulting in the total loss of the other input (it is entirely below the noise threshold).

For numerical stability, you can minimize the accumulated error by sorting the numbers first and accumulating from smallest to largest. This keeps case #3 above from occurring as often, although it can still happen. A quick demonstration follows.
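
A small demonstration of that rule, in float32 so the effect is easy to see; the data is made up for illustration:

import numpy as np

rng = np.random.default_rng(1)
x = (rng.random(10_000) * 1e-4).astype(np.float32)
x[0] = np.float32(1e4)                 # one term vastly larger than the rest

exact = np.sum(x.astype(np.float64))   # float64 reference

naive = np.float32(0)
for v in x:                            # large term first: case #3 hits often
    naive += v

ordered = np.float32(0)
for v in np.sort(x):                   # accumulate smallest to largest
    ordered += v

print(f"unsorted error: {abs(naive - exact):.3e}")
print(f"sorted error:   {abs(ordered - exact):.3e}")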

Additional Reading

What Every Computer Scientist Should Know About Floating-Point Arithmetic

Suspensive answered 16/7, 2019 at 1:48 Comment(1)
Sorry, but I'm totally not buying your reasoning. The loss of digits that cannot even be represented in fp is in itself not a serious concern. It can only become problematic if the number of terms is insanely large (not the case here) or in the event of subsequent cancellation, because then and only then will the lost digits not be tiny relative to the final result.Appearance

Won't multiplying three matrices always be slower than multiplying only two?

You really have only two options: (AB)C and A(BC). Since BC = const, you can have a constant T = BC of shape 8000x1 and then multiply AT without having to recalculate T.
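
As a sketch (B and C as defined in the question):

T = np.dot(B, C)       # 8000x1, computed once, since B and C never change
ABC = np.dot(A, T)     # reused for every new A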

Bussell answered 8/7, 2019 at 11:39 Comment(1)
That is exactly my idea. ;) However I wonder whether AT is as accurate as (AB)C.Hardening

Hm, I don't think there's a difference in speed between the multiplication orders; the calculation count should be the same. I also don't see a possible caching improvement like in other matrix calculations (for example, choosing the iteration order through an array).

The only point would be to save np.dot(B, C) in an extra matrix if B and C really won't change and you need the result in multiple calculations.

Pilsen answered 8/7, 2019 at 11:51 Comment(1)
Actually, there is a difference in speed - but it does not matter here. What matters is the accuracy.Hardening
