How to optimize this product of three matrices in C++ for x86?

I have a key algorithm in which most of its runtime is spent on calculating a dense matrix product:

A*A'*Y, where: A is an m-by-n matrix, 
               A' is its conjugate transpose,
               Y is an m-by-k matrix

Typical characteristics:
    - k is much smaller than both m and n (k is typically < 10)
    - m in the range [500, 2000]
    - n in the range [100, 1000]

Based on these dimensions, and following the lessons of the matrix chain multiplication problem, it's clearly optimal in a number-of-operations sense to structure the computation as A*(A'*Y). My current implementation does that, and the performance boost from simply forcing that associativity on the expression is noticeable.

My application is written in C++ for the x86_64 platform. I'm using the Eigen linear algebra library, with Intel's Math Kernel Library as a backend. Eigen is able to use MKL's BLAS interface to perform the multiplication, and the boost from moving from Eigen's native SSE2 implementation to Intel's optimized, AVX-based implementation on my Sandy Bridge machine is also significant.
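
For reference, this is the usual way of hooking Eigen up to MKL, if I have the setup right (the define has to appear before any Eigen header so the product kernels dispatch to MKL):

    // Route Eigen's dense products (among other things) through MKL's BLAS/LAPACK.
    #define EIGEN_USE_MKL_ALL
    #include <Eigen/Dense>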

However, the expression A * (A.adjoint() * Y) (written in Eigen parlance) gets decomposed into two general matrix-matrix products (calls to the xGEMM BLAS routine), with a temporary matrix created in between (roughly the two calls sketched after the list below). I'm wondering whether, by going to a specialized implementation that evaluates the entire expression at once, I can arrive at something faster than the generic implementation I have now. A couple of observations that lead me to believe this are:

  • Using the typical dimensions described above, the input matrix A usually won't fit in cache. Therefore, the specific memory access pattern used to calculate the three-matrix product would be key. Obviously, avoiding the creation of a temporary matrix for the partial product would also be advantageous.

  • A and its conjugate transpose obviously have a very related structure that could possibly be exploited to improve the memory access pattern for the overall expression.
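
For concreteness, this is roughly what those two calls look like when written out by hand against the BLAS interface directly (a sketch only; column-major storage and single-precision complex data are assumed, and product_via_two_gemms is just an illustrative name):

    #include <mkl.h>

    // A is m-by-n, Y is m-by-k, T is an n-by-k scratch buffer, R is the m-by-k result.
    void product_via_two_gemms(int m, int n, int k,
                               const MKL_Complex8* A, const MKL_Complex8* Y,
                               MKL_Complex8* T, MKL_Complex8* R)
    {
        const MKL_Complex8 one = {1.0f, 0.0f}, zero = {0.0f, 0.0f};
        // T = A^H * Y  (the n-by-k temporary)
        cblas_cgemm(CblasColMajor, CblasConjTrans, CblasNoTrans,
                    n, k, m, &one, A, m, Y, m, &zero, T, n);
        // R = A * T    (the final m-by-k result)
        cblas_cgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, k, n, &one, A, m, T, n, &zero, R, m);
    }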

Are there any standard techniques for implementing this sort of expression in a cache-friendly way? Most optimization techniques that I've found for matrix multiplication are for the standard A*B case, not larger expressions. I'm comfortable with the micro-optimization aspects of the problem, such as translating into the appropriate SIMD instruction sets, but I'm looking for any references out there for breaking this structure down in the most memory-friendly manner possible.

Edit: Based on the responses that have come in thus far, I think I was a bit unclear above. The fact that I'm using C++/Eigen is really just an implementation detail from my perspective on this problem. Eigen does a great job of implementing expression templates, but evaluating this type of problem as a simple expression just isn't supported (only products of 2 general dense matrices are).

At a higher level than how the expressions would be evaluated by a compiler, I'm looking for a more efficient mathematical breakdown of the composite multiplication operation, with a bent toward avoiding redundant memory accesses caused by the common structure of A and its conjugate transpose. The result would likely be difficult to implement efficiently in pure Eigen, so I would probably just implement it in a specialized routine with SIMD intrinsics.
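
To illustrate the flavor of breakdown I mean, here's a rough, untuned strawman (fused_AAtY and the panel width are names/values made up for illustration, and I haven't benchmarked this against two MKL GEMM calls). It uses the identity A*(A'*Y) = sum_j A_j*(A_j'*Y) over column panels A_j of A, so each panel is loaded from memory once and reused for both partial products while it's still in cache:

    #include <algorithm>
    #include <Eigen/Dense>

    // Sketch: evaluate A*(A'*Y) by column panels of A, reusing each panel for
    // both the A_j'*Y step and the A_j*T_j accumulation while it is cache-hot.
    Eigen::MatrixXcf fused_AAtY(const Eigen::MatrixXcf& A,
                                const Eigen::MatrixXcf& Y,
                                int panel_cols = 64)  // untuned guess
    {
        const Eigen::Index m = A.rows(), n = A.cols(), k = Y.cols();
        Eigen::MatrixXcf result = Eigen::MatrixXcf::Zero(m, k);
        for (Eigen::Index j = 0; j < n; j += panel_cols) {
            const Eigen::Index b = std::min<Eigen::Index>(panel_cols, n - j);
            const auto Aj = A.middleCols(j, b);       // m-by-b panel of A
            Eigen::MatrixXcf Tj = Aj.adjoint() * Y;   // b-by-k partial product
            result.noalias() += Aj * Tj;              // accumulate the m-by-k result
        }
        return result;
    }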

Strigose asked 27/2, 2014 at 15:38 Comment(15)
The temporary matrix you mention - could that have to do with this? eigen.tuxfamily.org/dox-devel/…. It's interesting that you have A A' Y by the way, instead of the more common A Y A'. In addition, I don't know how you construct A, but maybe you can construct A A' as you go along (not sure if there is an efficient way).Eolith
@CompuChip: No, I think it has more to do with the fact that when evaluating the matrix product, it can only calculate the product of two matrices at a time. Thus, you need a temporary in order to multiply by the third matrix. Note that due to the dimensions of A, constructing A*A' and then multiplying by it results in many more required operations (hence the link to the matrix chain multiplication article), so it's much better to insert the parentheses as shown.Strigose
I presume you've looked at the number of threads used by Eigen / disabling hyper-threading, as in the accepted answer here : #14783719 ?Irita
@GrahamGriffiths: Thanks for the idea. I'm currently running in single-threaded mode for reasons outside the scope of this particular question; I'm really looking just at algorithmic optimization.Strigose
The standard solution is expression templates, but Eigen apparently already implements them. Be sure you're doing whatever is required to take advantage of that.Schroth
so I think Y fits in the cache, and also the partial product (A.adjoint() * Y) fits in the cache. Is there a matrix mult optimization where just one of your matrices fits wholly in the cache?Irita
@GrahamGriffiths There are cache oblivious algorithms for matrix multiplication.Schroth
How often do matrices A and Y change? If A changes seldom, could you do the product A*A' outside the loop?Machree
@MikeDunlavey: A changes more often than Y does. However, see my comment below on ggael's answer as to why forming the product A*A' is a bad idea.Strigose
is your matrix A dense or sparse?Eth
@sebas: Both A and Y are dense. I updated the OP to reflect this.Strigose
How do you know it's not optimized? Have you calculated the FLOPS and compared to the peak FLOPS? The number of flops should be 2.0*m*n*m + 2.0*m*m*k I think. Note that I think it's going to be a huge investment in time to try and optimize this better than MKL so this is only worth doing for education purposes.Freeboot
Note also that for large dense matrix multiplication, if the algorithm is optimized (as it is in MKL), the cache does not matter: it's computation bound not memory bound.Freeboot
Ah, I just read the matrix chain link. The flops for (A*A')*Y are 2.0*m*n*m + 2.0*m*m*k, but for A*(A'*Y) they're 2.0*(n*m*k) + 2.0*(m*n*k). No wonder it's faster.Freeboot
Migrate to scientific computation. This Q involves intel mkl, matrix-matrix multiplication, the experts are not here (programmers will suggest writing your own thing, which is generally the wrong answer for num comp).Darien

This is not a full answer (yet - and I'm not sure it will become one).

Let's think of the math first a little. Since matrix multiplication is associative we can either do (A*A')*Y or A*(A'*Y).

Floating point operations for (A*A')*Y

2*m*n*m + 2*m*m*k //the twos come from addition and multiplication

Floating point operations for A*(A'*Y)

2*m*n*k + 2*m*n*k = 4*m*n*k

Since k is much smaller than m and n it's clear why the second case is much faster.
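
Plugging in example sizes of m = 1000, n = 100, k = 4 (the case mentioned in the comments) makes this concrete:

(A*A')*Y: 2*1000*100*1000 + 2*1000*1000*4 = 2.08*10^8 flops
A*(A'*Y): 4*1000*100*4 = 1.6*10^6 flops

so the second ordering needs roughly 130 times fewer operations.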

But by symmetry we could in principle reduce the number of calculations for A*A' by a factor of two (though this might not be easy to do with SIMD), so we could reduce the number of floating point operations of (A*A')*Y to

m*n*m + 2*m*m*k.

We know that both m and n are larger than k. Let's set m = n = z and find out where cases one and two break even:

z*z*z + 2*z*z*k = 4*z*z*k  //now simplify
z = 2*k.

So as long as m and n are both more than twice k, the second case will have fewer floating point operations. In your case m and n are both more than 100 and k is less than 10, so case two uses far fewer floating point operations.

In terms of efficient code: if the code is optimized for efficient use of the cache (as MKL and Eigen are), then large dense matrix multiplication is computation bound, not memory bound, so you don't have to worry about the cache. MKL is faster than Eigen since MKL uses AVX (and maybe FMA3 now?).

I don't think you will be able to do this more efficiently than you're already doing using the second case and MKL (through Eigen). Enable OpenMP to get maximum FLOPS.

You should calculate the efficiency by comparing your measured FLOPS to the peak FLOPS of your processor. Assuming you have a Sandy Bridge/Ivy Bridge processor, the peak single-precision (SP) FLOPS is

frequency * number of physical cores * 8 (8-wide AVX SP) * 2 (addition + multiplication)

For double precision, divide by two. If you have Haswell and MKL uses FMA, then double the peak FLOPS. To get the frequency right you have to use the turbo boost value for all cores (it's lower than for a single core). You can look this up if you have not overclocked your system, or use CPU-Z on Windows or Powertop on Linux if you have.
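
As a worked example (ignoring turbo), a single Sandy Bridge core at 3.4 GHz gives

3.4e9 * 1 * 8 * 2 = 54.4 SP GFLOPS (27.2 GFLOPS in double precision)

which is the number to compare a single-threaded measurement against.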

Freeboot answered 28/2, 2014 at 9:26 Comment(7)
As far as I can tell MKL supports FMA3 now for Haswell software.intel.com/en-us/articles/whats-new-in-intel-mkl. MKL GEMM is probably a lot faster on Haswell then.Freeboot
Thanks for the observations. Shame on me for prematurely optimizing without doing an isolated benchmark and comparison against the theoretical numbers. For m=1000, n=100, k=4, I'm seeing around 48-50 GFLOPS on my Sandy Bridge machine @ 3.4 GHz, so that's pretty close to the theoretical maximum. Using 4 threads, I can get about a 2.7x scaling as well. I think you're right, I don't know that there's anything to be gained here unless there was a way to significantly reduce the number of overall ops.Strigose
@JasonR, what formula did you use to calculate the flops? For case two FLOPS = 4*m*n*k/time. So for m=1000, n=100, k=4 that's 1.6*10^6/time -> time is 32µs.Freeboot
I used that formula. I had to tweak it a bit because I'm using complex arithmetic (1 complex multiplication = 4 real multiplications and 4 real additions and 1 complex addition = 2 real additions).Strigose
Technically it's not exactly a factor of two. The dot product is n multiplications and n-1 additions. For a square matrix it's then n*n*(n + n-1) flops = 2*n*n*n - n*n. Most people throw out the second term for large matrices.Freeboot
But for BLAS GEMM, I think you would consider each dot product to be n multiplications and n additions, since you're forming the result C + A*B. In my case C would be initialized with zeros, but those additions still would be executed.Strigose
Oh, good point! A*B has 2*n*n*n - n*n flops but A*B + C is 2*n*n*n flops. I just learned something. Thanks!Freeboot

Use a temporary matrix to compute A'*Y, but make sure you tell Eigen that there's no aliasing going on: temp.noalias() = A.adjoint()*Y. Then compute your result, once again telling Eigen that the objects aren't aliased: result.noalias() = A*temp.
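
Spelled out as a minimal sketch (complex single precision and the example sizes from the question's comments are assumed here):

    #include <Eigen/Dense>

    int main() {
        const int m = 1000, n = 100, k = 4;            // example sizes only
        Eigen::MatrixXcf A = Eigen::MatrixXcf::Random(m, n);
        Eigen::MatrixXcf Y = Eigen::MatrixXcf::Random(m, k);

        Eigen::MatrixXcf temp(n, k), result(m, k);
        temp.noalias()   = A.adjoint() * Y;            // first GEMM: n-by-k temporary
        result.noalias() = A * temp;                   // second GEMM: m-by-k result
        return 0;
    }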

Dola answered 27/2, 2014 at 15:59 Comment(1)
Thanks for the suggestion. I don't believe the construction of the temporary itself is the critical path. Adjusting the expression to use explicit temporaries that are specified to be non-aliased yields a speedup of a couple percent or so. I'm instead after a better scheme of evaluating the overall expression that would have more favorable memory access patterns on the matrices involved.Strigose

There would be redundant computation only if you performed (A*A')*Y, since in this case (A*A') is symmetric and only half of the computation is required. However, as you noticed, it is still much faster to perform A*(A'*Y), in which case there are no redundant computations. I confirm that the cost of the temporary creation is completely negligible here.
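
For reference, the symmetry saving the first sentence refers to maps to a rank update (xSYRK/xHERK in BLAS terms) rather than a full GEMM. A sketch, assuming complex matrices (via_AAt is just an illustrative name):

    #include <Eigen/Dense>

    // Form only one triangle of A*A', then multiply the self-adjoint view by Y.
    // Still far more flops than A*(A'*Y) for the dimensions in the question.
    Eigen::MatrixXcf via_AAt(const Eigen::MatrixXcf& A, const Eigen::MatrixXcf& Y)
    {
        Eigen::MatrixXcf C = Eigen::MatrixXcf::Zero(A.rows(), A.rows());
        C.selfadjointView<Eigen::Lower>().rankUpdate(A);   // C = A * A', lower triangle only
        return C.selfadjointView<Eigen::Lower>() * Y;      // self-adjoint * dense product
    }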

Criterion answered 27/2, 2014 at 16:57 Comment(2)
I guess I am not understanding this but how will (A'*Y) be faster?Aphanite
@JohnOdom: See this page for a description of the differences in total number of operations that can occur just by changing the associativity of a matrix multiplication expression. In this case, since k (the number of columns of Y) is much smaller than either dimension of A, it behooves you to do that multiplication first, as the intermediate result will also only have k columns. If you do A*A' first, forming that product requires many more operations.Strigose

I guess that performing the following

result = A * (A.adjoint() * Y)

will be the same as doing this

temp = A.adjoint() * Y
result = A * temp;

If your matrix Y fits in the cache, you can probably take advantage of doing it like this

result = A * (Y.adjoint() * A).adjoint()

or, if the previous notation is not allowed, like this

temp = Y.adjoint() * A
result = A * temp.adjoint();

Then you don't need to take the adjoint of matrix A or store a temporary adjoint matrix for A, which would be much more expensive than the one for Y.

If your matrix Y fits in the cache, it should be much faster to run a loop over the columns of A for the first multiplication, and then over the rows of A for the second multiplication (having Y.adjoint() in the cache for the first multiplication and temp.adjoint() for the second), but I guess that internally Eigen is already taking care of those things.

Eth answered 27/2, 2014 at 18:51 Comment(1)
Thanks for the suggestion. I tried that variation (computing A*(Y'*A)' instead) and saw no difference in performance. I do know that Eigen is smart enough to not explicitly calculate the adjoint of A. The return value of A.adjoint() is an expression object that just wraps A with the storage order reversed (from column-major to row-major). So, it doesn't actually make a copy of A or anything.Strigose
