I have a key algorithm that spends most of its runtime calculating a dense matrix product:

A*A'*Y

where:

- A is an m-by-n matrix,
- A' is its conjugate transpose, and
- Y is an m-by-k matrix.
Typical characteristics:
- k is much smaller than both m and n (k is typically < 10)
- m in the range [500, 2000]
- n in the range [100, 1000]
Based on these dimensions, and following the lessons of the matrix chain multiplication problem, it's clear that it's optimal in a number-of-operations sense to structure the computation as A*(A'*Y). My current implementation does that, and the performance boost from simply forcing that associativity onto the expression is noticeable.
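For concreteness, counting each multiply-add pair as two flops (this matches Freeboot's count in the comments below):

    cost((A*A')*Y) = 2*m*n*m + 2*m*m*k
    cost(A*(A'*Y)) = 2*n*m*k + 2*m*n*k = 4*m*n*k

At the upper end of the typical dimensions (m = 2000, n = 1000, k = 10), that's roughly 8.1e9 flops versus 8.0e7 flops, about a factor of 100.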
My application is written in C++ for the x86_64 platform. I'm using the Eigen linear algebra library, with Intel's Math Kernel Library (MKL) as a backend. Eigen is able to use MKL's BLAS interface to perform the multiplication, and the boost from moving from Eigen's native SSE2 implementation to Intel's optimized, AVX-based implementation on my Sandy Bridge machine is also significant.
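Concretely, the hot path currently looks something like this (a minimal sketch; double precision and the function name are illustrative, and in the complex case adjoint() is the conjugate transpose):

```cpp
#include <Eigen/Dense>

// Current approach: let Eigen/MKL evaluate the parenthesized expression.
// The inner product produces an n-by-k temporary; each product maps to
// one xGEMM call.
Eigen::MatrixXd applyNormalOp(const Eigen::MatrixXd& A,
                              const Eigen::MatrixXd& Y)
{
    Eigen::MatrixXd T = A.adjoint() * Y;  // n-by-k temporary (first xGEMM)
    Eigen::MatrixXd R = A * T;            // m-by-k result    (second xGEMM)
    return R;
}
```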
However, the expression A * (A.adjoint() * Y) (written in Eigen parlance) gets decomposed into two general matrix-matrix products (calls to the xGEMM BLAS routine), with a temporary matrix created in between. I'm wondering whether, by going to a specialized implementation that evaluates the entire expression at once, I can arrive at something faster than the generic implementation I have now. A couple of observations that lead me to believe this are:
- Using the typical dimensions described above, the input matrix A usually won't fit in cache, so the specific memory access pattern used to calculate the three-matrix product would be key. Obviously, avoiding the creation of a temporary matrix for the partial product would also be advantageous.
- A and its conjugate transpose obviously have a very related structure that could possibly be exploited to improve the memory access pattern for the overall expression; one way to make this concrete is sketched just below.
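For example (a sketch of one possibility, plain linear algebra rather than anything Eigen provides): split A into column panels, A = [B_1 B_2 ... B_p], where each B_j is m-by-b for some panel width b. Then

    A*(A'*Y) = B_1*(B_1'*Y) + B_2*(B_2'*Y) + ... + B_p*(B_p'*Y)

so each panel B_j is read once for the small product B_j'*Y and then immediately reused for the accumulation into the result, while it is (hopefully) still in cache.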
Are there any standard techniques for implementing this sort of expression in a cache-friendly way? Most optimization techniques that I've found for matrix multiplication are for the standard A*B case, not larger expressions. I'm comfortable with the micro-optimization aspects of the problem, such as translating into the appropriate SIMD instruction sets, but I'm looking for any references on breaking this structure down in the most memory-friendly manner possible.
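To illustrate the kind of breakdown I mean, here is a rough, untested sketch of the panel idea above in Eigen (the panel width of 64 columns is an arbitrary placeholder; a real implementation would tune it to cache size and hand-vectorize the two small per-panel products):

```cpp
#include <algorithm>
#include <Eigen/Dense>

// Fused evaluation of A*(A'*Y) by column panels: each panel Bj of A is
// used for Bj'*Y and then immediately for the accumulation R += Bj*Tj,
// so A is streamed through cache only once.
Eigen::MatrixXd fusedNormalOp(const Eigen::MatrixXd& A,
                              const Eigen::MatrixXd& Y,
                              Eigen::Index panel = 64)
{
    const Eigen::Index n = A.cols(), k = Y.cols();
    Eigen::MatrixXd R = Eigen::MatrixXd::Zero(A.rows(), k);
    for (Eigen::Index j = 0; j < n; j += panel) {
        const Eigen::Index w = std::min(panel, n - j);
        const auto Bj = A.middleCols(j, w);           // m-by-w column panel
        const Eigen::MatrixXd Tj = Bj.adjoint() * Y;  // w-by-k partial product
        R.noalias() += Bj * Tj;                       // reuse Bj while hot
    }
    return R;
}
```

Whether something like this can beat two big xGEMM calls presumably depends on whether a panel of A, plus Y and the running result, fit in cache together.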
Edit: Based on the responses that have come in thus far, I think I was a bit unclear above. The fact that I'm using C++/Eigen is really just an implementation detail from my perspective on this problem. Eigen does a great job of implementing expression templates, but evaluating this type of problem as a single expression just isn't supported (only products of two general dense matrices are).

At a higher level than how the expressions would be evaluated by a compiler, I'm looking for a more efficient mathematical breakdown of the composite multiplication operation, with a bent toward avoiding redundant memory accesses that arise from the common structure of A and its conjugate transpose. The result would likely be difficult to implement efficiently in pure Eigen, so I would probably implement it in a specialized routine with SIMD intrinsics.
Comments:

- Given the dimensions of A, constructing A*A' and then multiplying by it results in many more required operations (hence the link to the matrix chain multiplication article), so it's much better to insert the parentheses as shown. – Strigose
- A changes more often than Y does. However, see my comment below on ggael's answer as to why forming the product A*A' is a bad idea. – Strigose
- A and Y are dense. I updated the OP to reflect this. – Strigose
- 2.0*m*n*m + 2.0*m*m*k, I think. Note that I think it's going to be a huge investment in time to try to optimize this better than MKL, so this is only worth doing for education purposes. – Freeboot