First try Eigen. Depending on the compiler, you might need to manually define macros to get proper SIMD: for Zen 2-3 you'd want EIGEN_VECTORIZE_AVX, EIGEN_VECTORIZE_FMA and EIGEN_VECTORIZE_AVX2, for Zen 4 also EIGEN_VECTORIZE_AVX512.
Also, be sure to enable OpenMP in project settings.
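For instance, a minimal baseline might look like this. This is a hedged sketch: whether you actually need these defines depends on your compiler, and on GCC/Clang it's usually enough to pass something like -march=znver3 instead.
// Sketch: with MSVC and /arch:AVX2, __FMA__ is not defined by the compiler,
// so Eigen's FMA macro may need to be defined manually before any Eigen header.
#define EIGEN_VECTORIZE_AVX
#define EIGEN_VECTORIZE_FMA
#define EIGEN_VECTORIZE_AVX2
#include <Eigen/Dense>

// Baseline matrix*vector product to compare the hand-written code against
Eigen::VectorXf multiplyBaseline( const Eigen::MatrixXf& mat, const Eigen::VectorXf& vec )
{
    return mat * vec;
}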
If you want to improve performance further, your #1 goal is saving memory bandwidth. Multiplying a matrix by a vector is practically guaranteed to bottleneck on memory, not compute.
Reshape the matrix into panels of consecutive rows, stored column by column: within each panel, a full column is contiguous in memory, so the 0-based memory indices of the elements run down each panel column, then move to the next column, then to the next panel.
Use a panel height of 32 for AVX, or 64 for AVX-512.
Also, don’t forget to align the data by at least the vector size, ideally by 64 bytes (a cache line).
Note the last panel of the matrix probably needs zero padding in its columns. And ideally, the output vector also needs a few extra elements to make its length a multiple of the panel height; otherwise you need special code to handle the last panel of the matrix.
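As a sketch of that packing step (the function name and the row-major input layout are my assumptions, untested):
#include <cstring>

// Repack a row-major rows*width matrix into panels of panelHeight consecutive rows.
// Within each panel the data is stored column by column, and the last panel is
// zero-padded so every panel holds exactly panelHeight rows.
// `dst` must hold panelCount * panelHeight * width floats, aligned by 64 bytes,
// e.g. allocated with _mm_malloc( byteCount, 64 ).
void packPanels( const float* src, size_t rows, size_t width, size_t panelHeight, float* dst )
{
    const size_t panelCount = ( rows + panelHeight - 1 ) / panelHeight;
    // Zero everything first; this takes care of the padding in the last panel
    std::memset( dst, 0, panelCount * panelHeight * width * sizeof( float ) );
    for( size_t r = 0; r < rows; r++ )
    {
        const size_t panel = r / panelHeight;
        const size_t rowInPanel = r % panelHeight;
        float* panelPtr = dst + panel * panelHeight * width + rowInPanel;
        for( size_t c = 0; c < width; c++ )
            panelPtr[ c * panelHeight ] = src[ r * width + c ];
    }
}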
In the inner loop, do something like this (untested).
#include <immintrin.h>

// Compute the product of a 32-row, `width`-column panel of the matrix by a vector
// of length `width`; the result is a vector of length 32
void multiplyInner_avx( const float* mat, const float* vec, size_t width, float* rdi )
{
    // Initialize the accumulators
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    // Accumulate the panel columns scaled by the corresponding vector elements
    const float* const vecEnd = vec + width;
    while( vec < vecEnd )
    {
        // Broadcast the current vector element into all 8 lanes
        const __m256 v = _mm256_broadcast_ss( vec );
        vec++;
        // Load one 32-float column of the panel, multiply and accumulate
        acc0 = _mm256_fmadd_ps( v, _mm256_load_ps( mat ), acc0 );
        acc1 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 8 ), acc1 );
        acc2 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 16 ), acc2 );
        acc3 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 24 ), acc3 );
        mat += 32;
    }
    // Store the 32 results
    _mm256_store_ps( rdi, acc0 );
    _mm256_store_ps( rdi + 8, acc1 );
    _mm256_store_ps( rdi + 16, acc2 );
    _mm256_store_ps( rdi + 24, acc3 );
}
For Zen 4 you'll need another version of the above to leverage AVX-512 vectors.
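A minimal sketch of that AVX-512 version, assuming panel height 64 and the same data layout (untested):
// Compute the product of a 64-row, `width`-column panel by a vector of length `width`;
// the result is a vector of length 64. Requires AVX-512F.
void multiplyInner_avx512( const float* mat, const float* vec, size_t width, float* rdi )
{
    __m512 acc0 = _mm512_setzero_ps();
    __m512 acc1 = _mm512_setzero_ps();
    __m512 acc2 = _mm512_setzero_ps();
    __m512 acc3 = _mm512_setzero_ps();
    const float* const vecEnd = vec + width;
    while( vec < vecEnd )
    {
        // Broadcast the current vector element into all 16 lanes
        const __m512 v = _mm512_set1_ps( *vec );
        vec++;
        // Load one 64-float column of the panel, multiply and accumulate
        acc0 = _mm512_fmadd_ps( v, _mm512_load_ps( mat ), acc0 );
        acc1 = _mm512_fmadd_ps( v, _mm512_load_ps( mat + 16 ), acc1 );
        acc2 = _mm512_fmadd_ps( v, _mm512_load_ps( mat + 32 ), acc2 );
        acc3 = _mm512_fmadd_ps( v, _mm512_load_ps( mat + 48 ), acc3 );
        mat += 64;
    }
    _mm512_store_ps( rdi, acc0 );
    _mm512_store_ps( rdi + 16, acc1 );
    _mm512_store_ps( rdi + 32, acc2 );
    _mm512_store_ps( rdi + 48, acc3 );
}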
In the outer loop, divide the matrix into batches of approximately equal size, so that the number of batches equals the number of hardware threads in your CPU. Dispatch each batch to a different CPU thread; an easy way to do that is OpenMP.
Ideally, make sure the dispatch is stable, i.e. when you call your multiplication function for different vectors, the same batches of the input matrix are dispatched to the same CPU cores.
Note there's no reduction anywhere, and no need for one. Different CPU cores simply compute different slices of the output vector, by multiplying different slices of the input matrix by the complete input vector.
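A rough sketch of that outer loop with OpenMP (the function name and the padded output length are my assumptions):
// `panelCount` = padded matrix height / 32; `result` must hold panelCount * 32 floats.
// schedule( static ) keeps the panel-to-thread assignment stable across calls.
void multiplyFull_avx( const float* mat, const float* vec, size_t width, size_t panelCount, float* result )
{
    // Signed loop counter for compatibility with OpenMP 2.0 compilers (e.g. MSVC)
    #pragma omp parallel for schedule( static )
    for( int i = 0; i < (int)panelCount; i++ )
    {
        const float* panel = mat + (size_t)i * width * 32;
        multiplyInner_avx( panel, vec, width, result + (size_t)i * 32 );
    }
}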
The magic numbers are more or less arbitrary; experiment to find the best ones. For instance, it's probably a good idea to increase the panel height by a factor of 2, making the multiplyInner_avx function produce 64 elements of the output, at the cost of 4 more vector load + FMA instructions inside the loop.
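For example, a 64-row AVX variant could look like this (untested, hypothetical name):
// Same as multiplyInner_avx, but with panel height 64: 8 accumulators instead of 4
void multiplyInner_avx_h64( const float* mat, const float* vec, size_t width, float* rdi )
{
    __m256 acc[ 8 ];
    for( int i = 0; i < 8; i++ )
        acc[ i ] = _mm256_setzero_ps();
    const float* const vecEnd = vec + width;
    while( vec < vecEnd )
    {
        const __m256 v = _mm256_broadcast_ss( vec );
        vec++;
        // Compilers generally unroll this fixed-count loop into 8 load + FMA pairs
        for( int i = 0; i < 8; i++ )
            acc[ i ] = _mm256_fmadd_ps( v, _mm256_load_ps( mat + i * 8 ), acc[ i ] );
        mat += 64;
    }
    for( int i = 0; i < 8; i++ )
        _mm256_store_ps( rdi + i * 8, acc[ i ] );
}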
Once you get a first working version, one possible micro-optimization is to manually unroll the inner loop by 4: use _mm256_broadcast_ps to load 4 vector elements with a single instruction, then _mm256_permute_ps with constants _MM_SHUFFLE(i,i,i,i), where i is in the [0..3] interval, to broadcast each of the 4 loaded vector elements into another vector.
After each permute, multiply/accumulate the broadcasted vector by the current column of the matrix, and advance the matrix pointer to the next column.
For the last 1-3 elements of the vector, use the current version with _mm256_broadcast_ss.
The _mm256_permute_ps instruction is cheap and, most importantly, on AMD it can run on vector port #2 (on Zen 4 also on port #3); these ports aren't used by anything else in that loop.
I'm not 100% sure this is going to help, because there are more instructions to decode and dispatch. But it might, because it saves some memory loads.
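Here's a rough sketch of that unrolled variant; the fmaColumn helper is hypothetical, and the whole thing is untested:
// Handles one panel column: broadcasted vector element times 32 matrix elements
inline void fmaColumn( __m256 v, const float* mat, __m256& acc0, __m256& acc1, __m256& acc2, __m256& acc3 )
{
    acc0 = _mm256_fmadd_ps( v, _mm256_load_ps( mat ), acc0 );
    acc1 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 8 ), acc1 );
    acc2 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 16 ), acc2 );
    acc3 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 24 ), acc3 );
}

void multiplyInner_avx_unrolled( const float* mat, const float* vec, size_t width, float* rdi )
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    const float* const vecEnd = vec + width;
    const float* const vecEnd4 = vec + ( width & ~(size_t)3 );
    while( vec < vecEnd4 )
    {
        // One 16-byte load brings 4 vector elements, duplicated into both 128-bit lanes
        const __m256 v4 = _mm256_broadcast_ps( ( const __m128* )vec );
        vec += 4;
        // Broadcast each of the 4 elements with a permute, one matrix column per element
        fmaColumn( _mm256_permute_ps( v4, _MM_SHUFFLE( 0, 0, 0, 0 ) ), mat, acc0, acc1, acc2, acc3 );
        fmaColumn( _mm256_permute_ps( v4, _MM_SHUFFLE( 1, 1, 1, 1 ) ), mat + 32, acc0, acc1, acc2, acc3 );
        fmaColumn( _mm256_permute_ps( v4, _MM_SHUFFLE( 2, 2, 2, 2 ) ), mat + 64, acc0, acc1, acc2, acc3 );
        fmaColumn( _mm256_permute_ps( v4, _MM_SHUFFLE( 3, 3, 3, 3 ) ), mat + 96, acc0, acc1, acc2, acc3 );
        mat += 128;
    }
    // The remaining 1-3 elements of the vector, same as the simple version
    while( vec < vecEnd )
    {
        fmaColumn( _mm256_broadcast_ss( vec ), mat, acc0, acc1, acc2, acc3 );
        vec++;
        mat += 32;
    }
    _mm256_store_ps( rdi, acc0 );
    _mm256_store_ps( rdi + 8, acc1 );
    _mm256_store_ps( rdi + 16, acc2 );
    _mm256_store_ps( rdi + 24, acc3 );
}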