I was tasked with implementing an optimised matrix multiplication micro-kernel that computes C = A*B in C++, starting from the following snippet of code. I am getting some counter-intuitive behaviour and I would like some help to better understand what is going on.
void mat_mul(double* A, double* B, double* C) {
  for (int n = 0; n < N; ++n) {
    for (int k = 0; k < K; ++k) {
      for (int m = 0; m < M; ++m) {
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }
}
The conditions for the challenge are as follows (a sketch of my size and alignment setup follows the list):
- M, N, K are defined at compile time
- K = 128
- M, N can be chosen at compile time but have to be ≤ 32
- The matrices are in column-major order
- The data alignment can be decided at compile time (default=32 byte alignment)
- The matrix A has dimension MxK
- The matrix B has dimension KxN
- The matrix C has dimension MxN
- The code must run on a single core of an AMD EPYC 7742 CPU (AVX2 + FMA support)
- Blocking is not allowed
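For concreteness, the compile-time constants and the aligned buffers in my test driver look roughly like this (the particular M and N values and the use of _mm_malloc are my own choices, not part of the challenge):

#include <immintrin.h>

// Compile-time sizes; K is fixed by the challenge, M and N are my pick.
constexpr int K = 128;
constexpr int M = 4;
constexpr int N = 8;

void mat_mul(double* A, double* B, double* C); // kernel under test

int main() {
    // Column-major buffers with the default 32-byte alignment.
    double* A = static_cast<double*>(_mm_malloc(M * K * sizeof(double), 32));
    double* B = static_cast<double*>(_mm_malloc(K * N * sizeof(double), 32));
    double* C = static_cast<double*>(_mm_malloc(M * N * sizeof(double), 32));
    // ... fill A and B, zero C, call mat_mul(A, B, C) ...
    _mm_free(A);
    _mm_free(B);
    _mm_free(C);
}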
By looking at the order of the loops, it seems like the memory access pattern already minimises cache misses, as we are iterating over the buffers linearly.
My first thought was to try and vectorise the code. This is what I came up with:
void mat_mul(double* A, double* B, double* C) {
  for (int n = 0; n < N; ++n) {
    for (int k = 0; k < K; ++k) {
      __m256d B_v = _mm256_broadcast_sd(B + k + n*K);
      for (int m = 0; m < M; m+=4) {
        __m256d A_v = _mm256_load_pd(A + m + k*M);
        __m256d C_v = _mm256_load_pd(C + m + n*M);
        __m256d rax = _mm256_fmadd_pd(A_v, B_v, C_v);
        _mm256_store_pd(C + m + n*M, rax);
      }
    }
  }
}
This reaches a maximum of around 23 GFLOPs with M=N=32. When decreasing N and M the performance drops.
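For context, the GFLOPs figures I quote come from a simple timing loop of roughly this shape (the repetition count and the std::chrono-based timing are my own choices):

#include <chrono>

// Rough throughput measurement: repeatedly call the kernel and divide
// the total flop count (2*M*N*K per call) by the elapsed time.
double measure_gflops(double* A, double* B, double* C, int reps) {
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int r = 0; r < reps; ++r)
        mat_mul(A, B, C);
    auto t1 = std::chrono::high_resolution_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return 2.0 * M * N * K * reps / seconds * 1e-9;
}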
After thinking about it for some time, I realised that the L1d cache of the EPYC 7742 is 32 KiB, which corresponds to 4096 doubles. Ideally I want all three matrices to fit in the L1d cache at once.
This yields the following optimisation problem:
Maximise M and N subject to M > 0, N > 0 and M*N + 128*M + 128*N ≤ 4096,
where the three terms count the doubles in C, A and B respectively.
I noticed that M = 4, N = 8 is a good enough solution that uses power-of-two values.
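As a quick check: with M = 4 and N = 8 the three matrices occupy 4*8 + 128*4 + 128*8 = 1568 doubles ≈ 12.25 KiB, comfortably within the 32 KiB L1d budget, whereas M = N = 32 would need 32*32 + 128*32 + 128*32 = 9216 doubles = 72 KiB.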
Given that M = 4, one column of C fits in a single __m256d, so I can eliminate the m-loop and keep that vector as an accumulation variable. This idea yielded the following code:
void mat_mul(double* A, double* B, double* C) {
  __m256d A_v, B_v;
  register __m256d C_v;
  double* aPtr;
  double* bPtr;
  double* cPtr;
  for (int n = 0; n < N; ++n) {
    cPtr = C + n*M;
    C_v = _mm256_load_pd(cPtr);
    for (int k = 0; k < K; ++k) {
      aPtr = A + k*M;
      bPtr = B + k + n*K;
      B_v = _mm256_broadcast_sd(bPtr);
      A_v = _mm256_load_pd(aPtr);
      C_v = _mm256_fmadd_pd(A_v, B_v, C_v);
    }
    _mm256_store_pd(cPtr, C_v);
  }
}
I thought this was going to be much faster; however, the performance I get is around 4 GFLOPs, which is identical to running the first code with the suboptimal M=4, N=8.
Given that in the first case (M=N=32) the matrices are too big to fit entirely in the L1d cache, this seems to suggest that the second version of the code is fetching from and writing to the L2 cache, even though the matrices should now fit in the L1d cache.
Is my suspicion correct? If so, is this behaviour caused by some mistake in my thought process, or is there some reason I am missing behind this?
Please give some explanation, rather than just posting a better optimised version of the code, as I would really like to better understand what is going on conceptually. Ideally it would be nice if you guys could give me a few hints on things I could look into myself.
Thanks :)
EDIT 1:
I tried following some of the tips suggested in the comments by @doug and @ElderBug.
@doug in particular suggested trying to transpose B; however, given that the matrices are in column-major format, I could not find a way to implement their idea. What I did instead was transpose A and accumulate the products in a tmp variable.
Here is what I came up with:
void mat_mul(double* A, double* B, double* C) {
  __m256d B_v, A_v, rax;
  // Transpose A (32-byte alignment needed for _mm256_load_pd)
  alignas(32) double AT[K*M];
  for (int k = 0; k < K; ++k) {
    for (int m = 0; m < M; ++m) {
      AT[m*K + k] = A[m + k*M];
    }
  }
  // Compute product
  for (int n = 0; n < N; ++n) {
    for (int m = 0; m < M; ++m) {
      double tmp = 0.;
      for (int k = 0; k < K; k+=4) {
        B_v = _mm256_load_pd(B + n*K + k);
        A_v = _mm256_load_pd(AT + m*K + k);
        rax = _mm256_mul_pd(A_v, B_v);
        double* pv = (double*)&rax;
        tmp += pv[0] + pv[1] + pv[2] + pv[3];
      }
      C[n*M + m] = tmp;
    }
  }
}
This still runs at around 4 GFLOPs with M=4, N=8. The elephant in the room seems to be the horizontal reduction of the rax vector; I was not able to find a more efficient way to do it.
If I remove the line tmp += pv[0] + pv[1] + pv[2] + pv[3]; the performance doubles, but it peaks at 14 GFLOPs with M=N=32, so this is still worse than my naive vectorisation approach.
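For clarity, the reduction in question is just a horizontal sum of the four lanes of rax; an intrinsics-only equivalent would look something like this sketch (functionally the same as the pointer-cast version, so on its own I would not expect it to change the picture much):

// Horizontal sum of a __m256d: returns v[0] + v[1] + v[2] + v[3].
static inline double hsum_pd(__m256d v) {
    __m128d lo   = _mm256_castpd256_pd128(v);    // lanes 0,1
    __m128d hi   = _mm256_extractf128_pd(v, 1);  // lanes 2,3
    __m128d sum2 = _mm_add_pd(lo, hi);           // {0+2, 1+3}
    __m128d swap = _mm_unpackhi_pd(sum2, sum2);  // {1+3, 1+3}
    return _mm_cvtsd_f64(_mm_add_sd(sum2, swap));
}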
If anyone has any further suggestion/comments they would be very much appreciated.
EDIT 2:
I forgot to mention that I am compiling the code using icc (ICC) 2021.5.0 20211109
with the following optimisation flags:
-O3 -funroll-loops -std=c++11 -mavx2 -march=native -fma -ftz -fomit-frame-pointer
EDIT 3:
The goal is to implement this serial micro-kernel in the best possible way so that I can then re-use it for a blocked parallel matrix multiplication. According to my calculations the theoretical peak should be 46 GFLOPs, so I am getting about 50%. OpenBLAS gets around 40 GFLOPs, so 32-35ish would be nice.
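(For reference, the peak estimate corresponds to 16 double-precision flops per cycle, i.e. two 256-bit FMA pipes × 4 doubles × 2 flops, at a sustained clock of roughly 2.9 GHz: 16 × 2.9 ≈ 46 GFLOPs.)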
It would be really nice if someone could give some insight on why some of the options I tried do not work so that I can think about how to fix them.
EDIT 4:
Thanks to @ElderBug's comment, I put some extra thought into how the store operations in my first vectorised implementation were being handled, and by doing some clever unrolling I was able to get to 40 GFLOPs, which is comparable to OpenBLAS!
Reading this also helped: https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#gemm-microkernel
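For anyone who finds this later, the unrolled version is roughly along the lines of the sketch below (not my exact code; it assumes M = 4 and N = 8 so that all of C fits in eight ymm registers, which gives eight independent FMA chains and a single store per column of C):

void mat_mul(const double* __restrict__ A, const double* __restrict__ B,
             double* __restrict__ C) {
    __m256d c[N]; // one accumulator per column of C (N = 8)
    for (int n = 0; n < N; ++n)
        c[n] = _mm256_load_pd(C + n*M);
    for (int k = 0; k < K; ++k) {
        __m256d a = _mm256_load_pd(A + k*M);               // column k of A
        for (int n = 0; n < N; ++n) {
            __m256d b = _mm256_broadcast_sd(B + k + n*K);  // B(k, n)
            c[n] = _mm256_fmadd_pd(a, b, c[n]);            // C(:,n) += A(:,k) * B(k,n)
        }
    }
    for (int n = 0; n < N; ++n)
        _mm256_store_pd(C + n*M, c[n]);
}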
Comments:
"You could declare the function as void mat_mul(double* __restrict__ A, double* __restrict__ B, double* C). The __restrict__ keyword tells the compiler that A, B and C are separate matrices, and this enables more optimizations. How does that change the results on your system?" – Pre
"I added __restrict__ to all versions of mat_mul and obtained no significant speedup. I forgot to mention I am compiling using icc (ICC) 2021.5.0 20211109." – Owing
"Try __restrict instead." – Pre