Multi-threaded fixed-size matrix-vector multiplication optimized for many-core CPUs with non-uniform caches

I would like to implement a parallel matrix-vector multiplication for a fixed-size matrix (~3500x3500 floats) optimized for my CPUs and cache layout (AMD Zen 2/4) that is repeatedly executed for changing input vectors (setup time is not critical, sustained performance is). The programming language is C++.

Can anyone point me to good (perhaps optimal) strategies for partitioning the matrix and the threads with respect to cache utilization and synchronization (reduction +=) overhead? For example, what block size is best, and how is the multiplication best traversed with several threads? I'd then try to apply the strategy to my particular CPUs.

I'm free to duplicate matrix data for cache efficiency across multiple CCXs, and the matrix doesn't need to be contiguous in RAM either. I can choose any format and order that promises best efficiency.

Alternatively, I'd also appreciate it if anybody knows of such a library or is able to share code. No need to reinvent things :)

Thanks.

Strontia answered 25/2, 2023 at 1:2 Comment(5)
Why not use BLAS libraries? They are made precisely for this, and have been highly optimized for decades for many specific platforms. Reinventing the wheel does not seem a good idea. You can try BLIS, for example. Actually, AMD recommends using it on their own CPUs.Golightly
I have tried several BLAS libraries. BLIS is not multi-threaded for level-2 operations. Slicing the matrix myself into several smaller BLIS multiplications doesn't perform well. OpenBLAS is multi-threaded, but doesn't perform (scale) well; it has no knowledge of the cache layout. Finally, I tried MKL, which performs a lot better than OpenBLAS, but still has several issues - apart from the risk that Intel doesn't support AMD, and that it could become impossible at any time to run MKL with good performance on AMD.Strontia
Even MKL performance is probably not optimal, because I doubt they optimize for the Zen architecture. In general, while BLAS has been around for a long time, I suspect that most of the famous and accessible implementations are not made for highly parallel MVMs on modern many-core CPUs. Also, BLAS needs to set up its internals for each sgemv() call. The BLAS API is tailored around monolithically stored matrices, and implementations do not reorder the data even when that would be beneficial. There's no such thing as a plan like in FFTW. BLAS is not optimized for repeated multiplications of the same matrix with a new vector.Strontia
Finally, a compile-time-sized MVM leaves more room for optimization than any dynamically sized algorithm can.Strontia
"Actually, AMD recommend using it on their own CPU", in my experience, everything that AMD recommends or optimized (FFTW, AOCC, etc) has no benefit over vanilla versions at best, or is even slower. I haven't found anything they recommend to improve the performance at the array sizes I work with.Strontia

First, try Eigen. Depending on the compiler, you might need to manually define macros to get proper SIMD: for Zen 2/3 you'd want EIGEN_VECTORIZE_AVX, EIGEN_VECTORIZE_FMA, and EIGEN_VECTORIZE_AVX2; for Zen 4, also EIGEN_VECTORIZE_AVX512.
Also, be sure to enable OpenMP in the project settings.
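
For reference, a minimal Eigen baseline might look like the sketch below (the size and iteration count are illustrative, not from the question). With GCC/Clang, Eigen normally picks up SIMD from the -march flags, so the macros above are only needed when that detection fails.

    // Minimal Eigen baseline sketch; build with e.g. g++ -O3 -march=znver2 -fopenmp eigen_mvm.cpp
    #include <Eigen/Dense>

    int main()
    {
        const int n = 3456;  // illustrative size, similar to the question's ~3500
        const Eigen::MatrixXf mat = Eigen::MatrixXf::Random( n, n );
        Eigen::VectorXf vec = Eigen::VectorXf::Random( n );
        Eigen::VectorXf result( n );

        // Repeated products of the same matrix with changing vectors
        for( int i = 0; i < 1000000; i++ )
        {
            result.noalias() = mat * vec;  // noalias() avoids a temporary
            vec[ i % n ] += 1.0f;          // mutate the input so the loop isn't optimized away
        }
        return 0;
    }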

If you want to try improving performance further, your #1 goal is saving memory bandwidth. A matrix-vector product is practically guaranteed to bottleneck on memory, not compute, since every matrix element is loaded for just a single FMA and the whole ~47 MB matrix has to be streamed through the cores on every call.

Reshape the matrix into panels, like this:

[Figure: a small example matrix reshaped into two panels of height 4.]

The numbers in the table are 0-based indices of the elements in memory.
Only instead of 4, use a panel height of 32 for AVX, or 64 for AVX-512.
Also, don't forget to align the data by at least the vector size, ideally by 64 bytes (a cache line).

Note that the last panel of the matrix probably needs its columns zero-padded. Ideally, the output vector also gets a few extra elements to make its length a multiple of the panel height; otherwise you need special code to handle the last panel of the matrix.
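
For illustration, a reshaping routine along these lines might look like the following sketch (the names reshapeToPanels and AlignedBuffer, and the use of std::aligned_alloc, are my own choices, not from this answer); it produces 64-byte-aligned, zero-padded, column-major panels of height 32 from a row-major source matrix.

    #include <cstdlib>
    #include <cstring>
    #include <cstddef>
    #include <memory>

    struct FreeDeleter { void operator()( float* p ) const { std::free( p ); } };
    using AlignedBuffer = std::unique_ptr<float[], FreeDeleter>;

    // Copy a rows*cols row-major matrix into a sequence of column-major panels of
    // height `panelHeight`, zero-padding the last panel. The buffer is 64-byte
    // aligned so the aligned loads in the inner kernel below are safe.
    AlignedBuffer reshapeToPanels( const float* src, size_t rows, size_t cols, size_t panelHeight = 32 )
    {
        const size_t panels = ( rows + panelHeight - 1 ) / panelHeight;
        const size_t count = panels * panelHeight * cols;
        const size_t bytes = ( count * sizeof( float ) + 63 ) / 64 * 64;  // aligned_alloc needs a multiple of the alignment
        float* dst = static_cast<float*>( std::aligned_alloc( 64, bytes ) );
        std::memset( dst, 0, bytes );

        for( size_t p = 0; p < panels; p++ )
        {
            float* panel = dst + p * panelHeight * cols;   // each panel holds panelHeight * cols floats
            for( size_t c = 0; c < cols; c++ )
                for( size_t r = 0; r < panelHeight; r++ )
                {
                    const size_t srcRow = p * panelHeight + r;
                    if( srcRow < rows )                    // rows past the end stay zero
                        panel[ c * panelHeight + r ] = src[ srcRow * cols + c ];
                }
        }
        return AlignedBuffer( dst );
    }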

In the inner loop, do something like this (untested):

#include <immintrin.h> // AVX / FMA intrinsics
#include <cstddef>     // size_t

// Compute the product of a width*32 matrix and a vector of length `width`;
// the result is a vector of length 32.
void multiplyInner_avx( const float* mat, const float* vec, size_t width, float* rdi )
{
    // Initialize the accumulators
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();

    // Compute these products
    const float* const vecEnd = vec + width;
    while( vec < vecEnd )
    {
        const __m256 v = _mm256_broadcast_ss( vec );
        vec++;

        acc0 = _mm256_fmadd_ps( v, _mm256_load_ps( mat ), acc0 );
        acc1 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 8 ), acc1 );
        acc2 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 16 ), acc2 );
        acc3 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 24 ), acc3 );
        mat += 32;
    }

    // Store the products
    _mm256_store_ps( rdi, acc0 );
    _mm256_store_ps( rdi + 8, acc1 );
    _mm256_store_ps( rdi + 16, acc2 );
    _mm256_store_ps( rdi + 24, acc3 );
}

For Zen 4, you'll need another version of the above to leverage AVX-512 vectors.
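
For illustration, a straightforward AVX-512 translation of the kernel above might look like this sketch (not part of the original answer; it assumes a panel height of 64 and 64-byte-aligned data):

    #include <immintrin.h>
    #include <cstddef>

    // Compute the product of a width*64 panel and a vector of length `width`;
    // the result is a vector of length 64.
    void multiplyInner_avx512( const float* mat, const float* vec, size_t width, float* rdi )
    {
        __m512 acc0 = _mm512_setzero_ps();
        __m512 acc1 = _mm512_setzero_ps();
        __m512 acc2 = _mm512_setzero_ps();
        __m512 acc3 = _mm512_setzero_ps();

        const float* const vecEnd = vec + width;
        while( vec < vecEnd )
        {
            // Broadcast one element of the input vector to all 16 lanes
            const __m512 v = _mm512_set1_ps( *vec );
            vec++;

            // One column of the 64-tall panel = 4 ZMM loads + FMAs
            acc0 = _mm512_fmadd_ps( v, _mm512_load_ps( mat ), acc0 );
            acc1 = _mm512_fmadd_ps( v, _mm512_load_ps( mat + 16 ), acc1 );
            acc2 = _mm512_fmadd_ps( v, _mm512_load_ps( mat + 32 ), acc2 );
            acc3 = _mm512_fmadd_ps( v, _mm512_load_ps( mat + 48 ), acc3 );
            mat += 64;
        }

        _mm512_store_ps( rdi, acc0 );
        _mm512_store_ps( rdi + 16, acc1 );
        _mm512_store_ps( rdi + 32, acc2 );
        _mm512_store_ps( rdi + 48, acc3 );
    }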

In the outer loop, divide the matrix into batches of approximately equal size, so that the number of batches equals the number of hardware threads in your CPU. Dispatch each batch to a different CPU thread; an easy way to do that is OpenMP, as in the sketch below.
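
A minimal sketch of that outer loop, assuming the panel-major layout described above (panel height 32) and the multiplyInner_avx kernel; the function name multiplyPanels is illustrative, not from the answer:

    #include <cstddef>

    // The AVX kernel from above
    void multiplyInner_avx( const float* mat, const float* vec, size_t width, float* rdi );

    void multiplyPanels( const float* panels, const float* vec, size_t width, size_t panelCount, float* result )
    {
        // schedule(static) gives each thread the same contiguous chunk of panels on every call
        #pragma omp parallel for schedule( static )
        for( ptrdiff_t p = 0; p < (ptrdiff_t)panelCount; p++ )
        {
            const float* panel = panels + (size_t)p * width * 32;  // each panel holds width * 32 floats
            multiplyInner_avx( panel, vec, width, result + (size_t)p * 32 );
        }
    }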

Ideally, make sure the process is stable, i.e. when you call your multiplication function for different vectors, the same batches of the input matrix are dispatched to the same CPU cores.

Note there’s no reduction anywhere, and no need for that. Different CPU cores simply compute different slices of the output vector, by multiplying different slices of the input matrix by the complete input vector.

The magic numbers are more or less arbitrary. Experiment to find the best ones. For instance, it's probably a good idea to increase the panel height by a factor of 2, making the multiplyInner_avx function produce 64 elements of the output, at the cost of 4 more vector load + FMA instructions inside the loop.


Once you have a first working version, one possible micro-optimization is to manually unroll the inner loop by 4: use _mm256_broadcast_ps to load 4 vector elements with a single instruction, then _mm256_permute_ps with the constants _MM_SHUFFLE(i,i,i,i), where i is in the [0..3] interval, to broadcast each of the 4 loaded vector elements into another vector.
After each permute, multiply/accumulate the broadcast vector with the current column of the matrix, and advance the matrix pointer to the next column.

For the last 1-3 elements of the vector, use the current version with _mm256_broadcast_ss.
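
A hedged sketch of that unrolled loop, for the 32-tall panels used by multiplyInner_avx (the function name is illustrative; it assumes `width` is a multiple of 4 and leaves the 1-3 element remainder to the _mm256_broadcast_ss version):

    #include <immintrin.h>
    #include <cstddef>

    void multiplyInner_avx_unrolled( const float* mat, const float* vec, size_t width, float* rdi )
    {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        __m256 acc2 = _mm256_setzero_ps();
        __m256 acc3 = _mm256_setzero_ps();

        // One FMA pass over a single 32-tall column, with `v` already broadcast
        auto column = [&]( __m256 v )
        {
            acc0 = _mm256_fmadd_ps( v, _mm256_load_ps( mat ), acc0 );
            acc1 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 8 ), acc1 );
            acc2 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 16 ), acc2 );
            acc3 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 24 ), acc3 );
            mat += 32;
        };

        for( const float* const vecEnd = vec + width; vec < vecEnd; vec += 4 )
        {
            // Load 4 vector elements once, duplicated into both 128-bit lanes
            const __m256 v4 = _mm256_broadcast_ps( reinterpret_cast<const __m128*>( vec ) );
            column( _mm256_permute_ps( v4, _MM_SHUFFLE( 0, 0, 0, 0 ) ) );  // broadcast element 0
            column( _mm256_permute_ps( v4, _MM_SHUFFLE( 1, 1, 1, 1 ) ) );  // broadcast element 1
            column( _mm256_permute_ps( v4, _MM_SHUFFLE( 2, 2, 2, 2 ) ) );  // broadcast element 2
            column( _mm256_permute_ps( v4, _MM_SHUFFLE( 3, 3, 3, 3 ) ) );  // broadcast element 3
        }

        _mm256_store_ps( rdi, acc0 );
        _mm256_store_ps( rdi + 8, acc1 );
        _mm256_store_ps( rdi + 16, acc2 );
        _mm256_store_ps( rdi + 24, acc3 );
    }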

The _mm256_permute_ps instruction is cheap and, most importantly, on AMD it can run on vector port #2 (on Zen 4, also on port #3); these ports aren't used by anything else in that loop.

I'm not 100% sure this is going to help, because there are more instructions to decode and dispatch. But it might, because it saves some memory loads.

Arson answered 25/2, 2023 at 19:31 Comment(21)
Great, thanks! I think I get the idea: multiply and accumulate each matrix column with the vector elements in parallel along the rows of the matrix (both data-parallel (SIMD) and task-parallel (threads)), that is, column after column, with a reordered matrix such that the matrix data is contiguous for this access sequence (and well aligned and correspondingly local in the L3 cache for the threads). Makes sense to me and sounds like this could be most efficient. I'll implement that and report.Strontia
@Strontia Right, the main performance wins of this method are sequential vectorized loads from the matrix, and the fact that every loaded element from the vector is reused to compute 32 output elements. Your vector only uses ~14 KB of memory (as opposed to the ~47 MB matrix), so it's OK to duplicate the vector into the L1d/L2 caches of every core.Arson
Soonts, based on your original function, I wrote an implementation for a 2048x2048 matrix and 64 threads (to start with) using my own threads that is about 20% faster than MKL for the same matrix size and number of threads :-) So it looks promising and I'll generalize it. Also, it doesn't have this problem stackoverflow.com/questions/75548704/… :-) This alone is huge benefit.Strontia
@Strontia Nice. However, if you have that many hardware threads, the 32-tall panels might result in too few panels in the matrix, not enough to evenly split the work across threads. You could try reducing panel height by half, and the following function for the inner loop: gist.github.com/Const-me/2b383e9129c2f343fb874c56055d189eArson
One more note, if you’re on Windows, consider the built-in thread pool. Example there: github.com/Const-me/Whisper/blob/master/Whisper/Utils/… and the corresponding parallelFor.cpp source file in the same folder. I don’t really know how they implemented waits, but maybe with GetQueuedCompletionStatus kernel API.Arson
Soonts, nice! This is even faster for the 2048x2048 matrix on 64 cores. MKL is about 12.5 us in average of 1,000,000 multiplications, and your latest version is down to about 9.5 us.Strontia
@Strontia Try another version. Same panel height 16, unrolled the loop more, might help: gist.github.com/Const-me/15111c4f9502b001eef31ffde4aa7770 BTW, note all these versions are producing slightly different results due to different summation order, and FP precision shenanigans. MKL probably uses yet another one.Arson
Soonts, very nice! Down to 9.0 us! I'm not observing statistically significant differences between ICC 2021.8, GCC 10.2.1, Clang 11.0, and AOCC 4.0 (clang 14).Strontia
Soonts, could you explain why using more accumulators is beneficial? Thanks!Strontia
Btw, I get the same performance if I split the 64 threads over two sockets. This is great, and also something that I didn't get with MKL or OpenBLAS and always wondered why.Strontia
@Strontia About the accumulators and unrolling, I think the main reason is pipelining. A processor core runs many instructions in parallel, but it can't start computing a+=b*c; before it has the previous value of a. Modern AMD processors can complete 2 of these FMAs every cycle in each core, but the latency is 4-5 cycles in the best case: 5 cycles on Zen 2, 4 cycles on the newer ones. When you have multiple independent accumulators like a0 += b*c0; a1 += b*c1; a2 += b*c2; a3 += b*c3; these 4 instructions don't have data dependencies between them, so all 4 of them will run in parallel.Arson
About the sockets, I don’t know the reason. I have limited experience with MKL, and I don’t know how specifically they parallelize that matrix*vector multiplication in their library.Arson
Soonts, I'm not quite understanding how to interpret the picture with the ordering above. For the multiplyInner_avx() code, the matrix would need to be stored in column-major format (including padding if needed), isn't that correct? However, this isn't how I'm reading the picture. What am I confusing?Strontia
@Strontia The matrix needs to be reordered into a sequence of panels, of height 16. Each panel is indeed a column-major matrix of size width*16, but the complete matrix composed from a sequence of these panels is neither row major nor column major. On the picture the panel height is 4, otherwise the illustration would be too large to be readable, and the illustration has 2 of these panels. And yes, the last panel of the matrix indeed very likely to require zero padding, unless the height of the matrix is a multiple of 16.Arson
One more thing, I’m not sure how specifically MKL treats input matrices for your product. It could be you need to transpose the matrix before producing the panels, possible to do on the fly while reshaping.Arson
Ah, okay, so your two rows of panels are for two different threads? That's what I implemented. Each thread gets a column-major re-ordered portion of the original matrix.Strontia
@Strontia Yeah, after you reshape the matrix that way, assign continuous subsets of these panels to different OS threads, to parallelize things. If your matrix is so small that it only has 2 panels, you’ll only be able to leverage 2 threads.Arson
I won't be using MKL in this application, just as a reference benchmark, but your intrinsics code (not sure which one yet). Using my own threads, etc, no MKL, or OpenMP.Strontia
@Strontia Different libraries are using different conventions. When I wrote my answer, I assumed you want a vector with elements equal to dot products of input vector and matrix rows. It could be the opposite, dot products of input vector and matrix columns (some libraries are doing that when you ask them to multiply vector*matrix, as opposed of matrix*vector). If that’s the case my answer’s still good, but you need different reshaping function for the matrix.Arson
Your assumption was correct.Strontia
Added F16C and AVX-512 versions.Strontia

Here's a multi-threaded SIMD 32-bit MVM implementation, based on @Soonts's suggestions, that performs somewhat decently on an AMD EPYC 7742.

https://gist.github.com/eastwindow/022668c9b495d932dc49dd8c45c81a7d

4096 x 4096 matrix: 30 us

3456 x 3456 matrix: 23 us

2048 x 2048 matrix: 7 us

1632 x 1632 matrix: 5.5 us

Synchronization overhead of 64 threads (no MVM): 2 us

All timings are averages of 1,000,000 repetitions with no break between repetitions.

However, with a break between repetitions (e.g. set wait = 200000; and remove multiplyInner_omp_simd_a() from void stay_busy( int threadnr )), the MVMs take a lot longer. Adding _mm_pause() doesn't help, and neither do the other things I tried to keep the AVX units busy.

Is there another way to keep the MVM fast after a break other than doing the MVM continuously?

(The system is well tuned for low latency and determinism. The cores on the socket are completely isolated, and all power-saving features in the CPU, as well as frequency up- and down-scaling, are disabled.)

Another thing I find noteworthy is the synchronization with atomics. If each atomic is on its own cache line, the synchronization takes the longest (4 us). Using 4 atomics per cache line results in the lowest overhead, 2 us, despite the false sharing. It seems as if the cache coherence protocols are executed faster than loading from 64 separate cache lines.
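
For illustration, the two flag layouts compared here could be declared roughly like this (a sketch; the struct names are mine, not from the linked gist):

    #include <atomic>
    #include <cstdint>

    constexpr int nThreads = 64;

    // One flag per 64-byte cache line: no false sharing, but the thread that
    // waits for everyone has to touch 64 different lines (measured ~4 us).
    struct alignas(64) PaddedFlag
    {
        std::atomic<uint32_t> done{ 0 };
    };
    PaddedFlag flagsPadded[ nThreads ];

    // Four flags per cache line: some false sharing within each group of 4,
    // but only 16 lines to read, which measured faster here (~2 us).
    struct alignas(16) PackedFlag
    {
        std::atomic<uint32_t> done{ 0 };
    };
    alignas(64) PackedFlag flagsPacked[ nThreads ];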

Strontia answered 8/3, 2023 at 22:44 Comment(0)
