SSE has been around since 1999, and it and its successor extensions are among the most powerful tools for improving the performance of your C++ program. Yet there are no standardized containers, algorithms, etc. that make explicit use of it (that I am aware of?). Is there a reason for this? Was there a proposal that never made it through?
There is experimental support in the Parallelism TS v2 for explicit short-vector SIMD types that map to the SIMD extensions of common ISAs, but only GCC implements it as of August 2021. The cppreference documentation for it linked above is incomplete, but there are additional details covered in the Working Draft, Technical Specification for C++ Extensions for Parallelism, Document N4808. The ideas behind this proposal were developed during a PhD project (2015 thesis here). The author of the GCC implementation wrote an article on converting an existing SSE string-processing algorithm to use a 2019 iteration of his library, achieving similar performance and much greater readability. Here's some simple code using it and the generated assembly:
Multiply-add
#include <experimental/simd> // Fails on MSVC 19 and others
using vec4f = std::experimental::fixed_size_simd<float,4>;
void madd(vec4f& out, const vec4f& a, const vec4f& b)
{
out += a * b;
}
Compiling with -march=znver2 -Ofast -ffast-math, we do get a hardware fused multiply-add generated for this:
madd(std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >&, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> > const&, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> > const&):
vmovaps xmm0, XMMWORD PTR [rdx]
vmovaps xmm1, XMMWORD PTR [rdi]
vfmadd132ps xmm0, xmm1, XMMWORD PTR [rsi]
vmovaps XMMWORD PTR [rdi], xmm0
ret
Dot Product
A dot/inner product can be written tersely:
float dot_product(const vec4f a, const vec4f b)
{
return reduce(a * b);
}
Compiled with -Ofast -ffast-math -march=znver2:
dot_product(std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >):
vmovaps xmm1, XMMWORD PTR [rsi]
vmulps xmm1, xmm1, XMMWORD PTR [rdi]
vpermilps xmm0, xmm1, 27
vaddps xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 3
vaddps xmm0, xmm0, xmm1
ret
gcc -O2 doesn't include auto-vectorization, so that use of SIMD is proof that it's defined with __attribute__((vector_size(16))). However, godbolt.org/z/79KddEdba shows that even if you pass by value in C++, it's not passed in xmm registers the way a plain __m128 non-class typedef is in the x86-64 System V ABI, and even returning it happens by pointer :( Fortunately that doesn't matter after inlining, but you normally don't want to pass SIMD vectors by reference. – Experientialism
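(Not from the thread — a minimal sketch of the pass-by-value style that comment recommends, reusing the vec4f alias from the answer above. Per the comment, this class type still isn't passed in xmm registers by the x86-64 System V ABI, but that stops mattering once the call is inlined.)
// Sketch: the same multiply-add, taking and returning vec4f by value
// instead of by reference.
vec4f madd_val(vec4f acc, vec4f a, vec4f b)
{
    acc += a * b;   // element-wise multiply-add, FMA candidate as before
    return acc;
}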
("[[gnu::always_inline]] to enable quasi-ODR-conforming linking of TUs with different -m flags."): gcc.1065356.n8.nabble.com/… – Dieback
std::execution policies? Most algorithms from the <algorithm> library can use them. – Cockcrow
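(A minimal sketch, not from the thread, of what that comment suggests: standard algorithms accept an execution policy, and std::execution::par_unseq additionally permits vectorization. Whether SIMD code is actually generated is up to the implementation. Requires C++17.)
#include <algorithm>
#include <execution>
#include <vector>

// Sketch: scale every element; par_unseq allows the implementation to
// use both threads and SIMD for this loop.
void scale(std::vector<float>& v, float factor)
{
    std::transform(std::execution::par_unseq, v.begin(), v.end(), v.begin(),
                   [factor](float x) { return x * factor; });
}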
There's std::valarray that you can use instead of waiting for std::experimental::simd. – Serenaserenade
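(A sketch of mine rather than the commenter's code: std::valarray expresses element-wise math directly, though whether it compiles to SIMD instructions is entirely up to the optimizer.)
#include <valarray>

// Sketch: element-wise multiply-add over whole arrays with valarray.
std::valarray<float> madd(const std::valarray<float>& a,
                          const std::valarray<float>& b,
                          const std::valarray<float>& c)
{
    return a * b + c;
}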
Vectorizing FP reductions needs -ffast-math. Integer math is already associative, though. Data parallelism can be turned into thread-level parallelism with threads, instruction-level parallelism with multiple accumulators, and SIMD with vectorization. Preferably all 3 at once, because they're orthogonal: all cores running SIMD FMAs at 2/clock gets a lot of work done. – Experientialism
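(A sketch of the multiple-accumulator idea from that comment, my example only: independent accumulators hide FP add latency, each accumulator could itself be a SIMD vector, and threads would then split the array into chunks.)
#include <cstddef>

// Sketch: four independent accumulators give instruction-level parallelism.
// Assumes n is a multiple of 4 to keep the example short.
float sum4(const float* data, std::size_t n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}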
(__m128 and so on for x86 are defined in terms of that in GNU C; that's why you can do __m128 x,y; x += y; in GNU C without intrinsics like _mm_add_ps) – Experientialism
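(A minimal GNU C sketch of what that comment describes; the vector_size attribute is a GCC/Clang extension, not standard C++.)
// A 16-byte vector of 4 floats; __m128 is defined like this in GNU C,
// which is why plain operators work on it without intrinsics.
typedef float v4sf __attribute__((vector_size(16)));

v4sf add(v4sf a, v4sf b)
{
    return a + b;   // compiles to a single addps / vaddps
}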
std::valarray was made for SIMD functionality, but it turned out that devs were so bad at using it that it usually made programs run slower rather than faster. – Flood