Why is there no SIMD functionality in the C++ standard library?

SSE has been around since 1999, and together with its later extensions it is one of the most powerful tools for improving the performance of a C++ program. Yet there are no standardized containers, algorithms, etc. that make explicit use of it (that I am aware of). Is there a reason for this? Was there a proposal that never made it through?

Circuit answered 17/12, 2019 at 12:3 Comment(19)
Are you aware of std::execution policies? Most algorithms from the <algorithm> library can use them.Cockcrow
Isn't that for multithreading?Circuit
SSE intrinsics are x86-specific, while the C++ standard is, for the most part, portable across platforms. A cross-platform SIMD library would probably do a worse job than autovectorization in most people's hands, and people who could effectively use manual SIMD would probably pass it over for the low-level Intel intrinsics.Lashonda
What would the SIMD functionality look like for, say, a Z80 processor? (a Z80-like CPU was used in the original Game Boy)Atrice
Note that the compiler does allow you to use SIMD implicitly. godbolt.org/z/GSCRcB Wherever that is not enough, you are probably in the realm of hand-tuning to some (class of) CPU anyway.Embraceor
Yes, I am aware of implicit vectorization, but it's never guaranteed, and changing one line of code somewhere in your program could stop the compiler from vectorizing anything.Circuit
@Atrice fall back to scalar code?Circuit
@Lashonda But that statement holds for basically everything in the standard library; that's one of the downsides of standardizing something.Circuit
Could you clarify what "containers/algorithms" you have in mind?Uniat
github.com/xtensor-stack/xsimd. I like the approach of this library, @Uniat.Circuit
Hey, well how 'bout that, looks like they're baking something: en.cppreference.com/w/cpp/experimental/simdLashonda
and there's already std::valarray that you can use instead of waiting for std::experimental::simdSerenaserenade
Also, regarding the execution policies in C++20 (based on Intel TBB): the unseq* policies loosen some conditions in the std algorithms, which enables some SIMD vectorization, while the par* policies enable threading (a sketch follows these comments): oreilly.com/library/view/c-high-performance/9781787120952/…Lashonda
@Yamahari: allowing a reduction over FP values to not add them up in any particular order is necessary for a compiler to vectorize without -ffast-math. Integer math is already associative, though. Data parallelism can be turned into thread-level parallelism with threads, instruction-level parallelism with multiple accumulators, and SIMD with vectorization. Preferably all 3 at once because they're orthogonal: all cores running SIMD FMAs at 2/clock gets a lot of work done.Experientialism
Did not know about experimental::simd!Circuit
@Botje: On a target without hardware SIMD, source-level SIMD would turn into loops over elements of 4-element structs, or something like that, fully unrolled or folded into an outer loop as appropriate (or something worse than that if the compiler does a poor job). For example, GNU C has a native vector syntax that can compile for any target, with or without actual SIMD: gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html (__m128 and so on for x86 are defined in terms of that in GNU C; that's why you can do __m128 x,y; x += y; in GNU C like _mm_add_ps). A sketch of that syntax follows these comments.Experientialism
Processor-specific optimizations don't belong in a language; they belong in the implementation thereof.Chyack
Controversial opinion, but SSE really isn't designed generically. Think about it like this: say you have a template function that already works with integers, and you want it to work on vectors as well. Even with the existing third-party SIMD libraries, you can't just drop it in and expect it to work, as a lot of operations were simply not implemented.Gastight
std::valarray was made for SIMD functionality, but it turned out that devs were so bad at using it that it usually made programs run slower rather than faster.Flood
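
To make the execution-policy approach mentioned in the comments concrete, here is a minimal sketch (the function name and data layout are made up for illustration). std::execution::unseq is C++20, and whether the standard library and compiler actually emit SIMD for it is implementation-dependent:

#include <execution>  // std::execution::unseq (C++20)
#include <numeric>    // std::transform_reduce
#include <vector>

float dot(const std::vector<float>& a, const std::vector<float>& b)
{
    // The unsequenced policy allows iterations to be interleaved on one
    // thread, which permits (but does not guarantee) SIMD vectorization.
    return std::transform_reduce(std::execution::unseq,
                                 a.begin(), a.end(), b.begin(), 0.0f);
}

Swapping unseq for par_unseq additionally allows the work to be split across threads.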
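
The GNU C native vector syntax mentioned in the comments looks roughly like this minimal sketch (a GCC/Clang extension, not ISO C++; the type and function names are illustrative):

// Four packed floats; GCC/Clang lower operations on this type to SIMD
// instructions where the target has them, and to per-element scalar code
// where it does not.
typedef float v4sf __attribute__((vector_size(16)));

v4sf madd(v4sf acc, v4sf a, v4sf b)
{
    return acc + a * b;  // element-wise multiply and add
}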

There is experimental support in the Parallelism TS v2 for explicit short-vector SIMD types that map to the SIMD extensions of common ISAs, but only GCC implements it as of August 2021. The cppreference documentation for it (linked in the comments above) is incomplete, but additional details are covered in the Working Draft, Technical Specification for C++ Extensions for Parallelism, Document N4808. The ideas behind this proposal were developed during a PhD project (2015 thesis here). The author of the GCC implementation wrote an article on converting an existing SSE string-processing algorithm to a 2019 iteration of his library, achieving similar performance and much greater readability. Here is some simple code using it, along with the generated assembly:

Multiply-add

#include <experimental/simd> // Fails on MSVC 19 and others
using vec4f = std::experimental::fixed_size_simd<float,4>;

void madd(vec4f& out, const vec4f& a, const vec4f& b)
{
    out += a * b;
}

Compiling with -march=znver2 -Ofast -ffast-math, we do get a hardware fused multiply-add (vfmadd132ps) generated for this:

madd(std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >&, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> > const&, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> > const&):
        vmovaps xmm0, XMMWORD PTR [rdx]
        vmovaps xmm1, XMMWORD PTR [rdi]
        vfmadd132ps     xmm0, xmm1, XMMWORD PTR [rsi]
        vmovaps XMMWORD PTR [rdi], xmm0
        ret

Dot Product

A dot/inner product can be written tersely:

float dot_product(const vec4f a, const vec4f b)
{
    return reduce(a * b);
}

With -Ofast -ffast-math -march=znver2, this compiles to:

dot_product(std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >):
        vmovaps xmm1, XMMWORD PTR [rsi]
        vmulps  xmm1, xmm1, XMMWORD PTR [rdi]
        vpermilps       xmm0, xmm1, 27
        vaddps  xmm0, xmm0, xmm1
        vpermilpd       xmm1, xmm0, 3
        vaddps  xmm0, xmm0, xmm1
        ret

(Godbolt link with some more playing around).

Dieback answered 5/8, 2021 at 20:1 Comment(4)
Interesting. gcc -O2 doesn't include auto-vectorization, so that use of SIMD is proof that it's defined with __attribute__((vector_size(16))). However, godbolt.org/z/79KddEdba shows that even if you pass by value in C, it's not passed in xmm registers the way a plain __m128 non-class typedef is, in the x86-64 System V ABI, and even returning it happens by pointer :( Fortunately that doesn't matter after inlining, but you normally don't want to pass SIMD vectors by reference. (A sketch contrasting the two calling conventions follows these comments.)Experientialism
@PeterCordes the code contains a comment "The following ensures, function arguments are passed via the stack. This is important for ABI compatibility across TU boundaries" :-(Appetency
There is more mention of the ODR, TU, inline issues here ("The majority of functions are marked as [[gnu::always_inline]] to enable quasi-ODR-conforming linking of TUs with different -m flags."): gcc.1065356.n8.nabble.com/…Dieback
"> > Note that excessive use of always_inline can cause compile-time issues > > (see for example PR99785). > > Ah, I should verify whether that's also the reason my stdx::simd > implementation is slow to compile." gcc.1065356.n8.nabble.com/…Dieback
