SSE has been around since 1999, and it and its successor extensions are among the most powerful tools for improving the performance of your C++ program. Yet there are no standardized containers, algorithms, etc. that make explicit use of it (that I am aware of?). Is there a reason for this? Was there a proposal that never made it through?
There is experimental support in the Parallelism TS v2 for explicit short-vector SIMD types that map to the SIMD extensions of common ISAs, but only GCC implements it as of August 2021. The cppreference documentation for it linked above is incomplete, but there are additional details covered in the Working Draft, Technical Specification for C++ Extensions for Parallelism, Document N4808. The ideas behind this proposal were developed during a PhD project (2015 thesis here). The author of the GCC implementation wrote an article on converting an existing SSE string-processing algorithm to use a 2019 iteration of his library, achieving similar performance and much greater readability. Here's some simple code using it and the generated assembly:
Multiply-add
#include <experimental/simd> // Fails on MSVC 19 and others
using vec4f = std::experimental::fixed_size_simd<float,4>;
void madd(vec4f& out, const vec4f& a, const vec4f& b)
{
out += a * b;
}
Compiling with -march=znver2 -Ofast -ffast-math, we do get a hardware fused multiply-add generated for this:
madd(std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >&, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> > const&, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> > const&):
vmovaps xmm0, XMMWORD PTR [rdx]
vmovaps xmm1, XMMWORD PTR [rdi]
vfmadd132ps xmm0, xmm1, XMMWORD PTR [rsi]
vmovaps XMMWORD PTR [rdi], xmm0
ret
Dot Product
A dot/inner product can be written tersely:
float dot_product(const vec4f a, const vec4f b)
{
return reduce(a * b);
}
Compiled with -Ofast -ffast-math -march=znver2:
dot_product(std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >):
vmovaps xmm1, XMMWORD PTR [rsi]
vmulps xmm1, xmm1, XMMWORD PTR [rdi]
vpermilps xmm0, xmm1, 27
vaddps xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 3
vaddps xmm0, xmm0, xmm1
ret
gcc -O2 doesn't include auto-vectorization, so that use of SIMD is proof that it's defined with __attribute__((vector_size(16))). However, godbolt.org/z/79KddEdba shows that even if you pass by value in C++, it's not passed in xmm registers the way a plain __m128 non-class typedef is in the x86-64 System V ABI, and even returning it happens by pointer :( Fortunately that doesn't matter after inlining, but you normally don't want to pass SIMD vectors by reference. – Experientialism
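(Not from the thread — a minimal sketch of the pass-by-value style that comment recommends, reusing the vec4f alias from the answer above. Per the comment, this class type still isn't passed in xmm registers by the x86-64 System V ABI, but that stops mattering once the call is inlined.)
// Sketch: the same multiply-add, taking and returning vec4f by value
// instead of by reference.
vec4f madd_val(vec4f acc, vec4f a, vec4f b)
{
    acc += a * b;   // element-wise multiply-add, FMA candidate as before
    return acc;
}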
("[[gnu::always_inline]] to enable quasi-ODR-conforming linking of TUs with different -m flags."): gcc.1065356.n8.nabble.com/… – Dieback
std::execution policies? Most algorithms from the <algorithm> library can use them. – Cockcrow
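(A minimal sketch, not from the thread, of what that comment suggests: standard algorithms accept an execution policy, and std::execution::par_unseq additionally permits vectorization. Whether SIMD code is actually generated is up to the implementation. Requires C++17.)
#include <algorithm>
#include <execution>
#include <vector>

// Sketch: scale every element; par_unseq allows the implementation to
// use both threads and SIMD for this loop.
void scale(std::vector<float>& v, float factor)
{
    std::transform(std::execution::par_unseq, v.begin(), v.end(), v.begin(),
                   [factor](float x) { return x * factor; });
}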
There's std::valarray that you can use instead of waiting for std::experimental::simd. – Serenaserenade
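(A sketch of mine rather than the commenter's code: std::valarray expresses element-wise math directly, though whether it compiles to SIMD instructions is entirely up to the optimizer.)
#include <valarray>

// Sketch: element-wise multiply-add over whole arrays with valarray.
std::valarray<float> madd(const std::valarray<float>& a,
                          const std::valarray<float>& b,
                          const std::valarray<float>& c)
{
    return a * b + c;
}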
Vectorizing FP reductions needs -ffast-math. Integer math is already associative, though. Data parallelism can be turned into thread-level parallelism with threads, instruction-level parallelism with multiple accumulators, and SIMD with vectorization. Preferably all 3 at once, because they're orthogonal: all cores running SIMD FMAs at 2/clock gets a lot of work done. – Experientialism
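(A sketch of the multiple-accumulator idea from that comment, my example only: independent accumulators hide FP add latency, each accumulator could itself be a SIMD vector, and threads would then split the array into chunks.)
#include <cstddef>

// Sketch: four independent accumulators give instruction-level parallelism.
// Assumes n is a multiple of 4 to keep the example short.
float sum4(const float* data, std::size_t n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}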
(__m128 and so on for x86 are defined in terms of that in GNU C; that's why you can do __m128 x,y; x += y; in GNU C without intrinsics like _mm_add_ps) – Experientialism
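(A minimal GNU C sketch of what that comment describes; the vector_size attribute is a GCC/Clang extension, not standard C++.)
// A 16-byte vector of 4 floats; __m128 is defined like this in GNU C,
// which is why plain operators work on it without intrinsics.
typedef float v4sf __attribute__((vector_size(16)));

v4sf add(v4sf a, v4sf b)
{
    return a + b;   // compiles to a single addps / vaddps
}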
std::valarray was made for SIMD functionality, but it turned out that devs were so bad at using it that it usually made programs run slower rather than faster. – Flood