How to tell GCC's target_clones to compile for all SIMD levels?

GCC has a function attribute target_clones which can be used to create different versions of a function that are compiled to use different instruction sets in such a way that, when the binary is executed, the version with the highest-level instruction set is selected to execute.
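To make the runtime selection concrete, here is a rough hand-written sketch of the kind of check the attribute automates, using GCC's `__builtin_cpu_init` and `__builtin_cpu_supports` builtins. The helper name `best_simd_level` is made up for illustration, and only three levels are probed; the resolver GCC actually generates for `target_clones` is not shown here and may differ.

```c
#include <stdio.h>

/* Hypothetical helper (not a GCC API): returns the name of the widest of
   three ISA levels that the running CPU reports as available. */
static const char *best_simd_level(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                       /* populate the CPU feature data */
    if (__builtin_cpu_supports("avx2")) return "avx2";
    if (__builtin_cpu_supports("avx"))  return "avx";
    if (__builtin_cpu_supports("sse2")) return "sse2";
#endif
    return "default";                           /* non-x86 build or no match */
}

/* Prints which level a hand-written dispatcher would choose on this machine. */
static void report_simd_level(void)
{
    printf("would dispatch to: %s\n", best_simd_level());
}
```

Calling `report_simd_level()` from `main` prints the level chosen on the current machine. With `target_clones` none of this has to be written by hand, since the choice is made once per function when the binary is loaded.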

Assuming I have some piece of code doing lots of floating point operations, I can have it use the highest-level SIMD instruction set by writing something like this:

__attribute__((target_clones("default", "sse", "sse2", "sse3", "ssse3", "sse4.1", "sse4.2", "avx", "avx2", "avx512f")))
double my_ddot1(int n, double x[], double y[])
{
    double result = 0;
    for (int i = 0; i < n; i++)
        result += x[i] * y[i];
    return result;
}

But that involves manually specifying all the possible instruction sets that I want it to specialize for.

Now, assuming n is large, the code above will evidently run faster the more operations are done at once, so I just need a version generated for each SIMD level (sse2/3/4, avx1/2). I can manually list the ones available for x86-64 and put them in the function attribute, but in 10 years' time the repertoire of available options is likely to grow. For example, if an "avx2048" gets created later, which would evidently benefit the function being optimized, the code above will not pick it up and will instead stay stuck with the highest level in the list.

How can I tell target_clones to compile for every SIMD level without having to list them, in such a way that the list would automatically update according to what the compiler supports?

Danyelldanyelle answered 5/2, 2022 at 18:10 Comment(3)
The sticking point here is that SSE3 through SSE4 versions of this particular function would be pure bloat (and would very likely compile to the same asm as the SSE2 version). So would AVX2, since AVX1 already provides FP SIMD for 256-bit vectors. And SSE1 doesn't provide double-precision operations at all, only single; it would need to use x87 for that version. (SSE2 is baseline for x86-64, but not necessarily for 32-bit mode.) - Creese
Your code will likely only profit from vectorization if you compile with some fast-math optimizations enabled (I guess associative-math). Also, I think gcc would use FMA if available, but clang would not (unless fast-math is enabled). - Criminal
@Criminal Sure, but those can safely go in the compilation arguments without having to worry about compatibility of the generated binaries. Alternatively, pragmas and the "optimize" attribute also do the trick. - Danyelldanyelle
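Putting the comments above together, a trimmed variant might look like the sketch below. It keeps only clones that widen the floating-point vectors: "default" (SSE2 on x86-64), "avx" and "avx512f". The macro name `DDOT_CLONES` and the function name `my_ddot_trimmed` are made up for illustration, and the guard assumes a GCC-compatible compiler targeting x86-64 Linux with ifunc support; elsewhere the attribute expands to nothing. The list is still written by hand, so this does not answer the "automatically update" part of the question.

```c
/* Assumption: GCC-compatible compiler on x86-64 Linux with ifunc support.
   On any other target the macro is empty and a single version is built. */
#if defined(__GNUC__) && defined(__x86_64__) && defined(__linux__)
#define DDOT_CLONES __attribute__((target_clones("default", "avx", "avx512f")))
#else
#define DDOT_CLONES
#endif

/* Same dot product as in the question, with the shorter clone list:
   128-bit (baseline SSE2), 256-bit (AVX) and 512-bit (AVX-512F) versions. */
DDOT_CLONES
double my_ddot_trimmed(int n, const double x[], const double y[])
{
    double result = 0;
    for (int i = 0; i < n; i++)
        result += x[i] * y[i];
    return result;
}
```

As noted in the comments, the reduction is only likely to be vectorized if reassociation is allowed, for example by building with something like `gcc -O3 -ffast-math`. Those flags go on the command line and are independent of the clone list.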
