specify simd level of a function that compiler can use
Asked Answered
B

2

4

I wrote some code and compiled it using gcc with the native architecture option.

Typically I can take this code and run it on an older computer that doesn't have AVX2 (only AVX), and it works fine. It seems however that the compiler is actually emitting AVX2 instructions (finally!), rather than me needing to include SIMD intrinsics myself.

I'd like to modify the program so that both pathways are supported (AVX2 and non-AVX2). In other words I'd like something the following pseudocode.

if (AVX2){
   callAVX2Version();
}else if (AVX){
   callAVXVersion();
}else{
   callSSEVersion();
}

void callAVX2Version(){
#pragma gcc -mavx2
}

void callAVXVersion(){
#pragma gcc -mavx
}

I know how to do the runtime detection part, my question is whether it is possible to do the function specific SIMD selection part.

Blab answered 6/5, 2019 at 15:13 Comment(4)
What about using target gcc function attribute? Just do void callAVXVersion() attribute(target("avx,no-avx2,no-sse")) { some implementation } void callAVX2Version() { attribute(target("avx2,no-sse")) { some other implementation }. Never tried it tho, but I was always curios how it works. You can also #pragma gcc target avx2 or #pragma gcc target avx.Excrement
I was unaware of function attributes. Where exactly does the attribute go? It seems like they require a separate function declaration rather than getting attached to the implementation?Blab
Without resorting to compiler specific extensions or pragmas, you can put each architecture specific function into its own source module and compile it with the appropriate architecture flags.Beattie
Where exactly does the attribute go? - near function definition or declaration. __attribute__((I put them here)) int __attribute__((you can put it here))__ func(int a, int b) __atttribute__((or here, does not matter)) {}. The description is clear - use target attribute to specify that a function is to be compiled with different target options than specified on the command line.Excrement
D
6

The simple and clean Option

The gcc target attribute can be used out of hand like so

[[gnu::target("avx")]]
void foo(){}

[[gnu::target("default")]]
void foo(){}

[[gnu::target("arch=sandybridge")]]
void foo(){}

the call then becomes

foo();

This option does away with the need to name a function differently. If you check out godbolt for example you will see that it creates @gnu_indirect_function for you. set it first to a .resolver function. Which reads the __cpu_model to find out what can be used and set the indirect function to that pointer so any subsequent calls will be a simple function indirect. simple aint it. But you might need to remain closer to you original code base therefore there are other ways

function switching

If you do need function switching like in your original example. the following can be used. Which uses nicely worded buildtins so its clear that you are switching on architecture

[[gnu::target("avx")]]
int foo_avx(){ return 1;}

[[gnu::target("default")]]
int foo(){return 0;}

[[gnu::target("arch=sandybridge")]]
int foo_sandy(){return 2;}

int main ()
{
    if (__builtin_cpu_is("sandybridge"))
        return foo_sandy();
    else if (__builtin_cpu_supports("avx"))
        return  foo_avx();
    else
        return foo();
}

Define your own indirect function

Because of reasons to be more verbose to others or platforms concerns were indirect functions might not be a supported use case. Below is a way that does the same as the first option but all in c++ code. using a static local function pointer. This means you could order the priority for targets to your own liking or on cases were the build in isn't supported. You can supply your own.

auto foo()
{
    using T = decltype(foo_default);
    static T* pointer = nullptr;
    //static int (*pointer)() = nullptr; 
    if (pointer == nullptr)
    {
    if (__builtin_cpu_is("sandybridge"))
        pointer = &foo_sandy;
    else if (__builtin_cpu_supports("avx"))
        pointer = &foo_avx;
    else
        pointer = &foo_default;        
    }
    return pointer();
};

As a bonus note

the following templated example on godbolt uses template<class ... Ts> to deal with overloads of your functions meaning if you define a family of callXXXVersion(int) then foo(int) will happily call the overloaded version for you. as long as you defined the entire family.

Dolley answered 5/8, 2020 at 17:52 Comment(3)
I'd suggest using "haswell" as your example specific-ISA, instead of SnB. Haswell includes FMA, so that's often something you'd want to switch on separately from AVX. Also, how do the target settings interact with tune options? Would a "haswell" version get compiled with -mno-avx256-split-unaligned-load, or would it use the tune=generic behaviour that's inefficient when it can't prove that pointers are always 32-byte aligned? (Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)Perfect
Dont have all the answers to your comment @PeterCordes. Do have 2 things to say. 1) [[gnu::target("arch=sandybridge,tune=haswell")]] is valid in gcc this seems to emit the same codes as if the were entered o the command line options. 2) My impression is that if a target attribute doesn't include its own specif tune it will use the tune as specified by the command line options.Dolley
Ok, well definitely something to check on if using these attributes. The default tuning of -mavx256-split-unaligned-load can defeat part of the benefit of having an AVX2 and/or FMA version of some code.Perfect
B
1

Here's my solution. I can compile with AVX2 support and still run on my Ivy Bridge processor (AVX only) just fine.

The functions are:

__attribute__((target("arch=haswell")))
void fir_avx2_std(STD_DEF){
    STD_FIR;    
}

__attribute__((target("arch=sandybridge")))
void fir_avx_std(STD_DEF){
    STD_FIR;
}

//Use default - no arch specified
void fir_sse_std(STD_DEF){
    STD_FIR;    
}

The call is:

if (s.HW_AVX2 && s.OS_AVX){
    fir_avx2_std(STD_Call);
}else if(s.HW_AVX && s.OS_AVX){
    fir_avx_std(STD_Call);
}else{
    fir_sse_std(STD_Call);
}   

s is a structure that is populated based on some code I found online (https://github.com/Mysticial/FeatureDetector)

STD_FIR is a macro with the actual code, which gets optimized differently for each architecture.

I'm compiling with: -std=c11 -ffast-math -O3

I originally had -march=haswell as well, but that was causing problems.

Note, I'm not entirely sure if this is the best target breakdowns ... Also, I tried getting target_clones to work, but I was getting an error about needing ifunc (I thought gcc did that for me ...)

Blab answered 7/5, 2019 at 0:31 Comment(7)
-march=haswell tells the compiler it can emit Haswell-only code (like BMI2 shlx for variable-count shifts) everywhere, except in the few functions where you overrode the target attribute. You should compile with something like -mtune=haswell, or if you want to set a baseline of SSE4.2 + popcnt or something, -march=nehalem -mtune=haswellPerfect
Do CPU feature detection once and set some function pointers in a global struct of function pointers, or just loose pointers if you only have one or two. Or at least simplify so you don't have to check 2 different things every time you dispatch to the function! Also, you have a typo: your AVX1 check is still checking s.HW_AVX2, so IvB and earlier will only use the sse version.Perfect
@PeterCordes as always thanks for your helpful comments. It never occurred to me to simplify the two flags!Blab
Oh, and I just noticed you're using arch=sandybridge for your non-AVX function. That will compile it with VEX prefixes, because SnB supports AVX! Perhaps you want arch=nehalem for SSE4.2 + popcnt support? Or since it's the fallback, probably with no attribute so it gets the baseline you're compiling against. And unless you need AVX + something else like F16C (half-precision float conversion), your AVX function should probably be compiled with arch=sandybridge. Or maybe just target(avx,popcnt).Perfect
@PeterCordes Again, thanks for the catch. I've never learned how to examine the generated assembly so I'm not entirely sure what I need to version against (perhaps it is time to learn this). I was also surprised looking at the x86 options how many additional instructions there were besides things like SSE, AVX, FMA, and BMI (e.g. FSGSBASE, RDRND, F16C, ADCX, etc.) Even something like FMA (and others?) is likely useful in this case so it isn't necessarily as simple as choosing between SSE4.2, AVX, and AVX2.Blab
Most of those instructions are ones gcc won't use on its own. e.g. wrfsbase is only useful to allocate thread-local storage in user-space. rdrand generates a random number, and I don't think even C++ standard-library std::random_device will use it via an intrinsic. And GCC won't use half-precision math when auto-vectorizing. It might possibly use ADCX for some cases with __int128, though, instead of regular adc. You can look at the generated asm with How to remove "noise" from GCC/clang assembly output?Perfect
But yeah, FMA is very valuable for FP routines. You should probably still enable Haswell for your AVX2 function, and check the FMA feature bit as well as AVX2 when setting up function pointers. (There is one CPU with AVX2 but not FMA, a Via Nano something, unfortunately.)Perfect

© 2022 - 2024 — McMap. All rights reserved.