The Effect of Architecture When Using SSE / AVX Intrinisics
Asked Answered
D

2

1

I wonder how does a Compiler treats Intrinsics.

If one uses SSE2 Intrinsics (Using #include <emmintrin.h>) and compile with -mavx flag. What will the compiler generate? Will it generate AVX or SSE code?

If one uses AVX2 Intrinsics (Using #include <immintrin.h>) and compile with -msse2 flag. What will the compiler generate? Will it generate SSE Only or AVX code?

How does compilers treat Intrinsics?
If one uses Intrinsics, does it help the compiler understand the dependency in the loop for better vectorization?

For instance, what's going on here - https://godbolt.org/z/Y4J5OA (Or https://godbolt.org/z/LZOJ2K)?
See all 3 panes.

The Context

I'm trying to build various version of the same functions with different CPU features (SSE4 and AVX2).
I'm writing the same version one with SSE Intrinsics and once with AVX Intrinsics.
Let's say theyare name MyFunSSE() and MyFunAVX(). Both are in the same file.

How can I make the Compiler (Same method should work for MSVC, GCC and ICC) build each of them using only the respective functions?

Dehumidifier answered 18/4, 2019 at 14:6 Comment(4)
Updated my answer. I think you're just looking for GNU C's __attribute__((target("avx"))).Exegete
godbolt.org/z/lRr9q7, godbolt.org/z/3pKKT2, godbolt.org/z/vViboKDehumidifier
What am I looking for in those links? compilers use VEX encodings when you compile with -mavx2, and they don't when you don't. This is how it's always worked for gcc/clang/ICC. (And MSVC for -arch:AVX or not.)Exegete
BTW, it's pointless to #include <emmintrin.h> if you're also going to include the catch-all #include <immintrin.h>. Always just #include <immintrin.h>, unless you want to include less on MSVC to stop yourself from accidentally using certain extensions, because its target-options model is different from gcc/clang.Exegete
E
5

GCC and clang require that you enable all extensions you use. Otherwise it's a compile-time error, like error: inlining failed to call always_inline error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch

Using -march=native or -march=haswell or whatever is preferred over enabling specific extensions, because that also sets appropriate tuning options. And you don't forget useful ones like -mpopcnt that will let std::bitset::count() inline a popcnt instruction, and make all variable-count shifts more efficient with BMI2 shlx / shrx (1 uop vs. 3)


MSVC and ICC do not, and will let you use intrinsics to emit instructions that they couldn't auto-vectorize with.

You should definitely enable AVX if you use AVX intrinsics. Older MSVC without enabling AVX didn't always use vzeroupper automatically where needed, but that's been fixed for a few years. Still, if your whole program can assume AVX support, definitely tell the compiler about it even for MSVC.


For compilers that support GNU extensions (GCC, clang, ICC), you can use stuff like __attribute__((target("avx"))) on specific functions in a compilation unit. Or better, __attribute__((target("arch=haswell"))) to maybe also set tuning options. (That also enables AVX2 and FMA, which you might not want. And I'm not sure if target attributes do set -mtune=xx). See https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html

__attribute__((target())) will prevent them from inlining into functions with other target options, so be careful to use this on functions they will inline into, if the function itself is too small. Use it on a function containing a loop, not a helper function called in a loop.

See also https://gcc.gnu.org/wiki/FunctionMultiVersioning for using different target options on multiple definitions of the same function name, for compiler supported runtime dispatching. But I don't think there's a portable (to MSVC) way to do that.

See specify simd level of a function that compiler can use for more about doing runtime dispatch on GCC/clang.


With MSVC you don't need anything, although like I said I think it's normally a bad idea to use AVX intrinsics without -arch:AVX, so you might be better off putting those in a separate file. But for AVX vs. AVX2 + FMA, or SSE2 vs. SSE4.2, you're fine without anything.

Just #define AVX2_FUNCTION to the empty string instead of __attribute__((target("avx2,fma")))

#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
// apparently ICC doesn't support target attributes, despite supporting GNU C
#define TARGET_HASWELL __attribute__((target("arch=haswell")))
#else
#define TARGET_HASWELL   // empty
 // maybe warn if __AVX__ isn't defined for functions where this is used?
 // if you need to make sure MSVC uses vzeroupper everywhere needed.
#endif


TARGET_HASWELL
void foo_avx(float *__restrict dst, float *__restrict src)
{
   for (size_t i = 0 ; i<1024 ; i++) {
       __m256 v = _mm256_loadu_ps(src);
       ...
       ...
   }
}

With GCC and clang, the macro expands to the __attribute__((target)) stuff; with MSVC and ICC it doesn't.


ICC pragma:

https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-optimization-parameter documents a pragma which you'd want to put before AVX functions to make sure vzeroupper is used properly in functions that use _mm256 intrinsics.

#pragma intel optimization_parameter target_arch=AVX

For ICC, you could #define TARGET_AVX as this, and always used it on a line by itself before the function, where you can put an __attribute__ or a pragma. You might also want separate macros for defining vs. declaring functions, if ICC doesn't want this on declarations. And a macro to end a block of AVX functions, if you want to have non-AVX functions after them. (For non-ICC compilers, this would be empty.)

Exegete answered 18/4, 2019 at 14:41 Comment(18)
This questions come from having issues with ICC. I write functions with SSE Intrinsic and compile them with /arch:SSE2 and /Qax=CORE-AVX2 and yet get a code which fails on SSE only CPU's. I don't get it...Dehumidifier
@Royi: then you should create a minimal reproducible example for that (including the faulting instruction in a debugger) and ask about it! Your Godbolt link doesn't use any non-SSE2 instructions I can see, except in the tab that uses -mavx2.Exegete
The problem is the code is large and I don't have access to non AVX CPU. I just see the code I compile like that fails on people with non AVX CPU though I use this features of ICC.Dehumidifier
I will try editing the question to make my goal clearer. The goal is being able to generate code with SSE and AVX path. But If I build one function with SSE Intrinsics as one Flavor and AVX as the other yet when I compile I have to enable AVX and then the SSE code is also generating AVX I'm stuck.Dehumidifier
Peter, I added some info about the context.Dehumidifier
Peter, I didn't understand the MSVC part - #define AVX2_FUNCTION.Dehumidifier
Peter, I also don't see ICC support the __attribute__(target()) syntax. See godbolt.org/z/wPPJss.Dehumidifier
OK. It seems Intel Compiler doesn't support this - software.intel.com/en-us/…. It is only supports MIC - software.intel.com/en-us/….Dehumidifier
@Royi: I added an example of what I meant about using a #define. This kind of thing is 100% standard, using the empty string as a definition for a macro on compilers that don't need or don't support something. Like you might do the same for restrict, if you cared about any compilers that didn't support __restrict in C++ mode.Exegete
The problem is GCC __attribute__ needs to be next to the function declaration while ICC equivalent (software.intel.com/en-us/…) needs to be nest to the function definition. I wish they all had some kind of standard.Dehumidifier
@Royi: so you need 2 separate macros, one for declarations and one for definitions. For GCC, they get the same value. For ICC, the declaration part is empty. Less convenient, but still easily solvable with CPP macros.Exegete
I agree. Just wish there was more elegant way for CPU dispatching. By the way, any h files based library to inspect CPU and OS Features like Agner's AsmLib (Which a lib)?Dehumidifier
@Royi: Obviously you can't have a header that detects runtime features at compile time. People have made headers like Agner Fog's VectorClass that try to simplify compile-time detection of what target options were enabled across MSVC vs. other compilers, because MSVC doesn't define __SSE4_1__, only a few macros like AVX. There are compiler-specific helpers for runtime dispatching, like gcc's ifunc stuff, but it's a hard problem with efficiency tradeoffs and dependencies on linkers (depending on the mechanism you use). There isn't a portable standard way, AFAIK.Exegete
But compile-time dispatching with #ifdef __AVX__ is pretty easy, except on MSVC if you care about features it doesn't have macros for.Exegete
Peter, I mean dispatching at Run Time. The whole purpose is building few versions of the same function and then dispatch them on Run Time.Dehumidifier
@Royi: Then how could you possibly make CPU feature detection a header-only thing? (Unless you just mean you wanted inline functions defined in a header? But why, you only need to query CPU features once at program startup and record the result.)Exegete
I meant h and c and not Pre Compiled Lib and not asm.Dehumidifier
@Royi: Oh. GNU C has it built-in with __builtin_cpu_supports("avx"), but that's not portable to MSVC. See does gcc's __builtin_cpu_supports check for OS support? for an example. I'd assume that somebody has made a simple portable library, though, but I'm not aware of any specific one.Exegete
C
3

If you compile code with -mavx2 enabled your compiler will (usually) generate so-called "VEX encoded" instructions. In case of _mm_loadu_ps, this will generate vmovups instead of movups, which is almost equivalent, except that the latter will only modify the lower 128 bit of the target register, whereas the former will zero-out everything above the lower 128 bits. However, it will only run on machines which support at least AVX. Details on [v]movups are here.

For other instructions like [v]addps, AVX has the additional advantage of allowing three operands (i.e., the target can be different from both sources), which in some cases can avoid copying registers. E.g.,

_mm_mul_ps(_mm_add_ps(a,b), _mm_sub_ps(a,b));

requires a register copy (movaps) when compiled for SSE, but not when compiled for AVX: https://godbolt.org/z/YHN5OA


Regarding using AVX-intrinsics but compiling without AVX, compilers either fail (like gcc/clang) or silently generate the corresponding instructions which would then fail on machines without AVX support (see @PeterCordes answer for details on that).


Addendum: If you want to implement different functions depending on the architecture (at compile-time) you can check that using #ifdef __AVX__ or #if defined(__AVX__): https://godbolt.org/z/ZVAo-7

Implementing them in the same compilation unit is difficult, I think. The easiest solutions are to built different shared-libraries or even different binaries and have a small binary which detects the available CPU features and loads the corresponding library/binary. I assume there are related questions on that topic.

Cotinga answered 18/4, 2019 at 20:25 Comment(1)
Thank you for the added information. Have a look at the added information (The context) in my question.Dehumidifier

© 2022 - 2024 — McMap. All rights reserved.