`-mavx` / `-mavx2` / `-mavx512f` (and `-march=` options that imply them with relevant tuning settings) let GCC use AVX / AVX2 / AVX-512 instructions for anything it thinks is a good idea when compiling your code, including but not limited to auto-vectorization of loops, if you also enable that.

Other use-cases for SSE instructions (where GCC will use the AVX encoding if you tell it AVX is enabled) include copying and zero-initializing structs and arrays, and other cases of inlining small constant-size `memset` and `memcpy`. And also scalar FP math, even at `-O0` in 64-bit code, where `-mfpmath=sse` is the default.

Code built with `-mavx` usually can't be run on CPUs without AVX, even if auto-vectorization wasn't enabled and you didn't use any AVX intrinsics; it makes GCC use the VEX encoding instead of legacy SSE for every SIMD instruction. AVX2, on the other hand, doesn't usually get used except when actually auto-vectorizing a loop. It's not relevant for just copying data around, or for scalar FP math. GCC will use scalar FMA instructions if `-mfma` is enabled, though.
## Examples on Godbolt

    void ext(void *);
    void caller(void){
        int arr[16] = {0};
        ext(arr);
    }

    double fp(double a, double b){
        return b-a;
    }

compiles with AVX instructions with `gcc -O2 -fno-tree-vectorize -march=haswell`, because when AVX is enabled, GCC completely avoids legacy-SSE encodings everywhere.

    caller:
        sub     rsp, 72
        vpxor   xmm0, xmm0, xmm0
        mov     rdi, rsp
        vmovdqa XMMWORD PTR [rsp], xmm0     # only 16-byte vectors, not using YMM + vzeroupper
        vmovdqa XMMWORD PTR [rsp+16], xmm0
        vmovdqa XMMWORD PTR [rsp+32], xmm0
        vmovdqa XMMWORD PTR [rsp+48], xmm0
        call    ext
        add     rsp, 72
        ret
    fp:
        vsubsd  xmm0, xmm1, xmm0
        ret
`-m` options do not enable auto-vectorization; `-ftree-vectorize` does that. It's on at `-O3` and higher. (Or, in a limited form, at `-O2` with GCC 12 and later, which only vectorizes when it's "very cheap", like when it knows the iteration count is a multiple of 4 or whatever, so it can vectorize without a cleanup loop. Clang fully enables auto-vectorization at `-O2`.)
If you do want auto-vectorization with enabled extensions, use `-O3` as well, and preferably `-march=native` or `-march=znver2` or something instead of just `-mavx2`. `-march` sets tuning options as well, and will enable other ISA extensions you probably forgot about, like `-mfma` and `-mbmi2`.
The tuning options implied by `-march=haswell` (or just `-mtune=haswell`) are especially useful on older GCC, when `tune=generic` cared more about old CPUs that didn't have AVX2, or where doing unaligned 256-bit loads as two separate parts was a win in some cases: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?

Unfortunately there isn't anything like `-mtune=generic-avx2` or `-mtune=enabled-extension` to still care about both AMD and Intel CPUs, but not about ones too old for all the extensions you enabled.
When manually vectorizing with intrinsics, you can only use intrinsics for instruction-sets you've enabled. (Or ones that are on by default, like SSE2, which is baseline for x86-64 and often enabled even with `-m32` in modern GCC configs.)

e.g. if you use `_mm256_add_epi32`, your code won't compile unless you use `-mavx2`. (Or better, something like `-march=haswell` or `-march=native` that enables AVX2, FMA, BMI2, and other stuff modern x86 has, and sets appropriate tuning options.)

The GCC error message in that case is `error: inlining failed in call to 'always_inline' '_mm256_loadu_si256': target specific option mismatch`.
In GCC terminology, the "target" is the machine you're compiling for, i.e. `-mavx2` tells GCC that the target supports AVX2. Thus GCC will make an executable that might use AVX2 instructions anywhere, e.g. for copying a struct or zero-initializing a local array, or otherwise expanding a small constant-size memcpy or memset.

It will also define the CPP macro `__AVX2__`, so `#ifdef __AVX2__` can test whether AVX2 can be assumed at compile time.
If that's not what you want for the whole program, you need to make sure not to use `-mavx2` to compile any code that gets called without a run-time check of CPU features. e.g. put your AVX2 versions of functions in a separate file to compile with `-mavx2`, or use `__attribute__((target("avx2")))`. Have your program set function pointers after checking `__builtin_cpu_supports("avx2")`, or use GCC's `ifunc` dispatching mechanism to do multi-versioning.
## `-m` options do not on their own enable auto-vectorization

(Auto-vectorization is not the only way GCC can use SIMD instruction sets.)
`-ftree-vectorize` (enabled as part of `-O3`, or even at `-O2` in GCC 12 and later) is necessary for GCC to auto-vectorize. And/or `-fopenmp` if the code has some `#pragma omp simd`. (You definitely always want at least `-O2` or `-Os` if you care about performance; `-O3` should be fastest, but may not always be. Sometimes GCC has missed-optimization bugs where `-O3` makes things worse, or in large programs the larger code-size can cost more I-cache and I-TLB misses.)
When auto-vectorizing and optimizing in general, GCC will (maybe) use any instruction sets you told it were available (with `-m` options). So for example, `-O3 -march=haswell` will auto-vectorize with AVX2 + FMA. `-O3` without `-m` options will just auto-vectorize with SSE2.

e.g. compare on Godbolt GCC `-O3 -march=nehalem` (SSE4.2) vs. `-march=znver2` (AVX2) for summing an integer array. (Compile-time-constant size to keep the asm simple.)
If you use `-O3 -mgeneral-regs-only` (the latter option is normally only used in kernel code), GCC will still auto-vectorize, but only in cases where it thinks it's profitable to do SWAR (e.g. xor of an array is straightforward using 64-bit integer regs, or even sum of bytes using SWAR bit-hacks to block/correct for carry between bytes).

e.g. `gcc -O1 -mavx` still just uses scalar code.

Normally if you want full optimization but not auto-vectorization, you'd use something like `-O3 -march=znver1 -fno-tree-vectorize`.
## Other compilers

All of the above is true for clang as well, except it doesn't understand `-mgeneral-regs-only`. (I think you'd need `-mno-mmx -mno-sse` and maybe other options.)

(The Effect of Architecture When Using SSE / AVX Intrinsics repeats some of this info.)
For MSVC / ICC, you can use intrinsics for ISA extensions you haven't told the compiler it can use on its own. So for example, MSVC `-O2` without `-arch:AVX` would let it auto-vectorize with SSE2 (because that's baseline for x86-64), and use `movaps` for copying around 16-byte structs or whatever.

But with MSVC's style of target options, you can still use SSE4 intrinsics like `_mm_cvtepi8_epi32` (`pmovsxbd`), or even AVX intrinsics, without telling the compiler it's allowed to use those instructions itself.
Older MSVC used to make really bad asm when you used AVX / AVX2 intrinsics without `-arch:AVX`, e.g. mixing VEX and legacy-SSE encodings in the same function (such as using the non-VEX encoding for 128-bit intrinsics like `_mm_add_ps`), and failing to use `vzeroupper` after 256-bit vectors, both of which were disastrous for performance.

But I think modern MSVC has mostly solved that. Although it still doesn't optimize intrinsics much at all, like not even doing constant-propagation through them.

Not optimizing intrinsics is likely related to MSVC's ability to let you write code like `if(avx_supported) { __m256 v = _mm256_load_ps(p); ...` and so on. If it was trying to optimize, it would have to keep track of the minimum extension-level already seen along paths of execution that could reach any given intrinsic, so it would know what alternatives were valid. ICC is like that, too.

For the same reason, GCC can't inline functions with different target options into each other. So you can't use `__attribute__((target("")))` to avoid the cost of run-time dispatching; you still want to avoid function-call overhead inside a loop, i.e. make sure there's a loop inside the AVX2 function. Otherwise it may not be worth having an AVX2 version; just use the SSE2 version.

I don't know about Intel's new OneAPI compiler, ICX. I think it's based on LLVM, so it might be more like clang.