What exactly do the gcc compiler switches (-mavx -mavx2 -mavx512f) do?
I explicitly use Intel SIMD intrinsics in my C/C++ code. In order to compile the code I need to specify -mavx, or -mavx512f, or something similar on the command line. I'm good with all that.

However, from reading the gcc man page, it's not clear whether these command-line flags also tell the gcc compiler to try to automatically vectorize the C/C++ code with the Intel SIMD instructions. Does someone know if that is the case? Does the -mavx flag simply allow you to manually insert SIMD intrinsics into your code, or does it also tell the compiler to use SIMD instructions when compiling your C/C++ code?

Stempien answered 22/2, 2022 at 22:56 Comment(3)
gcc.gnu.org/onlinedocs/gcc-11.2.0/gcc/… – Tengdin
Read that already. That's basically the man page. It's not clear if gcc tries to autovectorize the code. At least not to me. – Stempien
Whether it tries to auto-vectorize or not is a separate question. These switches enable the use of the instructions, nothing more. – Tengdin

-mavx/-mavx2/-mavx512f (and -march= options that imply them with relevant tuning settings) let GCC use AVX / AVX2 / AVX-512 instructions for anything it thinks is a good idea when compiling your code, including but not limited to auto-vectorization of loops, if you also enable that.

Other use-cases for SSE instructions (where GCC will use the AVX encoding if you tell it AVX is enabled) include copying and zero-initializing structs and arrays, and other cases of inlining small constant-size memset and memcpy. And also scalar FP math, even at -O0 in 64-bit code where -mfpmath=sse is the default.

Code built with -mavx usually can't be run on CPUs without AVX, even if auto-vectorization wasn't enabled and you didn't use any AVX intrinsics; it makes GCC use the VEX encoding instead of legacy SSE for every SIMD instruction. AVX2, on the other hand, doesn't usually get used except when actually auto-vectorizing a loop. It's not relevant for just copying data around, or for scalar FP math. GCC will use scalar FMA instructions if -mfma is enabled, though.

Examples on Godbolt

void ext(void *);
void caller(void){
    int arr[16] = {0};
    ext(arr);
}

double fp(double a, double b){
    return b-a;
}

compiles with AVX instructions with gcc -O2 -fno-tree-vectorize -march=haswell, because when AVX is enabled, GCC completely avoids legacy-SSE encodings everywhere.

caller:
        sub     rsp, 72
        vpxor   xmm0, xmm0, xmm0
        mov     rdi, rsp
        vmovdqa XMMWORD PTR [rsp], xmm0         # only 16-byte vectors, not using YMM + vzeroupper
        vmovdqa XMMWORD PTR [rsp+16], xmm0
        vmovdqa XMMWORD PTR [rsp+32], xmm0
        vmovdqa XMMWORD PTR [rsp+48], xmm0
        call    ext
        add     rsp, 72
        ret

fp:
        vsubsd  xmm0, xmm1, xmm0
        ret

-m options do not enable auto-vectorization; -ftree-vectorize does that. It's on at -O3 and higher. (Or in a limited form at -O2 with GCC12 and later, only vectorizing when "very cheap" like when it knows the iteration count is a multiple of 4 or whatever so it can vectorize without a cleanup loop. clang fully enables auto-vectorization at -O2.)

If you do want auto-vectorization with enabled extensions, use -O3 as well, and preferably -march=native or -march=znver2 or something instead of just -mavx2. -march sets tuning options as well, and will enable other ISA extensions you probably forgot about, like -mfma and -mbmi2.

The tuning options implied by -march=haswell (or just -mtune=haswell) are especially useful on older GCC, when tune=generic cared more about old CPUs that didn't have AVX2, or where doing unaligned 256-bit loads as two separate parts was a win in some cases: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?

Unfortunately there isn't anything like -mtune=generic-avx2 or -mtune=enabled-extension to still care about both AMD and Intel CPUs, but not about ones too old for all the extensions you enabled.


When manually vectorizing with intrinsics, you can only use intrinsics for instruction-sets you've enabled. (Or ones that are on by default, like SSE2 which is baseline for x86-64, and often enabled even with -m32 in modern GCC configs.)

e.g. if you use _mm256_add_epi32, your code won't compile unless you use -mavx2. (Or better, something like -march=haswell or -march=native that enables AVX2, FMA, BMI2, and other stuff modern x86 has, and sets appropriate tuning options.)

The GCC error message in that case is error: inlining failed in call to 'always_inline' '_mm256_loadu_si256': target specific option mismatch.

In GCC terminology, the "target" is the machine you're compiling for. i.e. -mavx2 tells GCC that the target supports AVX2. Thus GCC will make an executable that might use AVX2 instructions anywhere, e.g. for copying a struct or zero-initializing a local array, or otherwise expanding a small constant-size memcpy or memset.

It will also define the CPP macro __AVX2__, so #ifdef __AVX2__ can test whether AVX2 can be assumed at compile-time.

If that's not what you want for the whole program, you need to make sure not to use -mavx2 to compile any code that gets called without a run-time check of CPU features. e.g. put your AVX2 versions of functions in a separate file to compile with -mavx2, or use __attribute__((target("avx2"))). Have your program set function pointers after checking __builtin_cpu_supports("avx2"), or use GCC's ifunc dispatching mechanism to do multi-versioning.


-m options do not on their own enable auto-vectorization

(Auto-vectorization is not the only way GCC can use SIMD instruction sets.)

-ftree-vectorize (enabled as part of -O3, or even at -O2 in GCC12 and later) is necessary for GCC to auto-vectorize. And/or -fopenmp if the code has some #pragma omp simd. (You definitely always want at least -O2 or -Os if you care about performance; -O3 should be fastest, but may not always be. Sometimes GCC has missed-optimization bugs where -O3 makes things worse, or in large programs it might happen that larger code-size costs more I-cache and I-TLB misses.)

When auto-vectorizing and optimizing in general, GCC will (maybe) use any instruction sets you told it were available (with -m options). So for example, -O3 -march=haswell will auto-vectorize with AVX2 + FMA. -O3 without -m options will just auto-vectorize with SSE2.

e.g. compare on Godbolt GCC -O3 -march=nehalem (SSE4.2) vs. -march=znver2 (AVX2) for summing an integer array. (Compile-time constant size to keep the asm simple).

If you use -O3 -mgeneral-regs-only (the latter option normally only used in kernel code), GCC will still auto-vectorize, but only in cases where it thinks it's profitable to do SWAR (e.g. xor of an array is straightforward using 64-bit integer regs, or even a sum of bytes using SWAR bit-hacks to block/correct for carry between bytes).

e.g. gcc -O1 -mavx still just uses scalar code.

Normally if you want full optimization but not auto-vectorization, you'd use something like -O3 -march=znver1 -fno-tree-vectorize


Other compilers

All of the above is true for clang as well, except it doesn't understand -mgeneral-regs-only. (I think you'd need -mno-mmx -mno-sse and maybe other options.)

(The Effect of Architecture When Using SSE / AVX Intrinsics repeats some of this info)

For MSVC / ICC, you can use intrinsics for ISA extensions you haven't told the compiler it can use on its own. So for example, MSVC -O2 without -arch:AVX would let it auto-vectorize with SSE2 (because that's baseline for x86-64), and use movaps for copying around 16-byte structs or whatever.

But with MSVC's style of target options, you can still use SSE4 intrinsics like _mm_cvtepi8_epi32 (pmovsxbd), or even AVX intrinsics, without telling the compiler it's allowed to use those instructions itself.

Older MSVC used to make really bad asm when you used AVX / AVX2 intrinsics without -arch:AVX, e.g. resulting in mixing VEX and legacy-SSE encodings in the same function (e.g. using the non-VEX encoding for 128-bit intrinsics like _mm_add_ps), and failure to use vzeroupper after 256-bit vectors, both of which were disastrous for performance.

But I think modern MSVC has mostly solved that. Although it still doesn't optimize intrinsics much at all, like not even doing constant-propagation through them.

Not optimizing intrinsics is likely related to MSVC's ability to let you write code like if(avx_supported) { __m256 v = _mm256_load_ps(p); ... and so on. If it was trying to optimize, it would have to keep track of the minimum extension-level already seen along paths of execution that could reach any given intrinsic, so it would know what alternatives would be valid. ICC is like that, too.

For the same reason, GCC can't inline functions with different target options into each other. So you can't use __attribute__((target(""))) to avoid the cost of run-time dispatching; you still want to avoid function-call overhead inside a loop, i.e. make sure there's a loop inside the AVX2 function, otherwise it may not be worth having an AVX2 version, just use the SSE2 version.

I don't know about Intel's new OneAPI compiler, ICX. I think it's based on LLVM, so it might be more like clang.

Spirituel answered 23/2, 2022 at 9:40 Comment(1)
(Parts of this answer are redundant; it was getting long so I started again at the top, but then didn't take out much of what I'd already written. I may get back to it, or edits are welcome that remove whole paragraphs if they're really redundant. I thought it might be helpful to some readers to repeat things in more detail a 2nd time, so I left in the more long-winded parts in the middle, but some of it might be excessive. Basically I got tired of editing it and posted what I had :P ) – Spirituel

I currently use gcc 11.3.1 or higher. I am not a programmer, but I do distinguish between C and C++. For three years I have been building the latest codecs from GitHub / the doom9 forum. On my old Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz I notice this: in C you can run SIMD AVX2 (e.g. assembler) codecs on a non-SIMD processor. Can we use the codecs posted on the forum? Who knows. E.g. libjpeg and dav1d with SIMD, built without -mavx2.

xeve, xevd, uvg266, uavs3e, uavs3d, aom, libavif

With C++ SIMD AVX2 builds you won't even get the help output to open. The second thing is threading and Unix/Windows compatibility. In C this works faster than in C++. Also, in C++ you have to add special untested additions like mingw-std-thread to g++ to get everything working. Another curiosity about C++: with MSYS2 GCC 12.1.0, codecs built with AVX2/AVX3 open on old processors. How is that done? I don't know, but not with the options above.

jpegxl, libwebp2, libheif, jvetvvc, vvenc, vvdec, libraw, jpegls, jpegxt, openhtj2k, openjph, grok(C++20 openjpeg)

Kristykristyn answered 31/7, 2022 at 9:51 Comment(8)
If C code actually does use AVX2 instructions, it won't run on a Sandy Bridge CPU like your i5 2500K. There isn't a general difference between C and C++ in how that works; perhaps the code you're building just happens not to use any AVX2 instructions. Most video codecs with hand-written assembly (like x264 / x265) do runtime dispatching based on CPU detection, to avoid running any instructions the current CPU doesn't support. – Spirituel
There's no such thing as AVX3. Do you mean AVX-512? Anyway, your practical experiences might possibly be useful to future readers if you said in more detail what you did. But the things you're claiming aren't generally true, so I don't think that's helpful. For example, godbolt.org/z/qMevsao8s shows a trivial C program that gcc -O3 -march=haswell compiles to use AVX2 instructions. It could optimize away (factorial of a large constant, with unsigned wrapping so the answer is probably always zero), but gcc and clang happen not to. – Spirituel
"Most video codecs with hand-written assembly (like x264 / x265) do runtime dispatching based on CPU detection" doesn't apply to libjpeg or dav1d. For x265 the SIMD is SSE, SSE2, SSE3, SSSE3, SSE4, AVX2. On the site http://msystem.waw.pl/x265/ the codecs have assembler SIMD, e.g. SSE2. x265 [info]: HEVC encoder version 3.5+39; x265 [info]: build info [Windows][GCC 11.3.1][64 bit] 8bit+10bit+12bit; x265 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX – Kristykristyn
Google highway has AVX3 – Kristykristyn
github.com/google/highway claims Ice Lake is required for AVX3_DL. But Intel doesn't use that name for any features Ice Lake has, not that I've ever seen in Intel's asm manuals, or on en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(client) or Wikipedia. I assume it's a made-up name for some AVX-512 feature. Ah, in the source code they're just using it to indicate a set of features including AVX-512VNNI, VPOPCNTDQ, VBMI, etc., which are useful on 32-byte vecs: github.com/google/highway/blob/… – Spirituel
If a library doesn't do CPU feature detection at run time, then a binary built to unconditionally use those instructions will crash on old CPUs that don't support them. (Or worse, give wrong results, like with lzcnt on CPUs without BMI1.) It doesn't matter what source language it was built from. It only matters whether or not the library actually uses those instructions, at least for the inputs you test it with. If a library only uses AVX2 for high color depth, say, then with some input files it would work on your Sandy Bridge. It depends on the library and compile options, not necessarily C vs. C++. – Spirituel
Concluding the argument: I was asked once again why I create codecs without SIMD. After all, the codecs work both with and without. Examples: mediafire.com/file/s6srzkosonu5yai/x265_3.5+39-a599806d3.7z/… mediafire.com/file/buws05sy4o6chy7/VTM_17.1rc1_da38667a.7z/file – Kristykristyn
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Castalia

© 2022 - 2024 — McMap. All rights reserved.