TL;DR: Don't use the _mm256_zeroupper() intrinsic manually; compilers understand SSE/AVX transition stuff and emit vzeroupper where needed for you. (Including when auto-vectorizing or expanding memcpy/memset/whatever with YMM regs.)
"Some Intel processors" being all except Xeon Phi.
Xeon Phi (KNL / KNM) doesn't have a state optimized for running legacy SSE instructions because it's purely designed to run AVX-512. Legacy SSE instructions probably always have false dependencies merging into the destination.
On mainstream CPUs with AVX or later, there are two different mechanisms: saving dirty uppers (SnB through Haswell, and Ice Lake) or false dependencies (Skylake). See Why is this SSE code 6 times slower without VZEROUPPER on Skylake? for details on the two different styles of SSE/AVX penalty.
See also related Q&As about the effects of asm vzeroupper (in the compiler-generated machine code).
Intrinsics in C or C++ source
You should pretty much never use _mm256_zeroupper()
in C/C++ source code. Things have settled on having the compiler insert a vzeroupper
instruction automatically where it might be needed, which is pretty much the only sensible way for compilers to be able to optimize functions containing intrinsics and still reliably avoid transition penalties. (Especially when considering inlining.) All the major compilers can auto-vectorize and/or inline memcpy/memset/array init with YMM registers, so they need to keep track of using vzeroupper after that.
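For example, here's a minimal sketch of the kind of thing compilers handle on their own (use_ymm and other_func are made-up names); building something like this with gcc or clang -O2 -mavx, you'd normally see a vzeroupper placed before the call:

#include <immintrin.h>

void other_func(void);      // not AVX-aware, as far as the compiler knows

void use_ymm(float *dst, const float *src)
{
    __m256 v = _mm256_loadu_ps(src);              // dirties a YMM upper half
    _mm256_storeu_ps(dst, _mm256_add_ps(v, v));
    other_func();    // compiler emits vzeroupper before this call on its own
}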
The convention is to have the CPU in clean-uppers state when calling or returning, except when calling functions that take __m256
/ __m256i/d
args by value (in registers or at all), or when returning such a value. The target function (callee or caller) inherently must be AVX-aware and expecting a dirty-upper state because a full YMM register is in-use as part of the calling convention.
x86-64 System V passes vectors in vector regs. Windows vectorcall
does, too, but the original Windows x64 convention (now named "fastcall" to distinguish from "vectorcall") passes vectors by value in memory via hidden pointer. (This optimizes for variadic functions by making every arg always fit in an 8-byte slot.) IDK how compilers handle this for Windows non-vectorcall calls: whether they assume the callee probably looks at its args, or is at least still responsible for using a vzeroupper at some point even if it doesn't. Probably yes, but if you're writing your own code-gen back-end, or hand-written asm, have a look at what some compilers you care about actually do if this case is relevant for you.
Some compilers optimize by also omitting vzeroupper
before returning from a function that took vector args, because clearly the caller is AVX-aware. And crucially, apparently compilers shouldn't expect that calling a function like void foo(__m256i)
will leave the CPU in a clean-upper state, so the caller does still need a vzeroupper after such a call, before call printf or whatever.
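As a sketch of that caller-side situation (consume is a made-up function that takes a vector by value, so it must be AVX-aware; exact code-gen varies by compiler):

#include <immintrin.h>
#include <stdio.h>

void consume(__m256i v);    // hypothetical: takes a YMM arg by value

void caller(const int *p)
{
    __m256i v = _mm256_loadu_si256((const __m256i*)p);
    consume(v);     // no vzeroupper needed before this call: the arg itself is a live YMM
    // The caller can't assume consume() left the uppers clean, so the compiler
    // still emits a vzeroupper here before calling a non-vector function.
    puts("done");
}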
Compilers have options to control vzeroupper usage
For example, GCC -mno-vzeroupper
/ clang -mllvm -x86-use-vzeroupper=0
. (The default is -mvzeroupper
to do the behaviour described above, using it where it might be needed.)
-mno-vzeroupper is implied by -march=knl (Knight's Landing) because vzeroupper isn't needed and is very slow on Xeon Phi CPUs (thus should actively be avoided).
Or you might possibly want it if you build libc (and any other libraries you use) with -mavx -mno-vzeroupper
. glibc has some hand-written asm for functions like strlen, but most of those have AVX2 versions. So as long as you're not on an AVX1-only CPU, legacy-SSE versions of string functions might not get used at all.
For MSVC, you should definitely prefer using /arch:AVX
when compiling code that uses AVX intrinsics. I think some versions of MSVC could generate code that caused transition penalties if you mixed __m128
and __m256
without /arch:AVX
. But beware that that option will make even 128-bit intrinsics like _mm_add_ps
use the AVX encoding (vaddps
) instead of legacy SSE (addps
), and will let the compiler auto-vectorize with AVX. There is an undocumented switch /d2vzeroupper to enable automatic vzeroupper generation (the default); /d2vzeroupper- disables it - see What is the /d2vzeroupper MSVC compiler optimization flag doing?
Corner case where MSVC and GCC/clang can be tricked into executing a legacy-SSE encoding that writes an XMM register with dirty uppers:
Compiler heuristics may be assuming that there will be a VEX encoding available for any instruction in a function that's definitely (unconditionally) already executed AVX instructions. But that's not the case; some, like cvtpi2ps xmm, mm
(MMX+SSE) or movq2dq xmm, mm
(SSE2) don't have VEX forms. Nor does _mm_sha1rnds4_epu32
- SHA extensions were first introduced on Silvermont-family CPUs, which didn't support AVX until Gracemont (Alder Lake), so sha1rnds4 only got a 128-bit non-VEX encoding and still hasn't been given a VEX form.
#include <immintrin.h>
void bar(char *dst, char *src)
{
    __m256 vps = _mm256_loadu_ps((float*)src);
    _mm256_storeu_ps((float*)dst, _mm256_sqrt_ps(vps));
#if defined(__SHA__) || defined(_MSC_VER)
    __m128i t1 = _mm_loadu_si128((__m128i*)&src[32]);
    // possible MSVC bug: writing an XMM reg with a legacy (non-VEX) encoding while a YMM upper might be dirty
    __m128i t2 = _mm_sha1rnds4_epu32(t1, t1, 3);   // only a non-VEX form exists
    t1 = _mm_add_epi8(t1, t2);
    _mm_storeu_si128((__m128i*)&dst[32], t1);
#endif
#ifdef __MMX__  // MSVC for some reason dropped MMX support in 64-bit mode; IDK if it defines __MMX__ even in 32-bit but whatever
    __m128 tmpps = _mm_loadu_ps((float*)&src[48]);
    tmpps = _mm_cvtpi32_ps(tmpps, *(__m64*)&src[48]);
    _mm_storeu_ps((float*)&dst[48], tmpps);
#endif
}
(This is not a sensible way to use SHA or cvtpi2ps
, just randomly using vpaddb
to force some extra register copying.)
Godbolt
# clang -O3 -march=icelake-client
bar(char*, char*):
vsqrtps ymm0, ymmword ptr [rsi]
vmovups ymmword ptr [rdi], ymm0 # first block, AVX1
vmovdqu xmm0, xmmword ptr [rsi + 32]
vmovdqa xmm1, xmm0
sha1rnds4 xmm1, xmm0, 3 # non-VEX encoding while uppers still dirty.
vpaddb xmm0, xmm1, xmm0
vmovdqu xmmword ptr [rdi + 32], xmm0
vmovups xmm0, xmmword ptr [rsi + 48]
movdq2q mm0, xmm0
cvtpi2ps xmm0, mm0 # again same thing
vmovups xmmword ptr [rdi + 48], xmm0
vzeroupper # vzeroupper not done until here, too late for code in this function.
ret
MSVC and GCC are about the same. (Although GCC optimizes away the use of the MMX register in this case, using vcvtdq2ps
/ vshufps
. That presumably wouldn't always happen.)
These are compiler bugs that should be fixed in the compiler, although you may be able to work around them with _mm256_zeroupper() in specific cases if necessary.
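For example, a possible workaround for the bar() case above (a sketch only; check the generated asm for your compiler version to confirm where the zeroing actually lands):

#include <immintrin.h>

// compile with SHA support (e.g. gcc/clang -O2 -mavx -msha)
void bar_workaround(char *dst, char *src)
{
    __m256 vps = _mm256_loadu_ps((float*)src);
    _mm256_storeu_ps((float*)dst, _mm256_sqrt_ps(vps));

    _mm256_zeroupper();   // manually clean the uppers before the non-VEX-only instruction

    __m128i t1 = _mm_loadu_si128((__m128i*)&src[32]);
    __m128i t2 = _mm_sha1rnds4_epu32(t1, t1, 3);   // sha1rnds4 only has a non-VEX encoding
    _mm_storeu_si128((__m128i*)&dst[32], _mm_add_epi8(t1, t2));
}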
Normally compiler heuristics work fine; e.g. the asm block for if(a) _mm256...
will end with a vzeroupper
if later code in the function might conditionally run legacy SSE encodings of normal instructions like paddb
. (This mixing is only possible with MSVC; gcc/clang require functions containing AVX1/2 intrinsics to be compiled with -mavx or __attribute__((target("avx"))) / target("avx2"), which lets them use vpaddb for _mm_add_epi8 anywhere in the function. You have to branch / dispatch based on CPU features on a per-function level, which makes sense because normally you'd want to run a whole loop with AVX or not.)
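For example, a minimal sketch of that per-function dispatch style for gcc/clang (function names are made up; the target attribute and __builtin_cpu_supports are the relevant pieces, and the compiler will normally place its own vzeroupper at the end of the AVX2 function):

#include <immintrin.h>
#include <stddef.h>

__attribute__((target("avx2")))
static void add_arrays_avx2(int *dst, const int *src, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i a = _mm256_loadu_si256((const __m256i*)&dst[i]);
        __m256i b = _mm256_loadu_si256((const __m256i*)&src[i]);
        _mm256_storeu_si256((__m256i*)&dst[i], _mm256_add_epi32(a, b));
    }
    for (; i < n; i++)      // scalar tail
        dst[i] += src[i];
}

static void add_arrays_scalar(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

void add_arrays(int *dst, const int *src, size_t n)
{
    // Dispatch per call (or cache the choice): run the whole loop with AVX2
    // or not, instead of mixing encodings inside one function.
    if (__builtin_cpu_supports("avx2"))
        add_arrays_avx2(dst, src, n);
    else
        add_arrays_scalar(dst, src, n);
}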