Do I need to use _mm256_zeroupper in 2021?

From Agner Fog's "Optimizing software in C++":

There is a problem when mixing code compiled with and without AVX support on some Intel processors. There is a performance penalty when going from AVX code to non-AVX code because of a change in the YMM register state. This penalty should be avoided by calling the intrinsic function _mm256_zeroupper() before any transition from AVX code to non-AVX code. This can be necessary in the following cases:

• If part of a program is compiled with AVX support and another part of the program is compiled without AVX support then call _mm256_zeroupper() before leaving the AVX part.

• If a function is compiled in multiple versions with and without AVX using CPU dispatching then call _mm256_zeroupper() before leaving the AVX part.

• If a piece of code compiled with AVX support calls a function in a library other than the library that comes with the compiler, and the library has no AVX support, then call _mm256_zeroupper() before calling the library function.

I'm wondering which Intel processors those are. Specifically, are there any such processors made in the last five years? I want to know whether it is too late to fix missing _mm256_zeroupper() calls or not.

Employment answered 11/8, 2021 at 5:40 Comment(0)

TL:DR: Don't use the _mm256_zeroupper() intrinsic manually, compilers understand SSE/AVX transition stuff and emit vzeroupper where needed for you. (Including when auto-vectorizing or expanding memcpy/memset/whatever with YMM regs.)


"Some Intel processors" being all except Xeon Phi.

Xeon Phi (KNL / KNM) don't have a state optimized for running legacy SSE instructions because they're purely designed to run AVX-512. Legacy SSE instructions probably always have false dependencies merging into the destination.

On mainstream CPUs with AVX or later, there are two different mechanisms: saving dirty uppers (SnB through Haswell, and Ice Lake) or false dependencies (Skylake). See Why is this SSE code 6 times slower without VZEROUPPER on Skylake? for the two different styles of SSE/AVX penalty.

There are related Q&As about the effects of asm vzeroupper (in the compiler-generated machine code).


Intrinsics in C or C++ source

You should pretty much never use _mm256_zeroupper() in C/C++ source code. Things have settled on having the compiler insert a vzeroupper instruction automatically where it might be needed, which is pretty much the only sensible way for compilers to be able to optimize functions containing intrinsics and still reliably avoid transition penalties (especially when considering inlining). All the major compilers can auto-vectorize and/or inline memcpy/memset/array init with YMM registers, so they need to keep track of dirty uppers and use vzeroupper after that anyway.
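
For example, a minimal sketch (the function names here are made up for illustration): built with gcc/clang -O2 -mavx2 or MSVC /arch:AVX2, the generated asm should contain a vzeroupper before the call, with no _mm256_zeroupper() anywhere in the source.

#include <immintrin.h>

void consume(float *dst);   // external function, possibly compiled without AVX

void scale(float *dst, const float *src)
{
    __m256 v = _mm256_mul_ps(_mm256_loadu_ps(src), _mm256_set1_ps(2.0f));
    _mm256_storeu_ps(dst, v);
    consume(dst);   // compilers emit vzeroupper before this call on their own,
                    // because consume() takes no vector args and might run legacy-SSE code
}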

The convention is to have the CPU in clean-uppers state when calling or returning, except when calling functions that take __m256 / __m256i/d args by value (in registers or at all), or when returning such a value. The target function (callee or caller) inherently must be AVX-aware and expecting a dirty-upper state because a full YMM register is in-use as part of the calling convention.

x86-64 System V passes vectors in vector regs. Windows vectorcall does, too, but the original Windows x64 convention (now named "fastcall" to distinguish from "vectorcall") passes vectors by value in memory via hidden pointer. (This optimizes for variadic functions by making every arg always fit in an 8-byte slot.) IDK how compilers compiling Windows non-vectorcall calls handle this, whether they assume the function probably looks at its args or at least is still responsible for using a vzeroupper at some point even if it doesn't. Probably yes, but if you're writing your own code-gen back-end, or hand-written asm, have a look at what some compilers you care about actually do if this case is relevant for you.

Some compilers optimize by also omitting vzeroupper before returning from a function that took vector args, because clearly the caller is AVX-aware. And crucially, compilers apparently shouldn't expect that calling a function like void foo(__m256i) will leave the CPU in clean-upper state, so the caller does still need a vzeroupper after calling such a function, before calling printf or whatever.
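
A sketch of how that convention plays out (scale256() is a hypothetical function that takes and returns __m256 by value): the call to scale256() may happen with dirty uppers, but compilers still emit a vzeroupper before the printf call, which takes no vector args and might run legacy-SSE code.

#include <immintrin.h>
#include <stdio.h>

__m256 scale256(__m256 v);   // takes/returns a YMM register by value: must be AVX-aware

void use(float *out, const float *in)
{
    __m256 v = _mm256_loadu_ps(in);
    v = scale256(v);                // no vzeroupper required around this call
    _mm256_storeu_ps(out, v);
    printf("done\n");               // but a vzeroupper is emitted before this call
}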


Compilers have options to control vzeroupper usage

For example, GCC -mno-vzeroupper / clang -mllvm -x86-use-vzeroupper=0. (The default is -mvzeroupper, i.e. the behaviour described above: inserting it where it might be needed.)

This is implied by -march=knl (Knight's Landing) because it's not needed and very slow on Xeon Phi CPUs (thus should actively be avoided).

Or you might possibly want it if you build libc (and any other libraries you use) with -mavx -mno-vzeroupper. glibc has some hand-written asm for functions like strlen, but most of those have AVX2 versions. So as long as you're not on an AVX1-only CPU, legacy-SSE versions of string functions might not get used at all.

For MSVC, you should definitely prefer /arch:AVX when compiling code that uses AVX intrinsics. I think some versions of MSVC could generate code that caused transition penalties if you mixed __m128 and __m256 without /arch:AVX. But beware that that option will make even 128-bit intrinsics like _mm_add_ps use the AVX encoding (vaddps) instead of legacy SSE (addps), and will let the compiler auto-vectorize with AVX. There is an undocumented switch /d2vzeroupper to enable automatic vzeroupper generation (the default), and /d2vzeroupper- to disable it - see What is the /d2vzeroupper MSVC compiler optimization flag doing?
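
To make the flag behaviour concrete, a small sketch (the file name and exact command lines are illustrative; adjust to your toolchain): the same _mm_add_ps compiles to addps or vaddps depending on the flags, and automatic vzeroupper can be toggled with the options above.

// foo.cpp
#include <immintrin.h>

__m128 add4(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);   // addps without AVX enabled, vaddps with -mavx / /arch:AVX
}

// g++ -O2 foo.cpp                         -> legacy-SSE addps
// g++ -O2 -mavx2 foo.cpp                  -> VEX encodings; -mvzeroupper is on by default
// g++ -O2 -mavx2 -mno-vzeroupper foo.cpp  -> VEX encodings, no automatic vzeroupper
// cl /O2 /arch:AVX2 foo.cpp               -> VEX encodings; /d2vzeroupper is on by default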


Corner case where MSVC and GCC/clang can be tricked into executing a legacy-SSE encoding that writes an XMM register with dirty uppers:

Compiler heuristics may be assuming that there will be a VEX encoding available for any instruction in a function that's definitely (unconditionally) already executed AVX instructions. But that's not the case; some, like cvtpi2ps xmm, mm (MMX+SSE) or movq2dq xmm, mm (SSE2), don't have VEX forms. Nor does _mm_sha1rnds4_epu32 - it was first introduced on the Silvermont family, which didn't support AVX until Gracemont (Alder Lake), so it was introduced with a 128-bit non-VEX encoding and still hasn't got a VEX encoding.

#include <immintrin.h>

void bar(char *dst, char *src)
{
      __m256 vps = _mm256_loadu_ps((float*)src);
      _mm256_storeu_ps((float*)dst, _mm256_sqrt_ps(vps));

#if defined(__SHA__) || defined(_MSC_VER)
        __m128i t1 = _mm_loadu_si128((__m128i*)&src[32]);
                 // possible MSVC bug: writing an XMM register with a legacy-SSE (non-VEX) encoding while an upper might be dirty
        __m128i t2 = _mm_sha1rnds4_epu32(t1,t1, 3);  // only a non-VEX form exists
        t1 = _mm_add_epi8(t1,t2);
        _mm_storeu_si128((__m128i*)&dst[32], t1);
#endif
#ifdef __MMX__  // MSVC for some reason dropped MMX support in 64-bit mode; IDK if it defines __MMX__ even in 32-bit but whatever
        __m128 tmpps = _mm_loadu_ps((float*)&src[48]);
        tmpps = _mm_cvtpi32_ps(tmpps, *(__m64*)&src[48]);
        _mm_storeu_ps((float*)&dst[48], tmpps);
#endif

}

(This is not a sensible way to use SHA or cvtpi2ps, just randomly using vpaddb to force some extra register copying.)

Godbolt

# clang -O3 -march=icelake-client
bar(char*, char*):
        vsqrtps ymm0, ymmword ptr [rsi]
        vmovups ymmword ptr [rdi], ymm0   # first block, AVX1

        vmovdqu xmm0, xmmword ptr [rsi + 32]
        vmovdqa xmm1, xmm0
        sha1rnds4       xmm1, xmm0, 3     # non-VEX encoding while uppers still dirty.
        vpaddb  xmm0, xmm1, xmm0
        vmovdqu xmmword ptr [rdi + 32], xmm0

        vmovups xmm0, xmmword ptr [rsi + 48]
        movdq2q mm0, xmm0
        cvtpi2ps        xmm0, mm0         # again same thing
        vmovups xmmword ptr [rdi + 48], xmm0
        vzeroupper                        # vzeroupper not done until here, too late for code in this function.
        ret

MSVC and GCC are about the same. (Although GCC optimizes away the use of the MMX register in this case, using vcvtdq2ps / vshufps. That presumably wouldn't always happen.)

These are compiler bugs that should be fixed in the compiler, although you may be able to work around them with _mm256_zeroupper() in specific cases if necessary.
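
If you did want to work around it manually in the bar() example above, a sketch (the compiler is still free to reorder or drop the intrinsic where it thinks it's redundant, so check the asm):

      // inside bar(), right after the 256-bit part:
      _mm256_storeu_ps((float*)dst, _mm256_sqrt_ps(vps));
      _mm256_zeroupper();   // clean the uppers before any non-VEX instruction that writes an XMM reg

      __m128i t1 = _mm_loadu_si128((__m128i*)&src[32]);
      __m128i t2 = _mm_sha1rnds4_epu32(t1, t1, 3);   // now runs with clean uppers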


Normally compiler heuristics work fine; e.g. the asm block for if(a) _mm256... will end with a vzeroupper if later code in the function might conditionally run legacy SSE encodings of normal instructions like paddb. (This is only possible with MSVC; gcc/clang require functions containing AVX1 / 2 instructions to be compiled with __attribute__((target("avx"))) or "avx2", which lets them use vpaddb for _mm_add_epi8 anywhere in the function. You have to branch / dispatch based on CPU features on a per-function level, which makes sense because normally you'd want to run a whole loop with AVX or not.)
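
For reference, a sketch of the per-function dispatch style that gcc/clang expect (function names are made up; tail elements are ignored for brevity): the AVX2 version gets the target attribute, and the compiler puts a vzeroupper before its ret since it has no vector args or return value.

#include <immintrin.h>
#include <stddef.h>

__attribute__((target("avx2")))
static void add_avx2(int *dst, const int *a, const int *b, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i*)&a[i]);
        __m256i vb = _mm256_loadu_si256((const __m256i*)&b[i]);
        _mm256_storeu_si256((__m256i*)&dst[i], _mm256_add_epi32(va, vb));
    }
}   // gcc/clang emit vzeroupper before returning from here

static void add_sse2(int *dst, const int *a, const int *b, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i*)&a[i]);
        __m128i vb = _mm_loadu_si128((const __m128i*)&b[i]);
        _mm_storeu_si128((__m128i*)&dst[i], _mm_add_epi32(va, vb));
    }
}

void add_dispatch(int *dst, const int *a, const int *b, size_t n)
{
    if (__builtin_cpu_supports("avx2"))   // dispatch at function granularity
        add_avx2(dst, a, b, n);
    else
        add_sse2(dst, a, b, n);
}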

Libidinous answered 11/8, 2021 at 8:12 Comment(23)
Wrote to Agner, he replied he will mention that the compiler may add _mm256_zeroupper automatically with the next manual update – Employment
@AlexGuteniev: Hopefully he'll actually say that vzeroupper (the asm instruction) is added automatically by the compiler. _mm256_zeroupper is an intrinsic, and compilers don't work by transforming the source, they work by emitting asm. It makes little to no sense to say that _mm256_zeroupper() is added automatically, just that compilers understand SSE-AVX transition effects well enough that it's not needed. – Libidinous
There is a good reason to disable automatic generation of vzeroupper - when you call your own AVX-vectorized functions between different translation units. If a function doesn't take or return vectors, the compiler has to assume it expects legacy SSE state and generate vzeroupper. In this case one should disable automatic vzeroupper and insert the intrinsic manually for the given TUs where it matters. You can leave it enabled for other TUs. – Becalmed
@PeterCordes, it is updated. You'd be disappointed: "The compiler may or may not insert _mm256_zeroupper() automatically. The assembly output from the compiler will tell what it does." – Employment
I assume compilers would only put _mm256_zeroupper() at the end of the function, right? Meaning if you need to insert it anywhere else (to guard against unavailable instructions), you need to do it manually? Might be worth mentioning that. – Chatterer
@user541686: no, compilers will put it before calls to unknown functions which don't have __m256 args or return values, and thus might not have been compiled to use AVX instructions. – Libidinous
@PeterCordes: Sure I guess, but that seems beside the point I was trying to make - if you have any function with blocks like if (has_avx) { ... } else { ... } the compiler couldn't possibly know (in general) to avoid _mm256_zeroupper() for the non-AVX path, right? (imagine a more complicated code structure than this... obviously some heuristics will work for trivial cases) – Chatterer
@user541686: Oh, you mean in MSVC without /arch:AVX, so the compiler could generate legacy SSE and VEX encodings in the same function. (GCC / clang can't do that in the first place; you can only use AVX intrinsics in functions with __attribute__((target("avx"))).) The compiler should know to use vzeroupper along any paths of execution that could be reachable with dirty uppers if they contain legacy-SSE encodings. e.g. if you have if/else on two separate booleans, MSVC ends the first if block with vzeroupper. godbolt.org/z/eW5ecs581 – Libidinous
@user541686: This is fine unless you use intrinsics like _mm_sha1msg1_epu32 (SHA) or _mm_cvtpi32_ps (MMX+SSE) that don't have VEX encodings available even with AVX1; then it looks like MSVC and clang generate code that will execute a non-VEX instruction writing an XMM register when uppers are dirty, in a function that starts with unconditional use of AVX1 then 128-bit SHA or MMX. So that's a compiler bug; mixing legacy-SSE with 128-bit VEX is fine, but only with clean uppers. – Libidinous
@PeterCordes: Interesting. The kind of code I had in mind was more like this: godbolt.org/z/6G6KTE5Ev. Notice the lack of vzeroupper after a bunch of AVX moves toward the end. I suppose you could argue that the __m256 outside implies to the compiler that the whole function must require AVX, but imposing that restriction can prevent some useful functions from being written without code bloat. – Chatterer
@user541686: If I'm reading the asm correctly, MSVC is unconditionally loading from and storing the same data back to the uninitialized value and val1, val2 if the code before the printf doesn't write them. (If have_avx is false, it unconditionally executes vmovaps ymm, as well as calling printf without having run vzeroupper.) These are the vmovups ymm0, _value$[esp+96] / vmovups _value$1$[esp+96], ymm0 load/store, where _value$ and _value$1$ are both -32. If have_avx and have_sse2 are both false, it ends up at $LN10@foo: where it executes vmovups and vmovaps. – Libidinous
@user541686: So yes, your example leads to a call to printf with dirty uppers, potentially stalling in library code using SSE2. But only from a caller that's unusably broken for CPUs that don't support AVX. That somehow confuses the compiler. But if the bool was something else, not a CPU-detection but just saying whether to do some specific work or not before and after printf, you'd presumably have the same problem from an uninitialized __m256 that's only assigned to and read inside an if. Yup: godbolt.org/z/GW1zvfcTn. I'd call that an MSVC bug. – Libidinous
(Branching on CPU features inside a single function always seemed dicey to me if you want the optimizer to be able to optimize; that's why GCC/clang don't support it. MSVC chooses not to optimize intrinsics much at all, but mixing SSE/AVX encodings in one function makes things hard for it even then. I wouldn't recommend it.) – Libidinous
Interesting, thanks. At the risk of going on a tangent - another problem I have with specializing for CPUs at the whole-function level (aside from the code bloat) is that there seems to be a combinatorial explosion of them when you account for both AMD and Intel. Not only are there already a gazillion variants with SSE2/3/4/... and AVX2/3/512x (and that's excluding older variants on 32-bit), but there are also other features like POPCNT, FSRM, ERMSB, BMI, BMI2, etc. Depending on the application there can be just too many features used in a function to codegen every combination. – Chatterer
@user541686: For any given loop (or straight-line function like in video encoders such as x264), usually only a few features are relevant, or at least important enough to be worth making another version for. If you had to do a conditional branch just for two 128-bit copies vs. one 256-bit outside a loop like your example, it's often better to just go with the baseline 128-bit. – Libidinous
Intel and AMD have added most features in the same order, so e.g. you're not excluding many CPUs by requiring AVX2 as well as FMA and AVX1 (Piledriver / Steamroller), which would need special tuning anyway because of inefficiencies in Piledriver 256-bit load/store. But yeah, stuff like BMI2 pdep not being fast on AMD until Zen 3 sucks, kind of similar to how some early SSSE3 CPUs had pshufb but it and other shuffles weren't fast, so you'd want tuning based on actual CPU, not just feature bits. Same for Zen1 having slow lane-crossing 256-bit shuffles, something Zen4 avoided with AVX-512. – Libidinous
Thanks, yeah. This reminds me to go on another tangent. Is there any sort of library for feature detection that takes these nuances into account? I mean: aside from raw CPU support, there are so many semi-orthogonal axes to consider: feature availability vs. enablement, CPU vs. OS level support, microcode updates (like TSX disablement), efficiency on numerous CPU product lines (Atom, Core, Celeron, Athlon, Ryzen, etc.), as well as virtualization/emulators. A mortal like me just can't possibly track all of these for all models, OSes, firmware versions, etc. forever... – Chatterer
@user541686: Not that I know of, unfortunately. x264 is open source and has detection that includes detecting "slowshuffle", meaning pshufb exists but isn't always worth it (and some other shuffles with granularity narrower than 64-bit aren't as fast). Stuff like TSX disabling via microcode clears the CPUID feature bit, or at least it should. IIRC Skylake Pentium/Celeron had an erratum that BMI1/2 showed up as available when they're actually not; disabling AVX on those low-end CPUs was presumably done by disabling decode of VEX prefixes entirely, which also takes out BMI1/2 :( – Libidinous
Looks like some compilers don't bother to insert VZEROUPPER in non-optimized builds: godbolt.org/z/ff5qv111M. Reported as the MSVC STL bug github.com/microsoft/STL/issues/3601, not sure how maintainers would decide to proceed. – Employment
@AlexGuteniev: Interesting, even with an __m256i tmp = i1 local var, still no vzeroupper from MSVC or GCC without optimization: godbolt.org/z/a15W9qj5z. Non-optimized builds are generally disastrous for performance of intrinsics (because passing args to wrapper functions introduces even more store/reloads; even though they inline, they may not optimize away those temporary objects). So a missing vzeroupper might not be a problem. I guess that could lead to false dependencies on Skylake in un-optimized code that calls an optimized library which uses legacy SSE code. – Libidinous
The original repro includes a floating-point example, which compiles to SSE, so no debug/release mixing is even needed to recreate it: godbolt.org/z/roWxcYrPx. I think it still might be a problem, as the perf disaster of an unoptimized build multiplies with the missing vzeroupper, though I'm fine if compiler vendors wouldn't think so. – Employment
@AlexGuteniev: Oh right, MSVC without -arch:AVX can create disaster on its own, unlike GCC unless you use __attribute__((target("avx"))) in some functions. But yeah, good point about scalar FP using XMM registers. With the extra store/reload bottlenecks from debug builds, the SKL false dependencies might not be much of a problem, though. And transition stalls on Haswell/Ice Lake only happen once per switch between 256-bit and then legacy-SSE code, so that's much less likely to create a problem inside a tight loop. – Libidinous
@AlexGuteniev: I'm curious whether there are any realistic codebases where missing vzeroupper creates more than a few percent additional slowdown in a debug build, beyond how slow a debug build already is. All else equal, it's better if debug builds are less slow, especially for programs like games or other things with real-time requirements to be tested & debugged, so even a few percent overall isn't something compiler devs should ignore. – Libidinous

The AVX -> SSE penalty without zeroing the upper halves applies to current processors. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual, June 2021.

However, a missing _mm256_zeroupper() in C/C++ code is not necessarily a problem: the compiler may insert vzeroupper by itself. All major compilers do: https://godbolt.org/z/veToerhvG

Experiments show that automatic vzeroupper insertion works in VS 2015, but does not work in VS 2012.

Employment answered 11/8, 2021 at 6:18 Comment(0)
