Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

Asked 23/12, 2016 at 15:9 Answered 28/12, 2016 at 9:52

I've been trying to figure out a performance problem in an application and have finally narrowed it down to a really weird problem. The following piece of code runs 6 times slower on a Skylake CPU (i5-6500) if the VZEROUPPER instruction is commented out. I've tested Sandy Bridge and Ivy Bridge CPUs and both versions run at the same speed, with or without VZEROUPPER.

Now I have a fairly good idea of what VZEROUPPER does and I think it should not matter at all to this code when there are no VEX-coded instructions and no calls to any function which might contain them. The fact that it does not on other AVX-capable CPUs appears to support this. So does table 11-2 in the Intel® 64 and IA-32 Architectures Optimization Reference Manual.

So what is going on?

The only theory I have left is that there's a bug in the CPU and it's incorrectly triggering the "save the upper half of the AVX registers" procedure where it shouldn't. Or something else just as strange.

This is main.cpp:

#include <immintrin.h>

int slow_function( double i_a, double i_b, double i_c );

int main()
{
    /* DAZ and FTZ, does not change anything here. */
    _mm_setcsr( _mm_getcsr() | 0x8040 );

    /* This instruction fixes performance. */
    __asm__ __volatile__ ( "vzeroupper" : : : );

    int r = 0;
    for( unsigned j = 0; j < 100000000; ++j )
    {
        r |= slow_function( 
                0.84445079384884236262,
                -6.1000481519580951328,
                5.0302160279288017364 );
    }
    return r;
}

and this is slow_function.cpp:

#include <immintrin.h>

int slow_function( double i_a, double i_b, double i_c )
{
    __m128d sign_bit = _mm_set_sd( -0.0 );
    __m128d q_a = _mm_set_sd( i_a );
    __m128d q_b = _mm_set_sd( i_b );
    __m128d q_c = _mm_set_sd( i_c );

    int vmask;
    const __m128d zero = _mm_setzero_pd();

    __m128d q_abc = _mm_add_sd( _mm_add_sd( q_a, q_b ), q_c );

    if( _mm_comigt_sd( q_c, zero ) && _mm_comigt_sd( q_abc, zero )  )
    {
        return 7;
    }

    __m128d discr = _mm_sub_sd(
        _mm_mul_sd( q_b, q_b ),
        _mm_mul_sd( _mm_mul_sd( q_a, q_c ), _mm_set_sd( 4.0 ) ) );

    __m128d sqrt_discr = _mm_sqrt_sd( discr, discr );
    __m128d q = sqrt_discr;
    __m128d v = _mm_div_pd(
        _mm_shuffle_pd( q, q_c, _MM_SHUFFLE2( 0, 0 ) ),
        _mm_shuffle_pd( q_a, q, _MM_SHUFFLE2( 0, 0 ) ) );
    vmask = _mm_movemask_pd(
        _mm_and_pd(
            _mm_cmplt_pd( zero, v ),
            _mm_cmple_pd( v, _mm_set1_pd( 1.0 ) ) ) );

    return vmask + 1;
}

The function compiles down to this with clang:

 0:   f3 0f 7e e2             movq   %xmm2,%xmm4
 4:   66 0f 57 db             xorpd  %xmm3,%xmm3
 8:   66 0f 2f e3             comisd %xmm3,%xmm4
 c:   76 17                   jbe    25 <_Z13slow_functionddd+0x25>
 e:   66 0f 28 e9             movapd %xmm1,%xmm5
12:   f2 0f 58 e8             addsd  %xmm0,%xmm5
16:   f2 0f 58 ea             addsd  %xmm2,%xmm5
1a:   66 0f 2f eb             comisd %xmm3,%xmm5
1e:   b8 07 00 00 00          mov    $0x7,%eax
23:   77 48                   ja     6d <_Z13slow_functionddd+0x6d>
25:   f2 0f 59 c9             mulsd  %xmm1,%xmm1
29:   66 0f 28 e8             movapd %xmm0,%xmm5
2d:   f2 0f 59 2d 00 00 00    mulsd  0x0(%rip),%xmm5        # 35 <_Z13slow_functionddd+0x35>
34:   00 
35:   f2 0f 59 ea             mulsd  %xmm2,%xmm5
39:   f2 0f 58 e9             addsd  %xmm1,%xmm5
3d:   f3 0f 7e cd             movq   %xmm5,%xmm1
41:   f2 0f 51 c9             sqrtsd %xmm1,%xmm1
45:   f3 0f 7e c9             movq   %xmm1,%xmm1
49:   66 0f 14 c1             unpcklpd %xmm1,%xmm0
4d:   66 0f 14 cc             unpcklpd %xmm4,%xmm1
51:   66 0f 5e c8             divpd  %xmm0,%xmm1
55:   66 0f c2 d9 01          cmpltpd %xmm1,%xmm3
5a:   66 0f c2 0d 00 00 00    cmplepd 0x0(%rip),%xmm1        # 63 <_Z13slow_functionddd+0x63>
61:   00 02 
63:   66 0f 54 cb             andpd  %xmm3,%xmm1
67:   66 0f 50 c1             movmskpd %xmm1,%eax
6b:   ff c0                   inc    %eax
6d:   c3                      retq

The generated code is different with gcc but it shows the same problem. An older version of the Intel compiler generates yet another variation of the function, which shows the problem too, but only if main.cpp is not built with the Intel compiler: that compiler inserts calls to initialize some of its own libraries, which probably end up doing VZEROUPPER somewhere.

And of course, if the whole thing is built with AVX support so the intrinsics are turned into VEX coded instructions, there is no problem either.

I've tried profiling the code with perf on linux and most of the runtime usually lands on 1-2 instructions but not always the same ones depending on which version of the code I profile (gcc, clang, intel). Shortening the function appears to make the performance difference gradually go away so it looks like several instructions are causing the problem.

EDIT: Here's a pure assembly version, for linux. Comments below.

    .text
    .p2align    4, 0x90
    .globl _start
_start:

    #vmovaps %ymm0, %ymm1  # This makes SSE code crawl.
    #vzeroupper            # This makes it fast again.

    movl    $100000000, %ebp
    .p2align    4, 0x90
.LBB0_1:
    xorpd   %xmm0, %xmm0
    xorpd   %xmm1, %xmm1
    xorpd   %xmm2, %xmm2

    movq    %xmm2, %xmm4
    xorpd   %xmm3, %xmm3
    movapd  %xmm1, %xmm5
    addsd   %xmm0, %xmm5
    addsd   %xmm2, %xmm5
    mulsd   %xmm1, %xmm1
    movapd  %xmm0, %xmm5
    mulsd   %xmm2, %xmm5
    addsd   %xmm1, %xmm5
    movq    %xmm5, %xmm1
    sqrtsd  %xmm1, %xmm1
    movq    %xmm1, %xmm1
    unpcklpd    %xmm1, %xmm0
    unpcklpd    %xmm4, %xmm1

    decl    %ebp
    jne    .LBB0_1

    mov $0x1, %eax
    int $0x80

Ok, so as suspected in comments, using VEX coded instructions causes the slowdown. Using VZEROUPPER clears it up. But that still does not explain why.

As I understand it, not using VZEROUPPER is supposed to involve a cost to transition to old SSE instructions but not a permanent slowdown of them. Especially not such a large one. Taking loop overhead into account, the ratio is at least 10x, perhaps more.

I have tried messing with the assembly a little and float instructions are just as bad as double ones. I could not pinpoint the problem to a single instruction either.

Maraca answered 23/12, 2016 at 15:9 Comment(18)

What compiler flags are you using? Perhaps the (hidden) process initialization is using some VEX instructions which is putting you in a mixed state from which you never exit. You could try copy/pasting the assembly and building it as a pure assembly program with _start, so that you avoid any of the compiler-inserted init code and see if it exhibits the same issue. – Melanosis 23/12, 2016 at 20:35

@Melanosis I use -O3 -ffast-math but the effect is present even with -O0. I will try with pure assembly. You might be on to something as I just found out on Agner's blog that there have been some large internal changes to how VEX transitions are handled... will need to look into that. – Maraca 23/12, 2016 at 20:53

Yes - but the oddness is that on Skylake the penalties are supposed to be greatly reduced for running in the "bad" mixed modes - but I didn't re-read it yet to refresh my memory on the details. – Melanosis 24/12, 2016 at 19:7

@Melanosis that was my understanding as well but it's clearly not that simple. It seems to also have some kind of register specific tracking. In the asm version I added, if I introduce vxorps in the loop to clear ymm0-ymm5, it becomes faster but still not as fast as the VEX-free version. – Maraca 27/12, 2016 at 16:56

What did you find when you ran it as pure assembly? Also, have you checked the contents of ymm* on entry to your routine? Curious if the upper bits are zero or something else. – Melanosis 27/12, 2016 at 16:58

@Melanosis Everything is zero on entry. The only real finding is that somewhere in the code before main() there must have been a VEX instruction which I now have to insert myself to get the slowdown. This is scary as it implies that any piece of code in any library can permanently slow down my entire application long after I'm done calling it. At least until we move to AVX. – Maraca 27/12, 2016 at 17:18

It's really weird it causes a (huge) and permanent slowdown. The pre-Skylake slowdowns were all about VEX transitions, so if you had code that didn't have any transitions itself, at most you could get a small(ish) fixed slowdown once when you started your loop (since a transition could occur), but then you shouldn't have further slowdowns after that. At least as far as I understood. – Melanosis 27/12, 2016 at 17:21

I finally got off my ass and read the doc. The penalty is discussed pretty clearly in Intel's manual and while different for Skylake, it is not necessarily better - and in your case it is much worse. I added the details in an answer. – Melanosis 27/12, 2016 at 17:54

I don't understand why you get an AVX instruction unless you compiled main.cpp with AVX and slow_function.cpp without, e.g. g++ -c -O3 slow_function.cpp and then g++ -O3 -mavx slow_function.o main.cpp – Aromaticity 30/12, 2016 at 9:12

@Zboson the AVX instruction is in the dynamic linker but I don't know why they put it there either. See my comment on BeeOnRope's answer. It's a fairly ugly problem. – Maraca 30/12, 2016 at 13:57

That's strange. Awesome that you figured that out. What do you mean, though, by a pure assembly version? You mean you wrote the assembly yourself? – Aromaticity 30/12, 2016 at 14:31

@Zboson - since the OP already had the assembly output (from objdump or whatever), the assumption was that he could just copy it to an assembly file and compile it (along with adding a _start symbol, etc, to make it run standalone). – Melanosis 30/12, 2016 at 17:36

@Zboson cut & pasted from the dump, reworked to avoid using any lib and trimmed down to keep only the essential part (removed function call, constants and branches which did not matter, etc). – Maraca 2/1, 2017 at 4:13

@Melanosis I think I got it now, that answers my next question. Presumably the OP used this assembly to test on other systems because the SNB and IVB systems might not have had the same /lib64/ld-linux-x86-64.so.2. – Aromaticity 2/1, 2017 at 9:22

Can you say exactly what SNB and IVB you tested on? More importantly, can you test on Haswell? It would be very interesting to see what you find on Haswell. – Aromaticity 2/1, 2017 at 9:36

@zboson I think it is more like the dynamic linker method that's called when you compile as C/C++ isn't called when you compile as pure assembly. There's always a bunch of hidden runtime stuff that gets invoked when you are using C/C++ and I think this is included in that. – Melanosis 2/1, 2017 at 14:2

@Olivier, can you tell me how you figured out the problem was at _dl_runtime_resolve_avx(), /lib64/ld-linux-x86-64.so.2. The reason I ask is I think I see this problem somewhere else but I want to narrow it down. – Aromaticity 19/2, 2017 at 20:54

@Zboson I think at some point my test case was slow with a printf() in main() before the test loop and fast without. I traced in gdb with stepi and quickly landed in that function full of AVX code and no vzeroupper. A few searches later, I had found the glibc issue which clearly said there was a problem there. I have since found that memset() is equally problematic but don't know why (the code looks ok). – Maraca 20/2, 2017 at 19:58

You are experiencing a penalty for "mixing" non-VEX SSE and VEX-encoded instructions - even though your entire visible application doesn't obviously use any AVX instructions!

Prior to Skylake, this type of penalty was only a one-time transition penalty, when switching from code that used VEX to code that didn't, or vice-versa. That is, you never paid an ongoing penalty for whatever happened in the past unless you were actively mixing VEX and non-VEX. In Skylake, however, there is a state where non-VEX SSE instructions pay a high ongoing execution penalty, even without further mixing.

Straight from the horse's mouth, here's Figure 11-1 ¹ - the old (pre-Skylake) transition diagram:

As you can see, all of the penalties (red arrows) bring you to a new state, at which point there is no longer a penalty for repeating that action. For example, if you get to the dirty upper state by executing some 256-bit AVX, and you then execute legacy SSE, you pay a one-time penalty to transition to the preserved non-INIT upper state, but you don't pay any penalties after that.

In Skylake, everything is different per Figure 11-2:

There are fewer penalties overall, but critically for your case, one of them is a self-loop: the penalty for executing a legacy SSE (Penalty A in the Figure 11-2) instruction in the dirty upper state keeps you in that state. That's what happens to you - any AVX instruction puts you in the dirty upper state, which slows all further SSE execution down.

Here's what Intel says (section 11.3) about the new penalty:

The Skylake microarchitecture implements a different state machine than prior generations to manage the YMM state transition associated with mixing SSE and AVX instructions. It no longer saves the entire upper YMM state when executing an SSE instruction when in “Modified and Unsaved” state, but saves the upper bits of individual register. As a result, mixing SSE and AVX instructions will experience a penalty associated with partial register dependency of the destination registers being used and additional blend operation on the upper bits of the destination registers.

So the penalty is apparently quite large - it has to blend the top bits all the time to preserve them, and it also makes instructions which are apparently independent become dependent, since there is a dependency on the hidden upper bits. For example xorpd xmm0, xmm0 no longer breaks the dependence on the previous value of xmm0, since the result is actually dependent on the hidden upper bits from ymm0 which aren't cleared by the xorpd. That latter effect is probably what kills your performance since you'll now have very long dependency chains that you wouldn't expect from the usual analysis.

This is among the worst types of performance pitfall: where the behavior/best practice for the prior architecture is essentially the opposite of that for the current architecture. Presumably the hardware architects had a good reason for making the change, but it does just add another "gotcha" to the list of subtle performance issues.

I would file a bug against the compiler or runtime that inserted that AVX instruction and didn't follow up with a VZEROUPPER.

Update: Per the OP's comment below, the offending (AVX) code was inserted by the runtime linker ld and a bug already exists.

¹ From Intel's optimization manual.

Melanosis answered 27/12, 2016 at 17:53 Comment(10)

Great! I got confused by first reading an older version of the manual without the Skylake comments and then the newer version not far enough. Doesn't help that the newer version has fewer pages than the old one. I will definitely track down the offending lib. – Maraca 27/12, 2016 at 18:46

The offending code is in _dl_runtime_resolve_avx(), /lib64/ld-linux-x86-64.so.2 . Seems like this should sort itself out with the next release of glibc: sourceware.org/bugzilla/show_bug.cgi?id=20495 – Maraca 27/12, 2016 at 19:10

Interestingly enough, VZEROUPPER is not recommended on KNL, but the situation is being debated: software.intel.com/en-us/forums/intel-isa-extensions/topic/… – Aromaticity 28/12, 2016 at 8:38

Why does the OP get an avx instruction in main.cpp and not in slow_function.cpp unless he compiled main.cpp with AVX and slow_function.cpp without? GCC should not insert AVX instruction unless it is told to because it would generate SIGILL on systems without AVX. – Aromaticity 30/12, 2016 at 9:20

@Zboson - I didn't see anywhere the OP was compiling the two files with different AVX flags? He said that he doesn't get the issue if he enables AVX compilation, which makes sense since the only penalties on Skylake are for legacy SSE execution (Penalty A). Furthermore, the instructions aren't inserted by the compiler (you won't find them by inspecting the binary), but instead occur at runtime due to some method which is called inside the runtime linker, as Olivier mentions above (I added the link also to the end of my answer). – Melanosis 30/12, 2016 at 17:39

@Melanosis So, if I will mix ymm and xmm register, I will encounter the penalty? Suppose I need to move 48 bytes of data. If I use the following sequence of instructions: "vmovdqu ymm0, ymmword ptr [rcx]; movups xmm1, xmmword ptr [rcx+32]; vmovdqu ymmword ptr [rdx], ymm0; movups xmmword ptr [rdx+32], xmm1", will I encounter the penalty? – Reannareap 9/5, 2017 at 19:12

@MaximMasiutin something is wrong with the formatting of your code, so I can't parse it - but, briefly, the problem isn't with mixing ymm and xmm registers, it is about mixing non-VEX and VEX-encoded instructions. Anything with ymm is VEX-encoded, almost anything with 3 arguments is VEX-encoded, and I think all the vector instructions starting with v are VEX-encoded. So in your example if you change it to use vmovdqu xmm, ... you should be OK since that form is VEX-encoded. – Melanosis 9/5, 2017 at 19:48

@Melanosis As far as I understood from the answer, the VEX to non-VEX transition is performed each time the upper part of the ymm registers is seen as non-zero, and the state is then xsaved to some memory. How does the CPU know which memory location to use for such state saving? – Revere 10/1, 2020 at 11:12

XSAVE takes a memory argument which is how it knows where to save. XSAVE doesn't play much into the penalties, although I didn't understand what you meant. XSAVE preserves the dirty/clean state of the upper bits. – Melanosis 10/1, 2020 at 13:29

@Melanosis is the penalty for reading still present if you mix VEX + read only SSE? In particular there are a few cases I want to save the code size with movups %xmm0, (%mem) as opposed to vmovdqu %xmm0, (%mem) after initializing %ymm0. (Penalty aside from a false dependency on hi128 of ymm0). – Belshazzar 26/9, 2021 at 16:50

I just made some experiments (on a Haswell). The transition between clean and dirty states is not expensive, but the dirty state makes every non-VEX vector operation dependent on the previous value of the destination register. In your case, for example movapd %xmm1, %xmm5 will have a false dependency on ymm5 which prevents out-of-order execution. This explains why vzeroupper is needed after AVX code.

Gabby answered 28/12, 2016 at 9:52 Comment(10)

You are one of the heroes of this site's [x86] tag. Avid followers of the tag quote you extensively here, since you're one of the rare sources on microarchitectural details of x86 processors. Keep up your good work! – Herat 28/12, 2016 at 9:59

Cool, but the new behavior described above (second diagram with hidden register dependencies) is apparently only for Skylake and newer? On Haswell it is supposed to save the upper halves away somewhere so that subsequent non-VEX operations are fast. – Melanosis 28/12, 2016 at 14:14

I don't have access to test a Skylake at the moment. – Gabby 29/12, 2016 at 14:33

@BeeOnRope, The OP said he did not have the problem on Sandy Bridge and Ivy Bridge, only on Skylake. The OP did not test Haswell. But Agner sees a problem on Haswell. So I am a bit confused because I would expect Haswell to act like Sandy Bridge and Ivy Bridge in this case. – Aromaticity 30/12, 2016 at 9:17

Yes, I'm confused too. – Melanosis 30/12, 2016 at 14:21

Is it possible that Haswell actually behaves like Skylake, but nobody described the behaviour until SKL came out? Or that it sometimes behaves this way? Any chance it's only a factor during the warm-up period before the upper halves of the 256b execution units power up? Maybe the state-transition behaviour is different during the period where AVX-256 instructions are slow? I just got a SKL desktop, and I have access to a Haswell laptop, so I may find some time to test this. Unfortunately I can't compare with IvB or SnB, which I assume do work the way you and Intel describe it. – Duffie 31/12, 2016 at 16:41

Peter, the Haswell has a cost of 70 clock cycles for every state transition when VEX and non-VEX code is mixed, just like Sandy and Ivy Bridge. Skylake does not have any delay on state transitions, but I think it has the same false dependence as I described for Haswell. – Gabby 1/1, 2017 at 19:32

@PeterCordes, it's possible Intel's documentation is wrong. It would not be the first time. This should be easy to test. Run the OP's assembly on a SNB or IVB system and HSW system and compare. Maybe the OP has access to a HSW system. – Aromaticity 2/1, 2017 at 9:35

@Melanosis - I have made a question stackoverflow.com/questions/43879935/… -- thanks. – Reannareap 9/5, 2017 at 21:11

Just as a fun fact (going to bed now, just digging, ping me if anyone cares) - it seems Skylake with/without the microcode patch to disable the loop stream decoder makes a difference (SOMEHOW) too - you have no idea how painful working out the cause has been, but I can now get a result reliably so... it is that. – Airsickness 6/2, 2019 at 22:23
