Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says

Asked 29/9, 2020 at 9:25 Answered 29/9, 2020 at 10:2

performance x86-64 intel cpu-architecture avx

In the Intel Intrinsics Guide, vmulpd and vfmadd213pd have a latency of 5 cycles, and vaddpd has a latency of 3.

I wrote some test code, but all of the results are 1 cycle slower.

Here is my test code:

.CODE
test_latency PROC
    vxorpd  ymm0, ymm0, ymm0
    vxorpd  ymm1, ymm1, ymm1

loop_start:
    vmulpd  ymm0, ymm0, ymm1
    vmulpd  ymm0, ymm0, ymm1
    vmulpd  ymm0, ymm0, ymm1
    vmulpd  ymm0, ymm0, ymm1
    sub     rcx, 4
    jg      loop_start

    ret
test_latency ENDP
END

#include <stdio.h>
#include <omp.h>
#include <stdint.h>
#include <windows.h>

extern "C" void test_latency(int64_t n);

int main()
{
    SetThreadAffinityMask(GetCurrentThread(), 1);   // Pin the thread to core 0 so it is not migrated between cores
    
    int64_t n = (int64_t)3e9;
    double start = omp_get_wtime();
    test_latency(n);
    double end = omp_get_wtime();
    double time = end - start;
    
    double freq = 3.3e9;    // My CPU frequency
    double latency = freq * time / n;
    printf("latency = %f\n", latency);
}

My CPU is a Core i5 4590, with its frequency locked at 3.3GHz. The output is: latency = 6.102484.

Strangely enough, if I change vmulpd ymm0, ymm0, ymm1 to vmulpd ymm0, ymm0, ymm0, the output becomes: latency = 5.093745.

Is there an explanation? Is my test code problematic?

MORE RESULTS

Results on Core i5 4590 @3.3GHz (measured latency in cycles)
vmulpd  ymm0, ymm0, ymm1       6.056094
vmulpd  ymm0, ymm0, ymm0       5.054515
vaddpd  ymm0, ymm0, ymm1       4.038062
vaddpd  ymm0, ymm0, ymm0       3.029360
vfmadd213pd ymm0, ymm0, ymm1   6.052501
vfmadd213pd ymm0, ymm1, ymm0   6.053163
vfmadd213pd ymm0, ymm1, ymm1   6.055160
vfmadd213pd ymm0, ymm0, ymm0   5.041532

(without vzeroupper)
vmulpd  xmm0, xmm0, xmm1       6.050404
vmulpd  xmm0, xmm0, xmm0       5.042191
vaddpd  xmm0, xmm0, xmm1       4.044518
vaddpd  xmm0, xmm0, xmm0       3.024233
vfmadd213pd xmm0, xmm0, xmm1   6.047219
vfmadd213pd xmm0, xmm1, xmm0   6.046022
vfmadd213pd xmm0, xmm1, xmm1   6.052805
vfmadd213pd xmm0, xmm0, xmm0   5.046843

(with vzeroupper)
vmulpd  xmm0, xmm0, xmm1       5.062350
vmulpd  xmm0, xmm0, xmm0       5.039132
vaddpd  xmm0, xmm0, xmm1       3.019815
vaddpd  xmm0, xmm0, xmm0       3.026791
vfmadd213pd xmm0, xmm0, xmm1   5.043748
vfmadd213pd xmm0, xmm1, xmm0   5.051424
vfmadd213pd xmm0, xmm1, xmm1   5.049090
vfmadd213pd xmm0, xmm0, xmm0   5.051947

(without vzeroupper)
mulpd   xmm0, xmm1             5.047671
mulpd   xmm0, xmm0             5.042176
addpd   xmm0, xmm1             3.019492
addpd   xmm0, xmm0             3.028642

(with vzeroupper)
mulpd   xmm0, xmm1             5.046220
mulpd   xmm0, xmm0             5.057278
addpd   xmm0, xmm1             3.025577
addpd   xmm0, xmm0             3.031238

MY GUESS

I changed test_latency like this:

.CODE
test_latency PROC
    vxorpd  ymm0, ymm0, ymm0
    vxorpd  ymm1, ymm1, ymm1

loop_start:
    vaddpd  ymm1, ymm1, ymm1  ; added this line
    vmulpd  ymm0, ymm0, ymm1
    vmulpd  ymm0, ymm0, ymm1
    vmulpd  ymm0, ymm0, ymm1
    vmulpd  ymm0, ymm0, ymm1
    sub     rcx, 4
    jg      loop_start

    ret
test_latency ENDP
END

With this change I finally get a result of 5 cycles. Other instructions, used in place of the added line, achieve the same effect:

vmovupd     ymm1, ymm0
vmovupd     ymm1, [mem]
vmovdqu     ymm1, [mem]
vxorpd      ymm1, ymm1, ymm1
vpxor       ymm1, ymm1, ymm1
vmulpd      ymm1, ymm1, ymm1
vshufpd     ymm1, ymm1, ymm1, 0

But these instructions do not:

vmovupd     ymm1, ymm2  ; suppose ymm2 is zeroed
vpaddq      ymm1, ymm1, ymm1
vpmulld     ymm1, ymm1, ymm1
vpand       ymm1, ymm1, ymm1

In the case of ymm instructions, my guess is that the conditions for avoiding the extra cycle are:

All inputs are from the same domain.
All inputs are fresh enough (a move from an old value doesn't work).

As for VEX xmm instructions, the condition is less clear. It seems related to the state of the upper half, but I don't know which of these leaves it cleaner:

vxorpd      ymm1, ymm1, ymm1
vxorpd      xmm1, xmm1, xmm1
vzeroupper

This is a hard question for me.

Bonds answered 29/9, 2020 at 9:25 Comment(3)

Your further tests all show that if you read a register without writing it, its "extra latency" property can remain for the whole loop, affecting the dependency chain through the other operand. (And also that vzeroupper can clear this property on Haswell. It doesn't on Skylake.) – Leckie 29/9, 2020 at 13:42

@PeterCordes Actually vzeroupper can only change the latency of vmulpd xmm0, xmm0, xmm1; it makes no change on vmulpd ymm0, ymm0, ymm1. So I am still curious. – Bonds 29/9, 2020 at 14:4

Interesting. On Skylake, vzeroupper doesn't fix xmm either, still slow if the read-only register is polluted. But Skylake uses a different SSE/AVX transition strategy than Haswell so it's very plausible that vzeroupper has different implementation details that lead to this being different as well. – Leckie 29/9, 2020 at 14:11

I've been meaning to write something up about this for a few years now, since noticing it on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Intel-Performance-Quirks#after-an-integer-to-fp-bypass-latency-can-be-increased-indefinitely

Bypass-delay latency is "sticky": an integer SIMD instruction can "infect" all future instructions that read that value, even long after the instruction is done. I'm surprised that "infection" survived across a zeroing idiom, especially an FP zeroing instruction like vxorpd, but I can reproduce that effect on SKL (i7-6700k, counting clock cycles directly in a test loop with perf on Linux instead of messing around with time and frequency.)

(On Skylake, it seems 3 or more vxorpd zeroing instructions in a row before the loop happen to work, removing the extra bypass latency. AFAIK, xor-zeroing is always eliminated, unlike mov-elimination which sometimes fails. But perhaps the difference is just in creating a gap between issue of the vpaddb into the back-end and the first vmulpd; in my test loop I "dirty" / pollute the register right before the loop.)

(update: trying my test code again now, even one vxorps seems to clean the register. Perhaps a microcode update changed something.)

Presumably some previous use of YMM1 in the caller involved an integer instruction. (TODO: investigate how common it is for a register to get into this state, and when it can survive xor-zeroing! I expected it to only happen when constructing an FP bit-pattern with integer instructions, including stuff like vpcmpeqd ymm1,ymm1,ymm1 to make a -NaN (all-one bits).)

On Skylake I can fix it by doing vaddpd ymm1, ymm1, ymm1 before the loop, after the xor-zeroing. (Or before; it might not matter! That might be more optimal, putting it at the end of the previous dep chain instead of the start of this.)

As I wrote in a comment on another question:

xsave/rstor can fix the issue where writing a register with a SIMD-integer instruction like paddd creates extra latency indefinitely for reading it with an FP instruction, affecting latency from both inputs. e.g. paddd xmm0, xmm0 then in a loop addps xmm1, xmm0 has 5c latency instead of the usual 4, until the next save/restore.

It's bypass latency but still happens even if you don't touch the register until after the paddd has definitely retired (by padding with >ROB uops) before the loop.

Test program:

; taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread -r1 ./bypass-latency

default rel
global _start
_start:
    vmovaps   xmm1, [one]        ; FP load into ymm1 (zeroing the upper lane)
    vpaddd    ymm1, ymm1,ymm0   ; ymm1 written in the ivec domain
    ;vxorps    ymm1, ymm1,ymm1   ; In 2017, ymm1 still makes vaddps slow (5c) after this
    ; but I can't reproduce that now with updated microcode.
    vxorps    ymm0, ymm0, ymm0   ; zeroing-idiom on ymm0
    mov       rcx, 50000000

align 32  ; doesn't help or hurt, as expected since the bottleneck isn't frontend
.loop:
    vaddps  ymm0, ymm0,ymm1
    vaddps  ymm0, ymm0,ymm1
    dec     rcx
    jnz .loop

    xor edi,edi
    mov eax,231
    syscall      ; exit_group(0)

section .rodata
align 16
one:            times 4 dd 1.0

Perf results for a static executable on i7-6700k:

 Performance counter stats for './foo' (4 runs):

            129.01 msec task-clock                #    0.998 CPUs utilized            ( +-  0.51% )
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 2      page-faults               #    0.016 K/sec                  
       500,053,798      cycles                    #    3.876 GHz                      ( +-  0.00% )
        50,000,042      branches                  #  387.576 M/sec                    ( +-  0.00% )
       200,000,059      instructions              #    0.40  insn per cycle           ( +-  0.00% )
       150,020,084      uops_issued.any           # 1162.883 M/sec                    ( +-  0.00% )
       150,014,866      uops_executed.thread      # 1162.842 M/sec                    ( +-  0.00% )

          0.129244 +- 0.000670 seconds time elapsed  ( +-  0.52% )

500M cycles for 50M iterations = 10 cycle loop-carried dependency for 2x vaddps, or 5 each.

Leckie answered 29/9, 2020 at 10:2 Comment(6)

I tried to add vaddpd ymm1, ymm1, ymm1, both before or after vxorpd, but the latency of vmulpd ymm0, ymm0, ymm1 is still 6. – Bonds 29/9, 2020 at 11:8

@kevinjwz: I unfortunately don't have a working Haswell system to test on, but I can repro this on Skylake. vpaddb ymm1, ymm1, ymm1 before the loop "infects" the register, making it slow. vaddpd ymm1, ymm1, ymm1 right after that makes it fast again (4 cycles per vmulpd; Skylake has 4c latency for mul/add/FMA, dropping the 3c latency dedicated FP add unit that Haswell had). And I can confirm that vxorpd-zeroing after vpaddb does not clean the register!! (An FP shuffle does, though, like vunpcklpd. Or 3 or more repeats of xor-zeroing. Very mysterious.) – Leckie 29/9, 2020 at 11:16

re: "On Skylake, it seems 3 or more vxorpd zeroing instructions in a row before the loop happen to work, removing the extra bypass latency": have you tested with 1x vxorpd + nop fill to see if it's really just separating the decode groups? – Razzledazzle 16/4, 2021 at 15:26

@Noah: No, I haven't yet. Can you repro the effect on your Whiskey Lake machine? (And/or Ice Lake?) – Leckie 16/4, 2021 at 18:43

Can you post the benchmark code somewhere and I can try. – Razzledazzle 16/4, 2021 at 18:45

@Noah: Oh, forgot I'd been lazy and didn't include it in the question :P Updated. Also, it seems 1x vxorps clears the problem now, at least in this specific case. That contradicts the note I wrote in comments on it at the time, so this might well be a real change. Or an alignment effect of _start being at a different address with current ld? – Leckie 16/4, 2021 at 19:7
