Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)
Asked Answered
P

1

61

I'm a newbie at instruction optimization.

I did a simple analysis on a simple function dotp which is used to get the dot product of two float arrays.

The C code is as follows:

float dotp(               
    const float  x[],   
    const float  y[],     
    const short  n      
)
{
    short i;
    float suma;
    suma = 0.0f;

    for(i=0; i<n; i++) 
    {    
        suma += x[i] * y[i];
    } 
    return suma;
}

I use the test frame provided by Agner Fog on the web testp.

The arrays which are used in this case are aligned:

int n = 2048;
float* z2 = (float*)_mm_malloc(sizeof(float)*n, 64);
char *mem = (char*)_mm_malloc(1<<18,4096);
char *a = mem;
char *b = a+n*sizeof(float);
char *c = b+n*sizeof(float);

float *x = (float*)a;
float *y = (float*)b;
float *z = (float*)c;

Then I call the function dotp, n=2048, repeat=100000:

 for (i = 0; i < repeat; i++)
 {
     sum = dotp(x,y,n);
 }

I compile it with gcc 4.8.3, with the compile option -O3.

I compile this application on a computer which does not support FMA instructions, so you can see there are only SSE instructions.

The assembly code:

.L13:
        movss   xmm1, DWORD PTR [rdi+rax*4]  
        mulss   xmm1, DWORD PTR [rsi+rax*4]   
        add     rax, 1                       
        cmp     cx, ax
        addss   xmm0, xmm1
        jg      .L13

I do some analysis:

          μops-fused  la    0    1    2    3    4    5    6    7    
movss       1          3             0.5  0.5
mulss       1          5   0.5  0.5  0.5  0.5
add         1          1   0.25 0.25               0.25   0.25 
cmp         1          1   0.25 0.25               0.25   0.25
addss       1          3         1              
jg          1          1                                   1                                                   -----------------------------------------------------------------------------
total       6          5    1    2     1     1      0.5   1.5

After running, we get the result:

   Clock  |  Core cyc |  Instruct |   BrTaken | uop p0   | uop p1      
--------------------------------------------------------------------
542177906 |609942404  |1230100389 |205000027  |261069369 |205511063 
--------------------------------------------------------------------  
   2.64   |  2.97     | 6.00      |     1     | 1.27     |  1.00   

   uop p2   |    uop p3   |  uop p4 |    uop p5  |  uop p6    |  uop p7       
-----------------------------------------------------------------------   
 205185258  |  205188997  | 100833  |  245370353 |  313581694 |  844  
-----------------------------------------------------------------------          
    1.00    |   1.00      | 0.00    |   1.19     |  1.52      |  0.00           

The second line is the value read from the Intel registers; the third line is divided by the branch number, "BrTaken".

So we can see, in the loop there are 6 instructions, 7 uops, in agreement with the analysis.

The numbers of uops run in port0 port1 port 5 port6 are similar to what the analysis says. I think maybe the uops scheduler does this, it may try to balance loads on the ports, am I right?

I absolutely do not understand know why there are only about 3 cycles per loop. According to Agner's instruction table, the latency of instruction mulss is 5, and there are dependencies between the loops, so as far as I see it should take at least 5 cycles per loop.

Could anyone shed some insight?

==================================================================

I tried to write an optimized version of this function in nasm, unrolling the loop by a factor of 8 and using the vfmadd231ps instruction:

.L2:
    vmovaps         ymm1, [rdi+rax]             
    vfmadd231ps     ymm0, ymm1, [rsi+rax]       

    vmovaps         ymm2, [rdi+rax+32]          
    vfmadd231ps     ymm3, ymm2, [rsi+rax+32]    

    vmovaps         ymm4, [rdi+rax+64]          
    vfmadd231ps     ymm5, ymm4, [rsi+rax+64]    

    vmovaps         ymm6, [rdi+rax+96]          
    vfmadd231ps     ymm7, ymm6, [rsi+rax+96]   

    vmovaps         ymm8, [rdi+rax+128]         
    vfmadd231ps     ymm9, ymm8, [rsi+rax+128]  

    vmovaps         ymm10, [rdi+rax+160]               
    vfmadd231ps     ymm11, ymm10, [rsi+rax+160] 

    vmovaps         ymm12, [rdi+rax+192]                
    vfmadd231ps     ymm13, ymm12, [rsi+rax+192] 

    vmovaps         ymm14, [rdi+rax+224]                
    vfmadd231ps     ymm15, ymm14, [rsi+rax+224] 
    add             rax, 256                    
    jne             .L2

The result:

  Clock   | Core cyc |  Instruct  |  BrTaken  |  uop p0   |   uop p1  
------------------------------------------------------------------------
 24371315 |  27477805|   59400061 |   3200001 |  14679543 |  11011601  
------------------------------------------------------------------------
    7.62  |     8.59 |  18.56     |     1     | 4.59      |     3.44


   uop p2  | uop p3  |  uop p4  |   uop p5  |   uop p6   |  uop p7  
-------------------------------------------------------------------------
 25960380  |26000252 |  47      |  537      |   3301043  |  10          
------------------------------------------------------------------------------
    8.11   |8.13     |  0.00    |   0.00    |   1.03     |  0.00        

So we can see the L1 data cache reach 2*256bit/8.59, it is very near to the peak 2*256/8, the usage is about 93%, the FMA unit only used 8/8.59, the peak is 2*8/8, the usage is 47%.

So I think I've reached the L1D bottleneck as Peter Cordes expects.

==================================================================

Special thanks to Boann, fix so many grammatical errors in my question.

=================================================================

From Peter's reply, I get it that only "read and written" register would be the dependence, "writer-only" registers would not be the dependence.

So I try to reduce the registers used in loop, and I try to unrolling by 5, if everything is ok, I should meet the same bottleneck, L1D.

.L2:
    vmovaps         ymm0, [rdi+rax]    
    vfmadd231ps     ymm1, ymm0, [rsi+rax]    

    vmovaps         ymm0, [rdi+rax+32]    
    vfmadd231ps     ymm2, ymm0, [rsi+rax+32]   

    vmovaps         ymm0, [rdi+rax+64]    
    vfmadd231ps     ymm3, ymm0, [rsi+rax+64]   

    vmovaps         ymm0, [rdi+rax+96]    
    vfmadd231ps     ymm4, ymm0, [rsi+rax+96]   

    vmovaps         ymm0, [rdi+rax+128]    
    vfmadd231ps     ymm5, ymm0, [rsi+rax+128]   

    add             rax, 160                    ;n = n+32
    jne             .L2 

The result:

    Clock  | Core cyc  | Instruct  |  BrTaken |    uop p0  |   uop p1  
------------------------------------------------------------------------  
  25332590 |  28547345 |  63700051 |  5100001 |   14951738 |  10549694   
------------------------------------------------------------------------
    4.97   |  5.60     | 12.49     |    1     |     2.93   |    2.07    

    uop p2  |uop p3   | uop p4 | uop p5 |uop p6   |  uop p7 
------------------------------------------------------------------------------  
  25900132  |25900132 |   50   |  683   | 5400909 |     9  
-------------------------------------------------------------------------------     
    5.08    |5.08     |  0.00  |  0.00  |1.06     |     0.00    

We can see 5/5.60 = 89.45%, it is a little smaller than urolling by 8, is there something wrong?

=================================================================

I try to unroll loop by 6, 7 and 15, to see the result. I also unroll by 5 and 8 again, to double confirm the result.

The result is as follow, we can see this time the result is much better than before.

Although the result is not stable, the unrolling factor is bigger and the result is better.

            | L1D bandwidth     |  CodeMiss | L1D Miss | L2 Miss 
----------------------------------------------------------------------------
  unroll5   | 91.86% ~ 91.94%   |   3~33    | 272~888  | 17~223
--------------------------------------------------------------------------
  unroll6   | 92.93% ~ 93.00%   |   4~30    | 481~1432 | 26~213
--------------------------------------------------------------------------
  unroll7   | 92.29% ~ 92.65%   |   5~28    | 336~1736 | 14~257
--------------------------------------------------------------------------
  unroll8   | 95.10% ~ 97.68%   |   4~23    | 363~780  | 42~132
--------------------------------------------------------------------------
  unroll15  | 97.95% ~ 98.16%   |   5~28    | 651~1295 | 29~68

=====================================================================

I try to compile the function with gcc 7.1 in the web "https://gcc.godbolt.org"

The compile option is "-O3 -march=haswell -mtune=intel", that is similar to gcc 4.8.3.

.L3:
        vmovss  xmm1, DWORD PTR [rdi+rax]
        vfmadd231ss     xmm0, xmm1, DWORD PTR [rsi+rax]
        add     rax, 4
        cmp     rdx, rax
        jne     .L3
        ret
Perfumery answered 15/7, 2017 at 1:14 Comment(9)
Upvote for the research effort.Vinita
There's two execution units that can perform FP multiplies on Haswell so two MULSS instructions can run in parallel. There's no dependency between MULSS instructions in each loop iteration.Semantic
@Ross Ridge, yes,I get it with the reply of Peter Cordes, the dependence is xmm0, so addss is the bottleneck.Perfumery
Yes, nice job on the unrolled FMA loop. I added a section about that in my answer. You can shrink the code-size and number of fused-domain uops, but you probably can't get much closer to saturating p2/p3 uop throughput which limits you to two L1D loads per cycle feeding an average of one FMA per cycle. I updated my answer to make it clearer that reusing registers is fine with write-only instructions. Your FMA loop uses a lot of architectural registers as load destinations for no benefit. (But only a code-size downside).Robedechambre
BTW, I'd recommend getting a newer compiler. gcc4.8 is pretty old. gcc6.4 or gcc7.1 are current, and usually do a better job, especially in AVX/AVX2 auto-vectorization. But also with C intrinsics, you can get more efficient code from newer gcc. When the compiler does something weird or cool, it can be interesting to compare it with clang or other gcc versions.Robedechambre
@PeterCordes, the server, on which I compile it, is a redhat 6 or 7, sorry I forget the excat version, I will go to my company and check it tomorrow, gcc 4.8.3 is provided with the release version, I'm afraid if I upgrade gcc to 6.4 or 7.1, do I have to upgrade libc or binutil?Perfumery
@Forward: IDK, I luckily haven't had to do much with old software. If it's inconvenient and possibly risky to upgrade, then maybe test on your desktop whether a new gcc makes your program run faster. If not, then keep using gcc4.8 until you want to use a C++14 or C++1z feature it doesn't have, or something. But if it does, then it's worth looking into. Even an upgrade to the most recent gcc4.x (4.9.4 or something) would probably help, and gcc 5.4 would also be an improvement. I don't think you should have to upgrade binutils or libc, if you can find a RHEL7 RPM of new gcc or build from src.Robedechambre
Generally you want a compiler newer than the hardware, so they've had time to update the tuning options for -march=native. And fix some makes-slow-code problems that might only be noticed after AVX2 has been around for a while. I think a lot of people use old compilers with ok results, though. Maybe I make too much of a big deal about it, but when I look at compiler asm output, newer gcc often does better. Often in ways that won't really matter overall, though.Robedechambre
-march=haswell implies -mtune=haswell. Adding -mtune=intel makes it more generic. Either way, it won't auto-vectorize unless you use -ffast-math, because FP math is not associative, and auto-vectorization changes the order of summing the results. (BTW, when I suggested using a newer compiler, I wasn't meaning just for this tiny loop. No guarantee that current gcc will do anything better for the loop body, but it may avoid 32B-aligning the stack if it doesn't spill anything. I forget...)Robedechambre
R
45

Related:


Look at your loop again: movss xmm1, src has no dependency on the old value of xmm1, because its destination is write-only. Each iteration's mulss is independent. Out-of-order execution can and does exploit that instruction-level parallelism, so you definitely don't bottleneck on mulss latency.

Optional reading: In computer architecture terms: register renaming avoids the WAR anti-dependency data hazard of reusing the same architectural register. (Some pipelining + dependency-tracking schemes before register renaming didn't solve all the problems, so the field of computer architecture makes a big deal out of different kinds of data hazards.

Register renaming with Tomasulo's algorithm makes everything go away except the actual true dependencies (read after write), so any instruction where the destination is not also a source register has no interaction with the dependency chain involving the old value of that register. (Except for false dependencies, like popcnt on Intel CPUs, and writing only part of a register without clearing the rest (like mov al, 5 or sqrtss xmm2, xmm1). Related: Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?).


Back to your code:

.L13:
    movss   xmm1, DWORD PTR [rdi+rax*4]  
    mulss   xmm1, DWORD PTR [rsi+rax*4]   
    add     rax, 1                       
    cmp     cx, ax
    addss   xmm0, xmm1
    jg      .L13

The loop-carried dependencies (from one iteration to the next) are each:

  • xmm0, read and written by addss xmm0, xmm1, which has 3 cycle latency on Haswell.
  • rax, read and written by add rax, 1. 1c latency, so it's not the critical-path.

It looks like you measured the execution time / cycle-count correctly, because the loop bottlenecks on the 3c addss latency.

This is expected: the serial dependency in a dot product is the addition into a single sum (aka the reduction), not the multiplies between vector elements. (Unrolling with multiple sum accumulator variables / registers can hide that latency.)

That is by far the dominant bottleneck for this loop, despite various minor inefficiencies:


short i produced the silly cmp cx, ax, which takes an extra operand-size prefix. Luckily, gcc managed to avoid actually doing add ax, 1, because signed-overflow is Undefined Behaviour in C. So the optimizer can assume it doesn't happen. (update: integer promotion rules make it different for short, so UB doesn't come into it, but gcc can still legally optimize. Pretty wacky stuff.)

If you'd compiled with -mtune=intel, or better, -march=haswell, gcc would have put the cmp and jg next to each other where they could macro-fuse.

I'm not sure why you have a * in your table on the cmp and add instructions. (update: I was purely guessing that you were using a notation like IACA does, but apparently you weren't). Neither of them fuse. The only fusion happening is micro-fusion of mulss xmm1, [rsi+rax*4].

And since it's a 2-operand ALU instruction with a read-modify-write destination register, it stays macro-fused even in the ROB on Haswell. (Sandybridge would un-laminate it at issue time.) Note that vmulss xmm1, xmm1, [rsi+rax*4] would un-laminate on Haswell, too.

None of this really matters, since you just totally bottleneck on FP-add latency, much slower than any uop-throughput limits. Without -ffast-math, there's nothing compilers can do. With -ffast-math, clang will usually unroll with multiple accumulators, and it will auto-vectorize so they will be vector accumulators. So you can probably saturate Haswell's throughput limit of 1 vector or scalar FP add per clock, if you hit in L1D cache.

With FMA being 5c latency and 0.5c throughput on Haswell, you would need 10 accumulators to keep 10 FMAs in flight and max out FMA throughput by keeping p0/p1 saturated with FMAs. (Skylake reduced FMA latency to 4 cycles, and runs multiply, add, and FMA on the FMA units. So it actually has higher add latency than Haswell.)

(You're bottlenecked on loads, because you need two loads for every FMA. In other cases, you can actually gain add throughput by replacing some a vaddps instruction with an FMA with a multiplier of 1.0. This means more latency to hide, so it's best in a more complex algorithm where you have an add that's not on the critical path in the first place.)


Re: uops per port:

there are 1.19 uops per loop in the port 5, it is much more than expect 0.5, is it the matter about the uops dispatcher trying to make uops on every port same

Yes, something like that.

The uops are not assigned randomly, or somehow evenly distributed across every port they could run on. You assumed that the add and cmp uops would distribute evenly across p0156, but that's not the case.

The issue stage assigns uops to ports based on how many uops are already waiting for that port. Since addss can only run on p1 (and it's the loop bottleneck), there are usually a lot of p1 uops issued but not executed. So few other uops will ever be scheduled to port1. (This includes mulss: most of the mulss uops will end up scheduled to port 0.)

Taken-branches can only run on port 6. Port 5 doesn't have any uops in this loop that can only run there, so it ends up attracting a lot of the many-port uops.

The scheduler (which picks unfused-domain uops out of the Reservation Station) isn't smart enough to run critical-path-first, so this is assignment algorithm reduces resource-conflict latency (other uops stealing port1 on cycles when an addss could have run). It's also useful in cases where you bottleneck on the throughput of a given port.

Scheduling of already-assigned uops is normally oldest-ready first, as I understand it. This simple algorithm is hardly surprising, since it has to pick a uop with its inputs ready for each port from a 60-entry RS every clock cycle, without melting your CPU. The out-of-order machinery that finds and exploits the ILP is one of the significant power costs in a modern CPU, comparable to the execution units that do the actual work.

Related / more details: How are x86 uops scheduled, exactly?


More performance analysis stuff:

Other than cache misses / branch mispredicts, the three main possible bottlenecks for CPU-bound loops are:

  • dependency chains (like in this case)
  • front-end throughput (max of 4 fused-domain uops issued per clock on Haswell)
  • execution port bottlenecks, like if lots of uops need p0/p1, or p2/p3, like in your unrolled loop. Count unfused-domain uops for specific ports. Generally you can assuming best-case distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen some.

A loop body or short block of code can be approximately characterized by 3 things: fused-domain uop count, unfused-domain count of which execution units it can run on, and total critical-path latency assuming best-case scheduling for its critical path. (Or latencies from each of input A/B/C to the output...)

For example of doing all three to compare a few short sequences, see my answer on What is the efficient way to count set bits at a position or lower?

For short loops, modern CPUs have enough out-of-order execution resources (physical register file size so renaming doesn't run out of registers, ROB size) to have enough iterations of a loop in-flight to find all the parallelism. But as dependency chains within loops get longer, eventually they run out. See Measuring Reorder Buffer Capacity for some details on what happens when a CPU runs out of registers to rename onto.

See also lots of performance and reference links in the tag wiki.


Tuning your FMA loop:

Yes, dot-product on Haswell will bottleneck on L1D throughput at only half the throughput of the FMA units, since it takes two loads per multiply+add.

If you were doing B[i] = x * A[i] + y; or sum(A[i]^2), you could saturate FMA throughput.

It looks like you're still trying to avoid register reuse even in write-only cases like the destination of a vmovaps load, so you ran out of registers after unrolling by 8. That's fine, but could matter for other cases.

Also, using ymm8-15 can slightly increase code-size if it means a 3-byte VEX prefix is needed instead of 2-byte. Fun fact: vpxor ymm7,ymm7,ymm8 needs a 3-byte VEX while vpxor ymm8,ymm8,ymm7 only needs a 2-byte VEX prefix. For commutative ops, sort source regs from high to low.

Our load bottleneck means the best-case FMA throughput is half the max, so we need at least 5 vector accumulators to hide their latency. 8 is good, so there's plenty of slack in the dependency chains to let them catch up after any delays from unexpected latency or competition for p0/p1. 7 or maybe even 6 would be fine, too: your unroll factor doesn't have to be a power of 2.

Unrolling by exactly 5 would mean that you're also right at the bottleneck for dependency chains. Any time an FMA doesn't run in the exact cycle its input is ready means a lost cycle in that dependency chain. This can happen if a load is slow (e.g. it misses in L1 cache and has to wait for L2), or if loads complete out of order and an FMA from another dependency chain steals the port this FMA was scheduled for. (Remember that scheduling happens at issue time, so the uops sitting in the scheduler are either port0 FMA or port1 FMA, not an FMA that can take whichever port is idle).

If you leave some slack in the dependency chains, out-of-order execution can "catch up" on the FMAs, because they won't be bottlenecked on throughput or latency, just waiting for load results. @Forward found (in an update to the question) that unrolling by 5 reduced performance from 93% of L1D throughput to 89.5% for this loop.

My guess is that unroll by 6 (one more than the minimum to hide the latency) would be ok here, and get about the same performance as unroll by 8. If we were closer to maxing out FMA throughput (rather than just bottlenecked on load throughput), one more than the minimum might not be enough.

update: @Forward's experimental test shows my guess was wrong. There isn't a big difference between unroll5 and unroll6. Also, unroll15 is twice as close as unroll8 to the theoretical max throughput of 2x 256b loads per clock. Measuring with just independent loads in the loop, or with independent loads and register-only FMA, would tell us how much of that is due to interaction with the FMA dependency chain. Even the best case won't get perfect 100% throughput, if only because of measurement errors and disruption due to timer interrupts. (Linux perf measures only user-space cycles unless you run it as root, but time still includes time spent in interrupt handlers. This is why your CPU frequency might be reported as 3.87GHz when run as non-root, but 3.900GHz when run as root and measuring cycles instead of cycles:u.)


We aren't bottlenecked on front-end throughput, but we can reduce the fused-domain uop count by avoiding indexed addressing modes for non-mov instructions. Fewer is better and makes this more hyperthreading-friendly when sharing a core with something other than this.

The simple way is just to do two pointer-increments inside the loop. The complicated way is a neat trick of indexing one array relative to the other:

;; input pointers for x[] and y[] in rdi and rsi
;; size_t n  in rdx

    ;;; zero ymm1..8, or load+vmulps into them

    add             rdx, rsi             ; end_y
    ; lea rdx, [rdx+rsi-252]  to break out of the unrolled loop before going off the end, with odd n

    sub             rdi, rsi             ; index x[] relative to y[], saving one pointer increment

.unroll8:
    vmovaps         ymm0, [rdi+rsi]            ; *px, actually py[xy_offset]
    vfmadd231ps     ymm1, ymm0, [rsi]          ; *py

    vmovaps         ymm0,       [rdi+rsi+32]   ; write-only reuse of ymm0
    vfmadd231ps     ymm2, ymm0, [rsi+32]

    vmovaps         ymm0,       [rdi+rsi+64]
    vfmadd231ps     ymm3, ymm0, [rsi+64]

    vmovaps         ymm0,       [rdi+rsi+96]
    vfmadd231ps     ymm4, ymm0, [rsi+96]

    add             rsi, 256       ; pointer-increment here
                                   ; so the following instructions can still use disp8 in their addressing modes: [-128 .. +127] instead of disp32
                                   ; smaller code-size helps in the big picture, but not for a micro-benchmark

    vmovaps         ymm0,       [rdi+rsi+128-256]  ; be pedantic in the source about compensating for the pointer-increment
    vfmadd231ps     ymm5, ymm0, [rsi+128-256]
    vmovaps         ymm0,       [rdi+rsi+160-256]
    vfmadd231ps     ymm6, ymm0, [rsi+160-256]
    vmovaps         ymm0,       [rdi+rsi-64]       ; or not
    vfmadd231ps     ymm7, ymm0, [rsi-64]
    vmovaps         ymm0,       [rdi+rsi-32]
    vfmadd231ps     ymm8, ymm0, [rsi-32]

    cmp             rsi, rdx
    jb              .unroll8                 ; } while(py < endy);

Using a non-indexed addressing mode as the memory operand for vfmaddps lets it stay micro-fused in the out-of-order core, instead of being un-laminated at issue. Micro fusion and addressing modes

So my loop is 18 fused-domain uops for 8 vectors. Yours takes 3 fused-domain uops for each vmovaps + vfmaddps pair, instead of 2, because of un-lamination of indexed addressing modes. Both of them still of course have 2 unfused-domain load uops (port2/3) per pair, so that's still the bottleneck.

Fewer fused-domain uops lets out-of-order execution see more iterations ahead, potentially helping it absorb cache misses better. It's a minor thing when we're bottlenecked on an execution unit (load uops in this case) even with no cache misses, though. But with hyperthreading, you only get every other cycle of front-end issue bandwidth unless the other thread is stalled. If it's not competing too much for load and p0/1, fewer fused-domain uops will let this loop run faster while sharing a core. (e.g. maybe the other hyper-thread is running a lot of port5 / port6 and store uops?)

Since un-lamination happens after the uop-cache, your version doesn't take extra space in the uop cache. A disp32 with each uop is ok, and doesn't take extra space. But bulkier code-size means the uop-cache is less likely to pack as efficiently, since you'll hit 32B boundaries before uop cache lines are full more often. (Actually, smaller code doesn't guarantee better either. Smaller instructions could lead to filling a uop cache line and needing one entry in another line before crossing a 32B boundary.) This small loop can run from the loopback buffer (LSD), so fortunately the uop-cache isn't a factor.


Then after the loop: Efficient cleanup is the hard part of efficient vectorization for small arrays that might not be a multiple of the unroll factor or especially the vector width

    ...
    jb

    ;; If `n` might not be a multiple of 4x 8 floats, put cleanup code here
    ;; to do the last few ymm or xmm vectors, then scalar or an unaligned last vector + mask.

    ; reduce down to a single vector, with a tree of dependencies
    vaddps          ymm1, ymm2, ymm1
    vaddps          ymm3, ymm4, ymm3
    vaddps          ymm5, ymm6, ymm5
    vaddps          ymm7, ymm8, ymm7

    vaddps          ymm0, ymm3, ymm1
    vaddps          ymm1, ymm7, ymm5

    vaddps          ymm0, ymm1, ymm0

    ; horizontal within that vector, low_half += high_half until we're down to 1
    vextractf128    xmm1, ymm0, 1
    vaddps          xmm0, xmm0, xmm1
    vmovhlps        xmm1, xmm0, xmm0        
    vaddps          xmm0, xmm0, xmm1
    vmovshdup       xmm1, xmm0
    vaddss          xmm0, xmm1
    ; this is faster than 2x vhaddps

    vzeroupper    ; important if returning to non-AVX-aware code after using ymm regs.
    ret           ; with the scalar result in xmm0

For more about the horizontal sum at the end, see Fastest way to do horizontal SSE vector sum (or other reduction). The two 128b shuffles I used don't even need an immediate control byte, so it saves 2 bytes of code size vs. the more obvious shufps. (And 4 bytes of code-size vs. vpermilps, because that opcode always needs a 3-byte VEX prefix as well as an immediate). AVX 3-operand stuff is very nice compared the SSE, especially when writing in C with intrinsics so you can't as easily pick a cold register to movhlps into.

Robedechambre answered 15/7, 2017 at 4:30 Comment(19)
Hi, Peter Cordes, thanks very much, I get it the dependency is the register xmm0, and the addss is the bottleneck. At the beginning, I see cmp and add could run in port0, port1,port5,port5, so I put a * on cmp and add to show it could run in many ports... well I do not know there is a special meaning about "*", I have fixed it.Perfumery
what do you think of that, actually there are 1.19 uops per loop in the port 5, it is much more than expect 0.5, is it the matter about the uops dispatcher trying to make uops on every port same ?Perfumery
Ah, I didn't read the whole text of your question, since it was long and I could see the answer from the code, especially after Ross's hint. IACA uses a notation like 2^ to indicate micro-fusion, but it's not a standard thing at all. It was just my first guess when trying to figure out what you were aiming for.Robedechambre
@Forward: Added a section to my answer about how uops are assigned to ports and where you went wrong. Good question, and good guess.Robedechambre
i++ when i is 2^15-1 and i has been declared short is not UB. i++ expands to i = (short) ((int) i + 1); and the implementation-defined behavior of the overflow in the conversion from int to short must occur. GCC's code transformation is nevertheless correct.Gomuti
@PeterCordes, thanks for your patient and professional analysis,I need sometime to try it and understand you.Perfumery
@Forward: yeah, I didn't limit this answer to beginner-level stuff :P This seemed like a good place to try to write a canonical version of how to count latency, front-end uops, and execution-port uops. And then if I'm going to link here from other answers, I might as well go into lots of interesting details for anyone of any experience level that wants to read them. :) Please ask more good questions like this in the future, if you're still stuck after reading Agner Fog's guides (esp. the microarch one), and searching on SO. There are some good x86 perf answers here (some of them mine :)Robedechambre
@PeterCordes, I unroll the loop by 5, I expect to meet the same bottleneck L1D. But this time the bandwidth of L1D is 89.45%, a little smaller than unrolling by 8. Do you have any idea?Perfumery
@Forward: Good question. Unrolling by 5 is only just barely enough to hide the FMA latency with perfect uop scheduling. Any time an FMA uop doesn't run in the cycle when its input becomes ready, you've lost a cycle in that dependency chain and can never catch back up. Having at least 6 dep chains running leaves some slack for out-of-order execution to "catch up" on any one dep chain after it was delayed due to resource conflicts or whatever.Robedechambre
@Forward: updated my answer with some more stuff about unroll by exactly 5. It would be interesting to test with 6 and see if it's as good as 8 or 10. (My theory is that it should be).Robedechambre
@PeterCordes, I try to unroll loop by 5,6,7,8,15, I find the unrolling factor is bigger and the usage of L1D bandwidth is better, and I aslo find it seems CodeMiss, L1D Miss, L2 Miss do nothing matter about the usage of bandwidth.Perfumery
@Forward: So 15 is measurably faster than 8? That's interesting. I would have thought that 8 would be plenty to hide any scheduling weirdness due to loads being scheduled to ports in a way that produced results in a different order than the FMAs were scheduled. I'm not surprised there are no actual cache misses of any kind (especially not code misses); maxing out load-port throughput is more likely to be about uop scheduling.Robedechambre
@PeterCordes, yes, in my test, 15 is measurably faster than 8, but only a little, you can see the best case in 8 is similar to the worst case in 15.Perfumery
@PeterCordes, you mean you agree with me that, all of the cache misses do not act on the bandwidth usage in these cases?Perfumery
@Forward: But anyway, the failure to reach 100% L1D bandwidth is probably more to do with uop scheduling, and its interaction with dependency chains. That probably explains it even with zero cache misses. (You might want to replace the FMAs with loads, so the only thing in the loop is independent loads with no dependency chains. If that runs faster, then it was the FMA dep chains that were holding things back. Especially if you can put in FMAs with register operands that don't interact with the loads, and that still runs fast.)Robedechambre
@Forward: Oh cool, you put your data in the question. :) Didn't notice when first replying. Unroll15 is almost twice as good as unroll8: half the amount of "lost" throughput, twice as close to theoretical max. Those cache-miss numbers are very very small, like we'd expect, far fewer than one miss per 2k dot-product. It's probably just warm-up effects as your program is starting, and maybe a bit from evictions due to OS timer interrupts. Nothing to worry about, and I think too small to explain being 2% off of theoretical max. So it's probably just uop scheduling.Robedechambre
@Forward: minor update to my answer, mostly just putting in what I just said in comments. Thanks again for doing the work of testing my theories and guesses, this is fun :)Robedechambre
@PeterCordes, I don't get data by perf, I use testp provided by Agner, it read data from a driver in linux kernel, and the driver read Intel PMU register directly.Perfumery
@Forward: Ok. So your counts will include events that happen in kernel interrupt handlers. That may be where the cache misses are happening. IRQs will definitely limit you to less than 100%, but IDK whether the theoretical limit will be 99.9% or 99.1%. I guess you could test with something that's well-known to schedule nearly perfectly, like 8 independent add r32,r32 to just max out the ALUs without touching memory. See how close to 4.0 uops per clock that gets according to your measurement method, and that's your "goal".Robedechambre

© 2022 - 2024 — McMap. All rights reserved.