Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%

I'm attempting to obtain full bandwidth in the L1 cache for the following function on Intel processors:

float triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    for(int i=0; i<n; i++) {
        z[i] = x[i] + k*y[i];
    }
}

This is the triad function from STREAM.

With this function (written in assembly with NASM) I get about 95% of the peak on Sandy Bridge/Ivy Bridge processors. However, on Haswell I only achieve 62% of the peak unless I unroll the loop. If I unroll 16 times I get 92%. I don't understand this.
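
For reference, here is the arithmetic behind "peak" (1.3 GHz is the operating frequency of my i5-4250U, used in the test code below):

2 x 32-byte loads + 1 x 32-byte store = 96 bytes/cycle
peak = 1.3 GHz * 96 bytes/cycle       = 124.8 GB/s
62% of peak                           = ~77 GB/s
92% of peak                           = ~115 GB/s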

I decided to write my function in assembly using NASM. The main loop in assembly looks like this.

.L2:
    vmovaps         ymm1, [rdi+rax]
    vfmadd231ps     ymm1, ymm2, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2

It turns out that in examples 12.7-12.11 of his Optimizing Assembly manual, Agner Fog does almost the same thing (but for y[i] = y[i] + k*x[i]) for the Pentium M, Core 2, Sandy Bridge, FMA4, and FMA3. I managed to reproduce his code more or less on my own (he actually has a small bug in the FMA3 example when he broadcasts). He gives instruction size counts, fused μops, and execution ports in tables for each processor except FMA4 and FMA3. I have tried to make this table myself for FMA3.

                                 ports
             size   μops-fused   0   1   2   3   4   5   6   7    
vmovaps      5      1                    ½   ½
vfmadd231ps  6      1            ½   ½   ½   ½
vmovaps      5      1                            1           1
add          4      ½                                    ½
jne          2      ½                                    ½
--------------------------------------------------------------
total       22      4            ½   ½   1   1   1   0   1   1

Size refers to the instruction length in bytes. The reason the add and jne instructions have half a μop each is that they get fused into one macro-op (not to be confused with micro-fusion, which still uses multiple ports) and together need only one μop and port 6. The vfmadd231ps instruction can use port 0 or port 1; I chose port 0. The load vmovaps can use port 2 or 3; I chose port 2 and had the load of vfmadd231ps use port 3. In order to be consistent with Agner Fog's tables, and since I think it makes more sense to say that an instruction which can go to different ports equally goes to each one half of the time, I assigned ½ to each of the ports that vmovaps and vfmadd231ps can go to.

Based on this table and the fact that all Intel processors since Core2 can issue four μops every clock cycle, it appears this loop should run in one clock cycle per iteration, but I have not managed to obtain that. Can somebody please explain to me why I can't get close to the peak bandwidth for this function on Haswell without unrolling? Is this possible without unrolling, and if so, how can it be done? To be clear, I'm really trying to maximize the ILP of this function (I don't want only maximum bandwidth), which is why I don't want to unroll.
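
Counting the loop in the fused domain makes the expectation explicit:

vmovaps (load)       1 fused μop
vfmadd231ps + load   1 fused μop  (micro-fused)
vmovaps (store)      1 fused μop  (store-address + store-data micro-fused)
add + jne            1 fused μop  (macro-fused)
------------------------------------------------
total                4 fused μops -> 1 iteration/cycle -> 96 bytes/cycle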

Edit: Here is an update, since Iwillnotexist Idonotexist showed using IACA that the stores never use port 7. I managed to break the 66% barrier and (theoretically) do one iteration per clock cycle without unrolling. Let's first address the store problem.

Stephen Canon mentioned in a comment that the address generation unit (AGU) on port 7 can only handle simple operations such as [base + offset] and not [base + index]. In the Intel optimization reference manual the only thing I found was a note on port 7 saying "Simple_AGU", with no definition of what simple means. But then Iwillnotexist Idonotexist found in the comments of IACA that this problem was already mentioned six months ago, when an Intel employee wrote on 03/11/2014:

Port7 AGU can only work on stores with simple memory address (no index register).
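
To make "simple" concrete, here is my reading of which store forms qualify for port 7 (the annotations are my interpretation of that statement, not something Intel spells out case by case):

vmovaps [rdx], ymm1           ; [base]                -> simple, port 7 eligible
vmovaps [rdx + 128], ymm1     ; [base + offset]       -> simple, port 7 eligible
vmovaps [rdx + rax], ymm1     ; [base + index]        -> port 2 or 3 only
vmovaps [rdx + 4*rax], ymm1   ; [base + scale*index]  -> port 2 or 3 only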

Stephen Canon suggested "using the store address as the offset for the load operands." I tried this:

vmovaps         ymm1, [rdi + r9 + 32*i]
vfmadd231ps     ymm1, ymm2, [rsi + r9 + 32*i]
vmovaps         [r9 + 32*i], ymm1
add             r9, 32*unroll
cmp             r9, rcx
jne             .L2

This indeed causes the store to use port 7. However, it has another problem: the vfmadd231ps does not fuse with the load, which you can see from IACA. It also needs an additional cmp instruction which my original function did not. So the store uses one less micro-op, but the cmp (or rather the add, since the cmp macro-fuses with the jne) needs one more. IACA reports a block throughput of 1.5. In practice this only gets about 57% of the peak.
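
The 1.5 follows directly from the μop count once the load no longer micro-fuses:

vmovaps (load)   1 fused μop
vfmadd231ps      2 fused μops  (the load did not micro-fuse)
vmovaps (store)  1 fused μop
add              1 fused μop
cmp + jne        1 fused μop   (macro-fused)
--------------------------------------------
total            6 fused μops / 4 per cycle = 1.5 cycles/iteration -> at most 66%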

But I found a way to get the vfmadd231ps instruction to fuse with the load as well. This can only be done using static arrays with [absolute 32-bit address + index] addressing, like this. Evgeny Kluev originally suggested this.

vmovaps         ymm1, [src1_end + rax]
vfmadd231ps     ymm1, ymm2, [src2_end + rax]
vmovaps         [dst_end + rax], ymm1
add             rax, 32
jl              .L2

Where src1_end, src2_end, and dst_end are the end addresses of static arrays.

This reproduces the table in my question with the four fused micro-ops I expected. If you put this into IACA it reports a block throughput of 1.0. In theory this should do as well as the SSE and AVX versions. In practice it gets about 72% of the peak. That breaks the 66% barrier, but it's still a long way from the 92% I get unrolling 16 times. So on Haswell the only option to get close to the peak is to unroll. This is not necessary on Core2 through Ivy Bridge, but it is on Haswell.
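
For completeness, here is a minimal sketch of how the static arrays can be set up in NASM for this [absolute 32-bit address + index] form. The names and the size N are illustrative, and it assumes a non-PIE link so the symbols resolve to 32-bit absolute addresses:

%assign N 2048                       ; floats per array (illustrative)

section .bss
alignb 64
src1:   resb N*4
src2:   resb N*4
dst:    resb N*4

section .data
pi:     dd 3.14159

section .text
global triad_fma_static
triad_fma_static:
    vbroadcastss    ymm2, [rel pi]
    mov             rax, -N*4        ; negative byte index, counts up toward zero
align 16
.L2:
    vmovaps         ymm1, [src1 + N*4 + rax]        ; [abs32 + index] load
    vfmadd231ps     ymm1, ymm2, [src2 + N*4 + rax]  ; load micro-fuses with the FMA
    vmovaps         [dst + N*4 + rax], ymm1
    add             rax, 32
    jl              .L2
    vzeroupper
    ret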

End_edit:

Here is the C/C++ Linux code to test this. The NASM code is posted after the C/C++ code. The only thing you have to change is the frequency number: in the line double frequency = 1.3; replace 1.3 with whatever the operating (not nominal) frequency of your processor is (which in the case of my i5-4250U with turbo disabled in the BIOS is 1.3 GHz).

Compile with

nasm -f elf64 triad_sse_asm.asm
nasm -f elf64 triad_avx_asm.asm
nasm -f elf64 triad_fma_asm.asm
g++ -m64 -lrt -O3 -mfma  tests.cpp triad_fma_asm.o -o tests_fma
g++ -m64 -lrt -O3 -mavx  tests.cpp triad_avx_asm.o -o tests_avx
g++ -m64 -lrt -O3 -msse2 tests.cpp triad_sse_asm.o -o tests_sse

The C/C++ code

#include <x86intrin.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define TIMER_TYPE CLOCK_REALTIME

extern "C" float triad_sse_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_sse_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);    
extern "C" float triad_avx_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_avx_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat); 
extern "C" float triad_fma_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_fma_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);

#if (defined(__FMA__))
float triad_fma_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_fmadd_ps(k4, _mm256_load_ps(&y[i]), _mm256_load_ps(&x[i])));
        }
    }
}
#elif (defined(__AVX__))
float triad_avx_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
        }
    }
}
#else
float triad_sse_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m128 k4 = _mm_set1_ps(k);
        for(i=0; i<n; i+=4) {
            _mm_store_ps(&z[i], _mm_add_ps(_mm_load_ps(&x[i]), _mm_mul_ps(k4, _mm_load_ps(&y[i]))));
        }
    }
}
#endif

double time_diff(timespec start, timespec end)
{
    timespec temp;
    if ((end.tv_nsec-start.tv_nsec)<0) {
        temp.tv_sec = end.tv_sec-start.tv_sec-1;
        temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec-start.tv_sec;
        temp.tv_nsec = end.tv_nsec-start.tv_nsec;
    }
    return (double)temp.tv_sec +  (double)temp.tv_nsec*1E-9;
}

int main () {
    int bytes_per_cycle = 0;
    double frequency = 1.3;  //Haswell
    //double frequency = 3.6;  //IB
    //double frequency = 2.66;  //Core2
    #if (defined(__FMA__))
    bytes_per_cycle = 96;
    #elif (defined(__AVX__))
    bytes_per_cycle = 48;
    #else
    bytes_per_cycle = 24;
    #endif
    double peak = frequency*bytes_per_cycle;

    const int n =2048;

    float* z2 = (float*)_mm_malloc(sizeof(float)*n, 64);
    char *mem = (char*)_mm_malloc(1<<18,4096);
    char *a = mem;
    char *b = a+n*sizeof(float);
    char *c = b+n*sizeof(float);

    float *x = (float*)a;
    float *y = (float*)b;
    float *z = (float*)c;

    for(int i=0; i<n; i++) {
        x[i] = 1.0f*i;
        y[i] = 1.0f*i;
        z[i] = 0;
    }
    int repeat = 1000000;
    timespec time1, time2;
    #if (defined(__FMA__))
    triad_fma_repeat(x,y,z2,n,repeat);
    #elif (defined(__AVX__))
    triad_avx_repeat(x,y,z2,n,repeat);
    #else
    triad_sse_repeat(x,y,z2,n,repeat);
    #endif

    while(1) {
        double dtime, rate;

        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_asm_repeat(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_asm_repeat(x,y,z,n,repeat);
        #else
        triad_sse_asm_repeat(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("unroll1     rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_repeat(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_repeat(x,y,z,n,repeat);
        #else
        triad_sse_repeat(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("intrinsic   rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_asm_repeat_unroll16(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_asm_repeat_unroll16(x,y,z,n,repeat);
        #else
        triad_sse_asm_repeat_unroll16(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("unroll16    rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
    }
}

The NASM code using the System V AMD64 ABI.

triad_fma_asm.asm:

global triad_fma_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;z[i] = y[i] + 3.14159*x[i]
pi: dd 3.14159
;align 16
section .text
    triad_fma_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]
    ;neg                rcx 

align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    vmovaps         ymm1, [rdi+rax]
    vfmadd231ps     ymm1, ymm2, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_fma_asm_repeat_unroll16
section .text
    triad_fma_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    vbroadcastss    ymm2, [rel pi]  
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
.L2:
    %assign unroll 32
    %assign i 0 
    %rep    unroll
        vmovaps         ymm1, [r9 + 32*i]
        vfmadd231ps     ymm1, ymm2, [r10 + 32*i]
        vmovaps         [r11 + 32*i], ymm1
    %assign i i+1 
    %endrep
    add             r9, 32*unroll
    add             r10, 32*unroll
    add             r11, 32*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

triad_avx_asm.asm:

global triad_avx_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
align 16
section .text
    triad_avx_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]
    ;neg                rcx 

align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_avx_asm_repeat2
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;pi: dd 3.14159
align 16
section .text
    triad_avx_asm_repeat2:
    shl             rcx, 2  
    vbroadcastss    ymm2, [rel pi]

align 16
.L1:
    xor             rax, rax
align 16
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             eax, 32
    cmp             eax, ecx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_avx_asm_repeat_unroll16
align 16
section .text
    triad_avx_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    vbroadcastss    ymm2, [rel pi]  
align 16
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
align 16
.L2:
    %assign unroll 16
    %assign i 0 
    %rep    unroll
        vmulps          ymm1, ymm2, [r9 + 32*i]
        vaddps          ymm1, ymm1, [r10 + 32*i]
        vmovaps         [r11 + 32*i], ymm1
    %assign i i+1 
    %endrep
    add             r9, 32*unroll
    add             r10, 32*unroll
    add             r11, 32*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

triad_sse_asm.asm:

global triad_sse_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
;align 16
section .text
    triad_sse_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
    ;neg                rcx 
align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    movaps          xmm1, [rdi+rax]
    mulps           xmm1, xmm2
    addps           xmm1, [rsi+rax]
    movaps          [rdx+rax], xmm1
    add             rax, 16
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret

global triad_sse_asm_repeat2
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;pi: dd 3.14159
;align 16
section .text
    triad_sse_asm_repeat2:
    shl             rcx, 2  
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
align 16
.L1:
    xor             rax, rax
align 16
.L2:
    movaps          xmm1, [rdi+rax]
    mulps           xmm1, xmm2
    addps           xmm1, [rsi+rax]
    movaps          [rdx+rax], xmm1
    add             eax, 16
    cmp             eax, ecx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret



global triad_sse_asm_repeat_unroll16
section .text
    triad_sse_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
.L2:
    %assign unroll 8
    %assign i 0 
    %rep    unroll
        movaps          xmm1, [r9 + 16*i]
        mulps           xmm1, xmm2
        addps           xmm1, [r10 + 16*i]
        movaps          [r11 + 16*i], xmm1
    %assign i i+1 
    %endrep
    add             r9, 16*unroll
    add             r10, 16*unroll
    add             r11, 16*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret
Exterior answered 17/9, 2014 at 20:3

Comments (54):
@dolan I have two 32 byte loads, one 32 byte write, one 32-byte FMA, one 64-bit add, and one conditional branch which in total uses 4 fused micro-ops and 6/8 ports. What more instruction level parallelism can I do?Exterior
Just out of curiosity: how do you measure the percentages of peak bandwidth you mention in your post?Mispronounce
@rubenvb, it's done in the line double rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;. The percentage is 100*rate/peak. The peak is frequency*96 which in my case is 1.3*96=124.8 billion bytes/sec. The 96 is 32*2 byte reads + 32*1 byte write.Exterior
I think you should ask MysticialPolitics
A table of results can be found for SSE, AVX, and FMA at stackoverflow.com/questions/25921612/…. You can clearly see there the problem on Haswell.Exterior
In order for the FMA version to run at 100%, it needs to saturate all 8 ports on every cycle. (0+1 - 2xFMA, 2+3 - 2xload, 7+4, 1xstore, 5 - add, 6 - jmp). Furthermore, you have a total of 6 uops in the unfused domain and 4 uops in the fused domain. Haswell can only retire 4 uops per cycle, but it's not clear whether it's 4 in the fused or unfused domains. Even if we assume the former, Agner Fog said that it's basically impossible to sustain all 8 ports every cycle.Romonaromonda
Assuming the add/jmp are fused, you still need to sustain 7 of the 8 ports every cycle. The problem is that some of the instructions can go into multiple ports and the CPU has to pick optimally the right ones to avoid blocking others. Optimal packing is an NP-complete problem, not something I'd expect a processor to be able to do.Romonaromonda
@Mysticial, do you think this it is worth making this into a bounty question or would I just be wasting rep?Exterior
@Mysticial, I missed the part about having to saturate all 8 ports. That's a good point I need to read up on this. In the table I made above it uses six ports put there were a few variations I could choose.Exterior
You'd probably need someone from Intel to give you a definitive answer. All I'm saying is that I can only find 1 scheduling that could reach 100% (assuming the limit of 4 is for fused uops). But because there are so many ways to schedule it, the processor might not actually find the best one. For example, store needs 237 + 4. It has a choice between 2, 3, or 7. But it MUST pick 7 otherwise it will block a load. Likewise, a fused add/jmp can go into either 0 or 6. But it MUST pick 6 or it will block an FMA...Romonaromonda
Furthermore, if you issue two instructions with different latencies to the same port, they may finish on the same cycle. But the port can't output two things on the same cycle, so it bubbles. This isn't an issue here since the optimal scheduling has the same instructions going to the same ports. But as soon as the processor makes a wrong decision (such as putting add/jmp on port 0), it will start causing bubbles.Romonaromonda
@Mysticial, why does unrolling help? In fact I have to unroll a lot. I don't even get the peak unrolling 16 times. I don't know where the sweet spot is but I fully unrolled 256 times (8*256=2048) and that gets the highest efficiency I have seen (but it's still only about 93%).Exterior
Unrolling helps because you don't have an add and a jmp getting in the way. Hypothetically, each time the pair gets scheduled into port 0, it will block 2 FMAs. (once for the cycle it issues, and once for an earlier FMA that ends on the same cycle)Romonaromonda
@Mysticial, okay, that makes sense. I understand most of what you have said. I need to read through it more carefully and check Agner's manuals again. I'm going to make this a bounty question anyway. I'm not 100% convinced this is not possible. Well, I'm more or less convinced I can't do it the way I'm doing it. But I think it might be possible to do better than 93% without unrolling a large number of times.Exterior
@Romonaromonda @Z boson Per this paper published at Hot Chips by Intel, on page 14 we have: Up to four compound ops per cycle allocate resources for out-of-order execution and are split into simple ops. Up to eight simple ops per cycle can be executed by heterogeneous execution ports. Once complete, up to four compound ops can be retired per cycle. Each Haswell core shares its execution resources between two threads of execution via Intel Hyperthreading. I take this as meaning that it has a throughput of 4 fused uops/clock cycleFourteen
@Romonaromonda @Z boson running out of ideas for what might be causing this. I can reproduce this issue even when replacing vfmadd231ps ymm1, ymm2, [rsi+rax] with vaddps ymm1, [rsi+rax], and even with vmovaps ymm1, [rsi+rax] (!). The latter proves the issue isn't due to dependencies but due to decoding or quirks of execution. The core isn't overloaded - It has 2 vector reads, 1 vector write, a scalar ALU op and a branch. Perhaps uop fusion doesn't occur. Interestingly, but probably not relevant here, is that a predicted-taken branch takes 1-2 CC whereas a predicted not-taken takes 0.5-1 CC.Fourteen
@IwillnotexistIdonotexist, yeah, I'm thinking the macro-fusing is not happening. That would explain why I'm getting about 64 bytes/cycle instead of 96 bytes/cycle. Otherwise I would expect an efficiency between 66% and 100%.Exterior
@IwillnotexistIdonotexist @Mysticial, good observation on vaddps. That's why I unrolled eight times here stackoverflow.com/questions/25774190/…. It's basically the same thing but using addition instead of fma. You could also try vmulps but it won't make a difference: vfmaddxxxps, vmulps, or vaddps all have the same result in performance. I mean they each get a little less than 64 bytes/cycle instead of 96 bytes/cycle.Exterior
@Zboson: IMHO macro-fusion is working. Simple test (reorder instructions so that there is something in between add and jne) shows noticeable performance drop.Childers
@EvgenyKluev, good observation. I had actually already observed that but did not jump to the right conclusion.Exterior
@EvgenyKluev, the fact that the efficiency is less than 66% tells me that the bandwidth is never more than 64 bytes/cycle. This means it either reads 32*2 bytes per cycle or reads 32 and writes 32 per cycle. It never reads 32*2 bytes and writes 32 bytes in the same cycle. Otherwise I would think the efficiency would be above 66%. However, I can get above 90% by unrolling, which shows that Haswell is clearly capable of reading 32*2 bytes and writing 32 bytes in the same cycle.Exterior
I noticed that running "triad" code in a short loop (1) works as if branch mispredicts are very costly and (2) shows very different performance for different memory access modes, so [off] has almost ideal performance (but is useless), and [r+r] is much slower. I have no explanation for all this, but knowing these facts allows some optimizations: optimal array size is about 16KB (which is slightly out of L1 for 3 arrays) and best access mode is [r+off]. This code has ~75% efficiency (if not unrolled). Compile it as in OP, link with ld.Childers
@EvgenyKluev, I'm observing the same thing in regards to [off] and [r+r]. I tried unrolling using [r+r] with the add but without the jump but it never breaks 66%. I only get better than 66% with [r+offset] (but not with [base+mult*index+offset]). That's strange because [r+offset] produces longer instructions than [r+r] (7 bytes vs. 6). And thanks for the code. I learned a lot by reading it (I'm new to x86 assembly and NASM). I'll give it a try soon.Exterior
@EvgenyKluev, feel free to write up an answer. I will upvote what you have now.Exterior
@Romonaromonda and according to this source (realworldtech.com/haswell-cpu/2), there are one complex decoder and three simple decoders in Haswell. The complex one can spit out 4 fused uops per cycle whereas the simple ones only 1. But this loop should be fetching only from the uop decode queue as it is so small.Fourteen
@IwillnotexistIdonotexist, the way I understand it now is that since the loop fits in the μop-cache that we don't need to worry about theses decoders. Additionally, the decoders can only handle 16 byte instruction fetch per cycle whereas the μop-cache can handle 32 bytes/cycle (which you can see in the diagram). So I don't think Agner's statement about the last of the four decoders applies in this case. If you read section 12.9 "Same example on Sandy Bridge" in Agner's optimizing assembly manual I think it explains this (or at least that's what I read from it).Exterior
@EvgenyKluev, I finally ran your code. I did rcx, 2539062 ; 1.3GHz -> 1 sec/clock. This finishes in 1.375 seconds. That's about 73%. I like your solution skipping C. How do you time it? Given an executable evgengy I did time ./evgengy. Now I need some time to diagnosis your code further.Exterior
@EvgenyKluev, what do you mean by [off] in your code? If I just read and write to the same offset it takes 5.6 seconds (which I think is due to read after write stalls).Exterior
I also use time utility to measure execution time. Only before computing the inverse I subtract 0.05 sec. for load+warmup+etc. To get the exact value to be subtracted just run the program with empty loop (only add/jne). By [off] I mean offset. Definitely I write to some [off] but then read from [off+64] to avoid stalls. BTW it's possible to improve unroll method and gain 2..3% more for performance. This code gives 88% if unrolled 2*2 times and 93% if unrolled 4*2 times.Childers
@EvgenyKluev, got it, I understood it was offset but did not think to try [offset+64] on the write. Now I see what you mean. Though I get about 90% doing that 1/(1.173-0.05). Sounds like you're making good progress. I'm not completely opposed to some unrolling. I just want to understand why it's necessary and if it's possible to do better than 93% or so.Exterior
@EvgenyKluev, BTW, Intel's MKL on matrix mult gets 97% efficiency for larger n on AVX. However, last time I tried it on Haswell with FMA it got less than 90%. I infer that this means that even Intel is struggling to get high efficiency on Haswell. In other bandwidth tests that I do doing reductions (e.g. sum*=x[i]) I get only about 70%.Exterior
@EvgenyKluev, I tried your 4*2 unroll code. It gets 95% (subtracting 0.05 from the time). Nice work!Exterior
It seems this measurement method is not exact. At least for performances close to 100%. It's better to multiply constant by 10 and forget about warm-up time. This way I get not 95% or 93% but only 90%.Childers
@EvgenyKluev, yeah, I suspected that. I would not use it for precision but nevertheless it's progress. But it should be easy to put your code into my C code. That should be more precise. Just ignore the x, y, z registers (rdi, rsi, rdx) and use the repeat register r8.Exterior
@Romonaromonda Shoutout - problem solved. Mysticial, you're right in a weird way - the store obstinately refuses to use Port 7.Fourteen
@Romonaromonda In case you two were interested in the analysis for the 32x unrolled loop, it's below. Note how the store now goes to Port 7 and how no less than 6 units are now the joint throughput bottlenecks.Fourteen
@Romonaromonda Also I tried changing the unrolled version to have other unroll factors. At 1x the loop takes 1.75 CC, the store is split and the bottleneck is the frontend. At 2x the loop takes 2.5 CC, the stores aren't split and the bottleneck is still the frontend. At 4x, 8x, 16x and up the loop takes 4/8/16 CC, the stores are not split, and the bottleneck is jointly the frontend, Port 2 and 3's AGU and data loader, Port 4 and Port 7. The minimum unroll factor is therefore 4.Fourteen
@IwillnotexistIdonotexist That's pretty awesome. I guess it's still the case that it helps to give the processor enough "room" to reach steady-state.Romonaromonda
@IwillnotexistIdonotexist, unroll16 was probably not a good choice of name. It should have been unrolln. It's historic because I was unrolling manually until I started using the NASM macros for that (assembly is awesome). Just before I posted I must have tried unroll 32 and forgot to switch back (the SSE and AVX versions are still 16).Exterior
@IwillnotexistIdonotexist, what do you mean by "the frontend"?Exterior
@IwillnotexistIdonotexist, I think the efficiency with unrolling is eff = (3*unroll)/(3*unroll+4). So for unrolling 16 times the efficiency should be 92%. Eight times should be about 86%. Evgeny got 87% unrolling eight times at stackoverflow.com/questions/25774190/…. So I'm not totally sure my formula is correct but it seems about right.Exterior
@Z boson I'm referring to the output of the IACA tool, but the "frontend" means everything feeding the execution units, i.e. in front of them. The frontend being the bottleneck means that the uop cache or reorder buffer cannot keep the execution ports busy.Fourteen
@IwillnotexistIdonotexist, ahh.. now I see where this "frontend" is coming from. IACA reports "Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU". It gives you a summary in one line of the bottlenecks. Cool! I'm going to have to try IACA out.Exterior
@Zboson When I first noticed your question as a result of your huge bounty, I knew this tool would be useful (I've used it before), but I had forgotten its name! It took me two days to jog my memory. Anyways, to make IACA work with nasm you just need to put a particular magic sequence of bytes before the first instruction of a loop, and a different magic sequence of bytes immediately after the backwards branch. The program won't run anymore, but IACA picks up the magic and analyzes it.Fourteen
@IwillnotexistIdonotexist, I have a silly question. I'm trying out IACA now. I can put the markers into my intrinsic functions and get results I want but I'm not sure where to put them with my NASM code. When I put them between the NASM function triad_fma_asm_repeat it just sees the instruction for a function call. How did you do this?Exterior
@Zboson I added the start marker immediately before the first instruction of the innermost loop and immediately after the conditional branch backwards of same. IACA is somewhat limited in that it can only analyze an innermost loop (and I'm not even sure it will handle branches within) but then that is usually what one is interested in.Fourteen
@Zboson You can do it in NASM by saying db 0Dh, 0Fh, ...... at the relevant spot.Fourteen
@IwillnotexistIdonotexist, oh, I got it. You used the instructions in the header and inserted them directly into the NASM code. It's working now!Exterior
yeah, ;START_MARKER mov ebx, 111 db 0x64, 0x67, 0x90 ;END_MARKER mov ebx, 222 db 0x64, 0x67, 0x90Exterior
@IwillnotexistIdonotexist, there are two problems. The store does not use port 7 with [r+r], but even when the store is [r+offset] the loads don't do micro-op fusion with [r+r]. The only solution which gets 4 micro-ops per cycle without unrolling is to use static arrays with [r+offset].Exterior
@Zboson Have you tried for kicks [r+r] for loads as we're doing right now, but make only the destination array static? Also, you're actually sustaining 5.125 uops/CC right now: 164/32=5.125.Fourteen
@IwillnotexistIdonotexist, yes I have tried [r+r] with the loads and a static array only for the store with [r+offset]. The fma and load does not fuse still. So even after you fix the store there is still the problem of fusing the FMA and load. So I think the only solution is [r+offset] for the loads and stores. I'll try and look into that today.Exterior
why does the first code snippet have a wrong prototype? It doesn't even compile; the return type should be void, not floatUke
@cat, it compiles fine. Maybe it gives a warning. You are correct that it should be void.Exterior

IACA Analysis

Using IACA (the Intel Architecture Code Analyzer) reveals that macro-op fusion is indeed occurring, and that it is not the problem. It is Mysticial who is correct: The problem is that the store isn't using Port 7 at all.

IACA reports the following:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.55 Cycles       Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.5    0.0  | 0.5  | 1.5    1.0  | 1.5    1.0  | 1.0  | 0.0  | 1.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [rdi+rax*1]
|   2    | 0.5       | 0.5 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [rsi+rax*1]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     | CP | vmovaps ymmword ptr [rdx+rax*1], ymm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | add rax, 0x20
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffffec
Total Num Of Uops: 6

In particular, the reported block throughput of 1.55 cycles jibes very well with the observed efficiency of 66%.
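
To spell out the arithmetic: each iteration moves 96 bytes, so roughly 1.5 cycles per iteration is 96/1.5 = 64 bytes/cycle, and 64/96 ≈ 67% of peak.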

A post on IACA's own website about this very phenomenon on Tue, 03/11/2014 - 12:39 was met with this reply from an Intel employee on Tue, 03/11/2014 - 23:20:

Port7 AGU can only work on stores with simple memory address (no index register). This is why the above analysis doesn't show port7 utilization.

This firmly settles why Port 7 wasn't being used.

Now, contrast the above with a 32x unrolled loop (it turns out unroll16 should actually be called unroll32):

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 32.00 Cycles       Throughput Bottleneck: PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA, Port4, Port7

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 16.0   0.0  | 16.0 | 32.0   32.0 | 32.0   32.0 | 32.0 | 2.0  | 2.0  | 32.0 |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x20]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x20]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x20], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x40]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x40]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x40], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x60]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x60]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x60], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x80]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x80]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x80], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xa0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xa0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xa0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xc0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xc0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xc0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xe0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xe0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xe0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x100]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x100]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x100], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x120]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x120]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x120], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x140]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x140]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x140], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x160]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x160]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x160], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x180]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x180]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x180], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1e0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x200]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x200]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x200], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x220]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x220]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x220], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x240]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x240]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x240], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x260]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x260]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x260], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x280]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x280]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x280], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2e0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x300]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x300]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x300], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x320]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x320]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x320], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x340]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x340]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x340], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x360]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x360]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x360], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x380]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x380]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x380], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3e0], ymm1
|   1    |           |     |           |           |     | 1.0 |     |     |    | add r9, 0x400
|   1    |           |     |           |           |     |     | 1.0 |     |    | add r10, 0x400
|   1    |           |     |           |           |     | 1.0 |     |     |    | add r11, 0x400
|   1    |           |     |           |           |     |     | 1.0 |     |    | cmp r9, rcx
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffcaf
Total Num Of Uops: 164

We see here micro-fusion and correct scheduling of the store to Port 7.

Manual Analysis (see edit above)

I can now answer the second of your questions: "Is this possible without unrolling and if so how can it be done?" The answer is no.

I padded the arrays x, y and z to the left and right with plenty of buffer for the below experiment, and changed the inner loop to the following:

.L2:
vmovaps         ymm1, [rdi+rax] ; 1L
vmovaps         ymm0, [rsi+rax] ; 2L
vmovaps         [rdx+rax], ymm2 ; S1
add             rax, 32         ; ADD
jne             .L2             ; JMP

This intentionally does not use FMA (only loads and stores), and all load/store instructions have no dependencies, so there should be no hazards whatsoever preventing their issue into any execution ports.

I then tested every permutation of the first and second loads (1L and 2L), the store (S1) and the add (A), while leaving the conditional jump (J) at the end; for each of these I tested every possible combination of offsets of x, y and z by 0 or -32 bytes (to correct for the fact that reordering the add rax, 32 before one of the r+r indexes would cause the load or store to target the wrong address). The loop was aligned to 32 bytes. The tests were run on a 2.4 GHz i7-4700MQ with TurboBoost disabled by means of echo '0' > /sys/devices/system/cpu/cpufreq/boost under Linux, and using 2.4 for the frequency constant. Here are the efficiency results for all 24 permutations:

Cases: 0           1           2           3           4           5           6           7
       L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   
       -0  -0  -0  -0  -0  -32 -0  -32 -0  -0  -32 -32 -32 -0  -0  -32 -0  -32 -32 -32 -0  -32 -32 -32
       ________________________________________________________________________________________________
12SAJ: 65.34%      65.34%      49.63%      65.07%      49.70%      65.05%      49.22%      65.07%
12ASJ: 48.59%      64.48%      48.74%      49.69%      48.75%      49.69%      48.99%      48.60%
1A2SJ: 49.69%      64.77%      48.67%      64.06%      49.69%      49.69%      48.94%      49.69%
1AS2J: 48.61%      64.66%      48.73%      49.71%      48.77%      49.69%      49.05%      48.74%
1S2AJ: 49.66%      65.13%      49.49%      49.66%      48.96%      64.82%      49.02%      49.66%
1SA2J: 64.44%      64.69%      49.69%      64.34%      49.69%      64.41%      48.75%      64.14%
21SAJ: 65.33%*     65.34%      49.70%      65.06%      49.62%      65.07%      49.22%      65.04%
21ASJ: Hypothetically =12ASJ
2A1SJ: Hypothetically =1A2SJ
2AS1J: Hypothetically =1AS2J
2S1AJ: Hypothetically =1S2AJ
2SA1J: Hypothetically =1SA2J
S21AJ: 48.91%      65.19%      49.04%      49.72%      49.12%      49.63%      49.21%      48.95%
S2A1J: Hypothetically =S1A2J
SA21J: Hypothetically =SA12J
SA12J: 64.69%      64.93%      49.70%      64.66%      49.69%      64.27%      48.71%      64.56%
S12AJ: 48.90%      65.20%      49.12%      49.63%      49.03%      49.70%      49.21%*     48.94%
S1A2J: 49.69%      64.74%      48.65%      64.48%      49.43%      49.69%      48.66%      49.69%
A2S1J: Hypothetically =A1S2J
A21SJ: Hypothetically =A12SJ
A12SJ: 64.62%      64.45%      49.69%      64.57%      49.69%      64.45%      48.58%      63.99%
A1S2J: 49.72%      64.69%      49.72%      49.72%      48.67%      64.46%      48.95%      49.72%
AS21J: Hypothetically =AS12J
AS12J: 48.71%      64.53%      48.76%      49.69%      48.76%      49.74%      48.93%      48.69%

We can notice a few things from the table:

  • Several plateaux of results, but only two main ones: just under 50% and around 65%.
  • L1 and L2 can permute freely between each other without affecting the result.
  • Offsetting the accesses by -32 bytes can change efficiency.
  • The patterns we are interested in (Load 1, Load 2, Store 1 and Jump with the Add anywhere around them and the -32 offsets properly applied) are all the same, and all in the higher plateau:
    • 12SAJ Case 0 (No offsets applied), with efficiency 65.34% (the highest)
    • 12ASJ Case 1 (S-32), with efficiency 64.48%
    • 1A2SJ Case 3 (2L-32, S-32), with efficiency 64.06%
    • A12SJ Case 7 (1L-32, 2L-32, S-32), with efficiency 63.99%
  • There always exists at least one "case" for every permutation that allows execution at the higher plateau of efficiency. In particular, Case 1 (where S-32) seems to guarantee this.
  • Cases 2, 4 and 6 guarantee execution at the lower plateau. They have in common that either or both of the loads are offset by -32 while the store isn't.
  • For cases 0, 3, 5 and 7, it depends on the permutation.

Whence we may draw at least a few conclusions:

  • Execution ports 2 and 3 really don't care which load address they generate and load from.
  • Macro-op fusion of the add and jmp appears unimpacted by any permutation of the instructions (in particular under Case 1 offsetting), leading me to believe that @Evgeny Kluev's conclusion is incorrect: The distance of the add from the jne does not appear to impact their fusion. I'm reasonably certain now that the Haswell ROB handles this correctly.
    • What Evgeny was seeing (Going from 12SAJ with efficiency 65% to the others with efficiency 49% within Case 0) was an effect due solely to the value of the addresses loaded and stored from, and not due to an inability of the core to macro-fuse the add and branch.
    • Further, macro-op fusion must be occurring at least some of the time, since the average loop time is 1.5 CC. If macro-op fusion did not occur this would be 2CC minimum.
  • Having tested all valid and invalid permutations of instructions within the not-unrolled loop, we've seen nothing higher than 65.34%. This answers empirically with a "no" the question of whether it is possible to use the full bandwidth without unrolling.

I will hypothesize several possible explanations:

  • We're seeing some weird perversion due to the value of the addresses relative to each other.
    • If so then there would exist a set of offsets of x, y and z that would allow maximum throughput. Quick random tests on my part seem not to support this.
  • We're seeing the loop run in one-two-step mode; the loop iterations alternate running in one clock cycle, then two.

    • This could be macro-op fusion being affected by the decoders. From Agner Fog:

      Fuseable arithmetic/logic instructions cannot be decoded in the last of the four decoders on Sandy Bridge and Ivy Bridge processors. I have not tested whether this also applies to the Haswell.

    • Alternately, every other clock cycle an instruction is issued to the "wrong" port, blocking the next iteration for one extra clock cycle. Such a situation would be self-correcting in the next clock cycle but would remain oscillatory.
      • If somebody has access to the Intel performance counters, he should look at the events UOPS_EXECUTED_PORT.PORT_[0-7]. If oscillation is not occurring, all ports that are used will be pegged equally during the relevant stretch of time; else, if oscillation is occurring, there will be a 50% split. Especially important is to look at the ports Mysticial pointed out (0, 1, 6 and 7).

And here's what I think is not happening:

  • I don't believe that the fused arithmetic+branch uop is blocking execution by going to port 0, since predicted-taken branches are sent exclusively to port 6 (see Agner Fog's Instruction Tables under Haswell -> Control transfer instructions). After a few iterations of the loop above, the branch predictor will learn that this branch is a loop and always predict as taken.

I believe this is a problem that will be solved with Intel's performance counters.
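
For anyone who wants to check, something along these lines with Linux perf should show the per-port split (the event names are from the Haswell event list as I remember it and may vary with kernel version, so treat this as a sketch):

perf stat -e uops_executed_port.port_0,uops_executed_port.port_1,uops_executed_port.port_6,uops_executed_port.port_7 ./tests_fma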

Fourteen answered 22/9, 2014 at 2:54 Comment(51)
Very good answer (+1). And good observation on the fusing when moving the add. I guess that's why I hesitated on jumping to a conclusion. Your conclusion for [r+r] is correct. However, Evgeny claims that using static arrays and [absolute address + index] breaks the 66% barrier without unrolling. If you want access to the performance counters you can use Agner Fog's test programs. It has its own device driver for Linux and Windows to do this and is well documented. I should be using this.Exterior
@Zboson Looking at addressing modes now. Having some success: I can reach about 70% when using an extra register and an add, and doing [r] for the store.Fourteen
Awesome, I'm looking forward to seeing your results. It's strange that [r+r] can't break the 66% barrier.Exterior
What was the difference between rdx and other base registers (rsi, rdi) in your tests? In case it was a multiple of 4096, isn't it possible to explain all the results close to 49% by false sharing? (See "L1 memory bandwidth: 50% drop in efficiency..." for details). Also it should be noted that macro-fused instructions must be adjacent in the instruction stream (see section 3.4.2.2 of Intel Optimization manual, also mentioned in Agner's manuals).Childers
@EvgenyKluev, I understand your point about keeping the macro-fused instructions adjacent. However, Agner writes "The programmer should keep any fuseable arithmetic instruction together" but he does not write "must keep". So I assumed maybe it was not a requirement but made the fusion more likely. Let me check the Intel manuals.Exterior
@EvgenyKluev, all I found in the Intel manual is "The second instruction (adjacent in the instruction stream) should be a conditional branch." I guess that means they must be adjacent.Exterior
@EvgenyKluev, what about his 12ASJ case 1 (starting from 0)? It moves the add between the store and jump and still gets 65%. So that implies that the fusing is not affected (which could also mean it's never happening). I agree with you that the 50% drops could just be due to false sharing.Exterior
@Zboson: just two simple tests. Change addressing mode to [offset] and lengthen the loop to avoid too many mispredicts. Efficiency is ~95%. This cannot be explained if you suppose that fusing is never happening. Now exchange add and last vmovaps. Efficiency decreases to ~81%. Here obviously there is no fusing.Childers
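
(For readers trying to reproduce this: my best guess at Evgeny's [offset] test in NASM. The buffer names, the 8 KB size and the padding are my assumptions, and ymm2 is presumed to hold the broadcast k, as in the loop in the question.)

    section .bss
    alignb 64
    x_buf: resb 8192
           resb 256          ; padding so the arrays are not a multiple of 4096 apart
    y_buf: resb 8192         ; (see the aliasing discussion above)
           resb 256
    z_buf: resb 8192

    section .text
    ; ... rax counts from -8192 up to 0, as in the original loop ...
    .L2:
        vmovaps     ymm1, [x_buf+8192+rax]        ; rax is the base register;
        vfmadd231ps ymm1, ymm2, [y_buf+8192+rax]  ; the array address folds into
        vmovaps     [z_buf+8192+rax], ymm1        ; the 32-bit displacement
        add         rax, 32
        jne         .L2
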
@EvgenyKluev, okay, that makes sense. I'll try and test your code tonight. Are you going to post an answer?Exterior
@Zboson: not decided yet. I hope we'll find something with help of the performance counters. Or maybe (with a bit of luck) your question attracts the attention of somebody who knows the definite answer.Childers
@EvgenyKluev @Zboson It is true that in Case 0, rdx, rdi and rsi are all offset by 8192. I indeed thought it likely the plateau at 50% is exactly the phenomenon in that question you cite. But by trying more varied offsets than 0 or -32 I've been unable to improve efficiency to more than 66%.Fourteen
@EvgenyKluev @Zboson The reason I believe fusion is at least partially unimpacted is that if it were not, the loop time would be 2 cycles minimum; And yet with an efficiency of 65% in Case 1 for all cases (not just ***AJ), the loop time is 1.5 cycles, meaning fusion must be occurring at least half the time.Fourteen
@EvgenyKluev Is anything known about how many times a register can be used per cycle? rax is currently read 4 times and written 1 time per iteration, an incredibly heavy use of it for a loop that is to take 1 CC. Perhaps Haswell cannot forward the result of the add immediately to so many places?Fourteen
Very nice idea! I never heard of such a limitation. But tests show some improvement after substituting another register for rax in one instruction.Childers
@IwillnotexistIdonotexist, there was a problem with "register read stalls" in processors before SB. For example see Agner's Pentium M example for this processor. However, this problem went away on SB/IB. Maybe it's back. In fact the way Agner got the max throughput for this function on the Pentium M was by doing your 12ASJ by moving the add before the store and subtracting 16 (SSE) in the offset of the store. I tried this on Haswell but I got the same results you did, i.e. never better than 66%.Exterior
Good work again! So the 1.5 on ports 2 and 3 is the store that should have used port 7 being split between ports 2 and 3.Exterior
@Zboson I have no idea at all why the frontend would split between ports 2 and 3 the store when there's a unit dedicated to it on port 7. I guess this falls firmly under "quirks of execution".Fourteen
As far as I can tell the unrolled version still needs to choose port 7 in order to get the full bandwidth. It's interesting that it succeeds at this when not using the add and jump. It would be interesting to see the counters for the unrolled case. The most important lesson I learned from this is to stop philosophizing and look at the counters.Exterior
@Zboson I will run IACA on the unrolled loop ASAP for completeness. Just need to get back to my computer.Fourteen
There is another way to interpret the IACA analysis: some other reason drags performance below 66%, and as a consequence port 7 is not needed, so it is turned off completely for energy saving. The exact zero on this port supports this version. One such reason is a very expensive branch mispredict caused by long-latency instructions plus a relatively short loop. Other(s) are still unknown.Childers
@EvgenyKluev But this is a loop; It will get mispredicted at most twice. Possibly once at the beginning (when it first encounters the branch backwards) and once at the end (when the loop is broken out of). In all other cases the branch predictor will function correctly, and learn that this branch has so-called "loop behaviour". Admittedly what you suggest is entirely possible, but this kind of loop is very well predicted in general.Fourteen
@EvgenyKluev Also on the topic of energy saving, at present the processor alternates between sending the FMA to port 0 and port 1. It would save more power if it did that only on one or the other, since it could have gated the other port. And we know that port 0 and 1 each have a throughput of 1 FMA/CC, so there's no excuse on the part of the scheduler for not issuing the FMA into the same port all the time.Fourteen
Most likely this loop is mispredicted only once - at the end. (I think the start of the loop is predicted correctly). But this single mispredict is very costly: 15..20 clocks minimum mispredict penalty (according to Agner Fog). The loop needs 256 clocks to execute all instructions (for 8KB arrays), so best possible performance should be approximately 256/(256+15)=94%. Maybe less, because of large instruction latencies. No idea why it does not turn off one of the FMAs. Still, 0 instructions scheduled for port 7 looks like energy saving. I cannot imagine that the scheduler does it "by mistake".Childers
@EvgenyKluev Well, we now know what is going on (with Intel's own tool), but why is something we cannot know without an Intel employee peeking in here.Fourteen
The total μop count is the total of unfused μops (but a macro-fused pair still counts as one). So the total is (1+2+2)*32+4=164. However, the number of clock cycles is (1+1+1)*32+4=100. So the best efficiency is 96/100=96% unrolling 32 times.Exterior
"why isn't port 7 used" is easy to answer: port 7 can only handle "simple" AGU operations (base + immediate offset, IIRC). It can't do base + register offset. You can get around this by using the store address as the offset for the load operands.Palawan
@StephenCanon, woah..."It can't do base + register offset" I wish someone had said this much earlier on!!! Do you have a source for this?! What exactly do you mean by "You can get around this by using the store address as the offset for the load operands"?Exterior
@StephenCanon Now that would explain why I got an improvement of about 7% in the rolled loop by maintaining a separate pointer in r9 for the store and incrementing it in parallel with the add by a lea r9, [r9+32]. Where did you learn about this limitation of the Port 7 AGU?Fourteen
@IwillnotexistIdonotexist, Intel's optimization manual uses the expression Simple_AGU on port7. That's all I have found so far. I'm still trying to find the definition of Simple_AGU.Exterior
@Zboson He means that because the Port 7 AGU that must be used by the stores is more primitive than the Port 2 and 3 AGUs, the code must be optimized s.t. the store address has the simplest expression ([r+offset]) while the load addresses are based off it (r+r*1+offset). IOW, instead of using r+r for all accesses, design the pointers s.t. the store unit has the privilege of r+offset onlyFourteen
@IwillnotexistIdonotexist, I understand that. What I mean is I don't know how to do this with my code except for static arrays using [absolute 32-bit address + r]. And then incrementing r by 32. If you do the loads with [r+r] and the stores with [r+offset] then you need another instruction and this can't be done with 4 cycles. Thought it might be possible to do better than 66% this way.Exterior
@Zboson Actually, by your own scheduling in the question, there is room in ports 1 and 5, either of which (but in particular 5) is capable of arithmetic like an add or lea (see my comment above in response to Stephen), either of which is capable of incrementing a register by a small constant. There is just enough room in the rolled loop, and unrolling adds more.Fourteen
@IwillnotexistIdonotexist, there are ports available but no more μops. Another add pushes it to five. That's better than six but not four (this is beginning to sound like a Monty Python sketch). So at best you could get 80% efficiency without unrolling.Exterior
@Zboson I believe that this is where the unrolling comes in; At 4x and more there are enough timeslots in Port 5 that all pointer increments in the unrolled loop can happen.Fourteen
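
(To make that exchange concrete: a minimal NASM sketch, mine and not code posted in this thread, of the rolled loop with a separate store pointer. By the μop arithmetic above it issues 5 fused μops per iteration, so the front end caps it at 4/5 = 80% even if the store now reaches port 7; the measured results reported here were lower still, ~70% in one test and ~52% in another.)

    ; r9 points into z on entry; rdi, rsi, rax and ymm2 as in the original loop
    .L2:
        vmovaps     ymm1, [rdi+rax]        ; load x[i]            (port 2 or 3)
        vfmadd231ps ymm1, ymm2, [rsi+rax]  ; += k*y[i]            (FMA on port 0 or 1)
        vmovaps     [r9], ymm1             ; simple [base] store  (port 7 eligible)
        lea         r9, [r9+32]            ; bump store pointer   (port 1 or 5)
        add         rax, 32                ; macro-fuses with jne (port 6)
        jne         .L2
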
@IwillnotexistIdonotexist, what's strange is that Evgeny used static arrays using [offset + r] but still did not get high efficiency. I would think that this would use port 7 and get better than 90% efficiency without unrolling.Exterior
IACA says that this ought to work (88% efficiency with 4x unrolling); it's limited by the front-end. I don't have my HSW handy to test on, unfortunately.Palawan
@IwillnotexistIdonotexist, I don't get an improvement using vmovaps [r9], ymm1; lea r9, [r9+32]. In fact it's only about 52% efficient. However, I do get an improvement using a static array for the write vmovaps [dst_end+rax], ymm1. The efficiency is about 68%.Exterior
@IwillnotexistIdonotexist, I used the store address as the index in the loads with unrolling and it's definitely better. I get about 94% now unrolling 256 times. Unrolling four times gets about 77%.Exterior
@Zboson Interesting; I guess that IACA accounts only for the perfectly optimal case (No false sharing, latencies or other problems) when doing its analysis. At 4x unroll IACA is predicting perfect performance already.Fourteen
But not unrolling only gets 56%. I think because I have to do an add, cmp, and jcc instead of just an add and jcc. The only way to improve the code without unrolling is using a static array for the store but it's only 68% then anyway.Exterior
Oops... fully unrolling is a special case. I don't need any add/cmps. But in any case my integration time is too short so the error on the efficiency is a few percent: 92% seems to be about the best that can be achieved on average.Exterior
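
(Putting the pieces together: a sketch, with register choices and setup that are my assumptions rather than code from this thread, of the "store address as index for the loads" trick at a 4x unroll. The store keeps the port-7-friendly [base] or [base+disp] form, and the loads absorb the [base+index] addressing. By my count this is 14 fused μops per 4 vector iterations, i.e. 3.5 per clock, back under the front end's limit of 4.)

    ; setup, outside the loop:
    ;   sub rdi, rdx    ; rdi = x - z
    ;   sub rsi, rdx    ; rsi = y - z
    ;   rcx holds z + 4*n (one past the end of z)
    .L2:
        vmovaps     ymm1, [rdx+rdi]           ; x[i]   ([base+index]: port 2/3)
        vfmadd231ps ymm1, ymm2, [rdx+rsi]     ; k*y[i] ([base+index]: port 2/3)
        vmovaps     [rdx], ymm1               ; z[i]   ([base]: port 7 eligible)
        vmovaps     ymm3, [rdx+rdi+32]
        vfmadd231ps ymm3, ymm2, [rdx+rsi+32]
        vmovaps     [rdx+32], ymm3            ; [base+disp] is still "simple"
        vmovaps     ymm4, [rdx+rdi+64]
        vfmadd231ps ymm4, ymm2, [rdx+rsi+64]
        vmovaps     [rdx+64], ymm4
        vmovaps     ymm5, [rdx+rdi+96]
        vfmadd231ps ymm5, ymm2, [rdx+rsi+96]
        vmovaps     [rdx+96], ymm5
        add         rdx, 128
        cmp         rdx, rcx                  ; cmp/jne macro-fuse
        jne         .L2
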
@Zboson I know this is gonna make you laugh but... This same problem, question and answer were noticed, asked and answered on IACA's own website 6 months ago! See the comment on Tue, 03/11/2014 - 12:39.Fourteen
@IwillnotexistIdonotexist, that's spooky! You're totally right! See the comment immediately after "Port7 AGU can only work on stores with simple memory address (no index register)."Exterior
@IwillnotexistIdonotexist, and the next comment is "with an explicit (fast) lea to produce the base register on Port 1 or 5 and using that in the store improved the speed significantly, I wonder why the compiler didn't think of that when using -xHost. I even get ~94B/c in a simple load/store benchmark now." I'm not sure exactly what he did.Exterior
@IwillnotexistIdonotexist, I updated my question with a summary of our findings here. Let me know what you think.Exterior
@Zboson Cool! Looks good. For my part I wonder if I should make a question & answer on IACA and create its tag here.Fourteen
@IwillnotexistIdonotexist, I'm not sure exactly what you mean by " make a question & answer on IACA and create its tag here" but it sounds like a great idea to me. IACA from my point of view is gold and well worth the 500 rep I spent.Exterior
@IwillnotexistIdonotexist, haha, that's sneaky! But I'm glad you did it. Do you realize that you are on the first page of highest earning points for the week right now? Click on users and rank by week.Exterior
Great, now write one explaining how to use a lightweight sampling profiler to figure out which loops to focus on in IACA =)Palawan
Regarding the various uop limits: You can execute a uop on all 8 ports during a single cycle if appropriate uops are in the 192 entry ROB (reorder buffer) and all dependencies are satisfied. But there is a front end limit of 4 (unfused) uops per cycle that can enter the ROB. This applies even if the uops are coming from the tiny loop buffer or the ~1000 entry decoded uop buffer. There is an additional back end limit of 4 (fused) uops that can be retired per cycle. As a result, sustained throughput cannot exceed 4 uops per cycle. Unrolling loops helps iff it gets under these limits.Pena
Missed the edit window on a mistake: the decoded instruction buffer stores fused uops, so everything coming from it should be counted in the fused domain. Good diagram with per cycle limits here: pc.watch.impress.co.jp/video/pcw/docs/601/161/p21.pdfPena
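
Applying those limits to this loop, by my count: the original rolled loop issues 4 fused-domain μops per iteration (the load, the FMA micro-fused with its load, the store, and the macro-fused add/jne), exactly the 4-per-cycle front-end limit with no slack. The simple-store variants sketched above need an extra pointer increment or a separate cmp, i.e. 5 fused μops and a 1.25-cycle floor per iteration (hence the 80% ceiling mentioned earlier), while the 4x-unrolled variant issues 14 fused μops per 4 vector iterations (3.5 per clock), which is back under the limit and consistent with IACA predicting near-peak performance at 4x.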
