I'm attempting to obtain full bandwidth in the L1 cache for the following function on Intel processors
float triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    for(int i=0; i<n; i++) {
        z[i] = x[i] + k*y[i];
    }
}
This is the triad function from STREAM.
I get about 95% of the peak on Sandy Bridge/Ivy Bridge processors with this function (written in assembly with NASM). However, on Haswell I only achieve 62% of the peak unless I unroll the loop; if I unroll 16 times I get 92%. I don't understand this.
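To be concrete about what I mean by peak: on Haswell with FMA, the L1 can in principle service two 32-byte loads and one 32-byte store per clock cycle, i.e. 96 bytes/cycle, so at my 1.3 GHz operating frequency the peak is 1.3*96 = 124.8 GB/s (the same number the test code below computes); 62% of that is about 77 GB/s.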
I decided to write my function in assembly using NASM. The main loop in assembly looks like this.
.L2:
vmovaps ymm1, [rdi+rax]
vfmadd231ps ymm1, ymm2, [rsi+rax]
vmovaps [rdx+rax], ymm1
add rax, 32
jne .L2
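For reference, the setup that precedes this loop (it appears in the full listing further down) advances the three pointers to the ends of the arrays and makes rax a negative byte offset that counts up to zero:
shl rcx, 2                   ;rcx = 4*n bytes
add rdi, rcx                 ;rdi -> end of x
add rsi, rcx                 ;rsi -> end of y
add rdx, rcx                 ;rdx -> end of z
vbroadcastss ymm2, [rel pi]
mov rax, rcx
neg rax                      ;rax = -4*n, so [rdi+rax] starts at x[0]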
It turns out that in Agner Fog's Optimizing Assembly manual, examples 12.7-12.11, he does almost the same thing (but for y[i] = y[i] + k*x[i]) for the Pentium M, Core 2, Sandy Bridge, FMA4, and FMA3. I managed to reproduce his code more or less on my own (actually he has a small bug in the FMA3 example when he broadcasts). He gives instruction size counts, fused μops, and execution ports in tables for each processor except for FMA4 and FMA3. I have tried to make this table myself for FMA3.
                                 ports
             size   μops-fused   0   1   2   3   4   5   6   7
vmovaps      5      1                    ½   ½
vfmadd231ps  6      1            ½   ½   ½   ½
vmovaps      5      1                            1           1
add          4      ½                                    ½
jne          2      ½                                    ½
--------------------------------------------------------------
total        22     4            ½   ½   1   1   1   0   1   1
Size refers to the instruction length in bytes. The reason the add and jne instructions have half a μop is that they get fused into one macro-op (not to be confused with μop fusion, which still uses multiple ports) and together only need port 6 and one μop. The vfmadd231ps instruction can use port 0 or port 1. I chose port 0. The load vmovaps can use port 2 or 3. I chose 2 and had vfmadd231ps use port 3. In order to be consistent with Agner Fog's tables, and since I think it makes more sense to say an instruction which can go to different ports goes to each one 1/2 of the time, I assigned 1/2 for the ports that vmovaps and vfmadd231ps can go to.
Based on this table and the fact that Core2 and later processors can issue four fused μops every clock cycle, it appears this loop should be possible in one clock cycle per iteration, but I have not managed to obtain it. Can somebody please explain to me why I can't get close to the peak bandwidth for this function on Haswell without unrolling? Is this possible without unrolling, and if so how can it be done? Let me be clear that I'm really trying to maximize the ILP for this function (I don't only want maximum bandwidth), so that's the reason I don't want to unroll.
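To spell out the arithmetic: four fused μops per iteration issued at four fused μops per clock would be one iteration per clock, and each iteration moves 2*32 bytes of loads plus 32 bytes of stores = 96 bytes, i.e. the full L1 bandwidth.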
Edit: Here is an update, since Iwillnotexist Idonotexist showed using IACA that the stores never use port 7. I managed to break the 66% barrier without unrolling and to do one iteration every clock cycle (theoretically). Let's first address the store problem.
Stephen Canon mentioned in a comment that the Address Generation Unit (AGU) on port 7 can only handle simple operations such as [base + offset] and not [base + index]. In the Intel optimization reference manual the only thing I found was a comment on port 7 which says "Simple_AGU", with no definition of what simple means. But then Iwillnotexist Idonotexist found in the comments of IACA that this problem was already mentioned six months ago, in which an employee at Intel wrote on 03/11/2014:
Port7 AGU can only work on stores with simple memory address (no index register).
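In other words (an illustration, not lines from my loop), only a store whose address has no index register can be handled by the port 7 AGU:
vmovaps [rdx + 256], ymm1     ;[base + offset]: simple address, eligible for port 7
vmovaps [rdx + rax], ymm1     ;[base + index]: address must be generated on port 2 or 3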
Stephen Canon suggests "using the store address as the offset for the load operands." I tried this as follows:
vmovaps ymm1, [rdi + r9 + 32*i]
vfmadd231ps ymm1, ymm2, [rsi + r9 + 32*i]
vmovaps [r9 + 32*i], ymm1
add r9, 32*unroll
cmp r9, rcx
jne .L2
This indeed causes the store to use port 7. However, it has another problem, which is that the vfmadd231ps does not fuse with the load, which you can see from IACA. It also needs an additional cmp instruction which my original function did not. So the store uses one less micro-op but the cmp (or rather the add, since the cmp macro-fuses with the jne) needs one more. IACA reports a block throughput of 1.5. In practice this only gets about 57% of the peak.
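The 1.5 is consistent with a simple count: the plain load is one fused μop, the vfmadd231ps whose load does not fuse is two, the store (now with a simple address) is one, the add is one, and the cmp/jne pair macro-fuse into one, giving six fused-domain μops per iteration; six μops at four per clock is 1.5 clocks per iteration.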
But I found a way to get the vfmadd231ps instruction to fuse with the load as well. This can only be done using static arrays with addressing [absolute 32-bit address + index], like this. Evgeny Kluev originally suggested this.
vmovaps ymm1, [src1_end + rax]
vfmadd231ps ymm1, ymm2, [src2_end + rax]
vmovaps [dst_end + rax], ymm1
add rax, 32
jl .L2
Where src1_end, src2_end, and dst_end are the end addresses of static arrays.
This reproduces the table in my question with the four fused micro-ops that I expected. If you put this into IACA it reports a block throughput of 1.0. In theory this should do as well as the SSE and AVX versions. In practice it gets about 72% of the peak. That breaks the 66% barrier, but it's still a long way from the 92% I get unrolling 16 times. So on Haswell the only option to get close to the peak is to unroll. This is not necessary on Core2 through Ivy Bridge but it is on Haswell.
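For reference, here is roughly how the static-array version can be set up (a sketch only; the array names, the size n = 2048, and the function name are just for illustration, and as far as I understand the executable has to be linked non-PIE so the arrays get 32-bit absolute addresses):
section .data
align 64
src1: times 2048 dd 0.0
src1_end:
src2: times 2048 dd 0.0
src2_end:
dst:  times 2048 dd 0.0
dst_end:
pi:   dd 3.14159
section .text
global triad_fma_static
triad_fma_static:
vbroadcastss ymm2, [rel pi]
mov rax, -2048*4              ;negative byte offset, counts up to zero
.L2:
vmovaps ymm1, [src1_end + rax]
vfmadd231ps ymm1, ymm2, [src2_end + rax]
vmovaps [dst_end + rax], ymm1
add rax, 32
jl .L2
vzeroupper
ret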
End_edit:
Here is the C/C++ Linux code to test this. The NASM code is posted after the C/C++ code. The only thing you have to change is the frequency number: in the line double frequency = 1.3; replace 1.3 with whatever the operating (not nominal) frequency of your processor is (which, in the case of my i5-4250U with turbo disabled in the BIOS, is 1.3 GHz).
Compile with
nasm -f elf64 triad_sse_asm.asm
nasm -f elf64 triad_avx_asm.asm
nasm -f elf64 triad_fma_asm.asm
g++ -m64 -lrt -O3 -mfma tests.cpp triad_fma_asm.o -o tests_fma
g++ -m64 -lrt -O3 -mavx tests.cpp triad_avx_asm.o -o tests_avx
g++ -m64 -lrt -O3 -msse2 tests.cpp triad_sse_asm.o -o tests_sse
The C/C++ code
#include <x86intrin.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#define TIMER_TYPE CLOCK_REALTIME
extern "C" float triad_sse_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_sse_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_avx_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_avx_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_fma_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_fma_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);
#if (defined(__FMA__))
float triad_fma_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_fmadd_ps(k4, _mm256_load_ps(&y[i]), _mm256_load_ps(&x[i])));
        }
    }
}
#elif (defined(__AVX__))
float triad_avx_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
        }
    }
}
#else
float triad_sse_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m128 k4 = _mm_set1_ps(k);
        for(i=0; i<n; i+=4) {
            _mm_store_ps(&z[i], _mm_add_ps(_mm_load_ps(&x[i]), _mm_mul_ps(k4, _mm_load_ps(&y[i]))));
        }
    }
}
#endif
double time_diff(timespec start, timespec end)
{
    timespec temp;
    if ((end.tv_nsec-start.tv_nsec)<0) {
        temp.tv_sec = end.tv_sec-start.tv_sec-1;
        temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec-start.tv_sec;
        temp.tv_nsec = end.tv_nsec-start.tv_nsec;
    }
    return (double)temp.tv_sec + (double)temp.tv_nsec*1E-9;
}
int main () {
    int bytes_per_cycle = 0;
    double frequency = 1.3; //Haswell
    //double frequency = 3.6; //IB
    //double frequency = 2.66; //Core2
#if (defined(__FMA__))
    bytes_per_cycle = 96;
#elif (defined(__AVX__))
    bytes_per_cycle = 48;
#else
    bytes_per_cycle = 24;
#endif
    double peak = frequency*bytes_per_cycle;

    const int n = 2048;

    float* z2 = (float*)_mm_malloc(sizeof(float)*n, 64);
    char *mem = (char*)_mm_malloc(1<<18,4096);
    char *a = mem;
    char *b = a+n*sizeof(float);
    char *c = b+n*sizeof(float);

    float *x = (float*)a;
    float *y = (float*)b;
    float *z = (float*)c;

    for(int i=0; i<n; i++) {
        x[i] = 1.0f*i;
        y[i] = 1.0f*i;
        z[i] = 0;
    }

    int repeat = 1000000;
    timespec time1, time2;

    //compute the reference result z2 (also serves as a warm-up)
#if (defined(__FMA__))
    triad_fma_repeat(x,y,z2,n,repeat);
#elif (defined(__AVX__))
    triad_avx_repeat(x,y,z2,n,repeat);
#else
    triad_sse_repeat(x,y,z2,n,repeat);
#endif

    while(1) {
        double dtime, rate;

        clock_gettime(TIMER_TYPE, &time1);
#if (defined(__FMA__))
        triad_fma_asm_repeat(x,y,z,n,repeat);
#elif (defined(__AVX__))
        triad_avx_asm_repeat(x,y,z,n,repeat);
#else
        triad_sse_asm_repeat(x,y,z,n,repeat);
#endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("unroll1 rate %6.2f GB/s, efficiency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));

        clock_gettime(TIMER_TYPE, &time1);
#if (defined(__FMA__))
        triad_fma_repeat(x,y,z,n,repeat);
#elif (defined(__AVX__))
        triad_avx_repeat(x,y,z,n,repeat);
#else
        triad_sse_repeat(x,y,z,n,repeat);
#endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("intrinsic rate %6.2f GB/s, efficiency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));

        clock_gettime(TIMER_TYPE, &time1);
#if (defined(__FMA__))
        triad_fma_asm_repeat_unroll16(x,y,z,n,repeat);
#elif (defined(__AVX__))
        triad_avx_asm_repeat_unroll16(x,y,z,n,repeat);
#else
        triad_sse_asm_repeat_unroll16(x,y,z,n,repeat);
#endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("unroll16 rate %6.2f GB/s, efficiency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
    }
}
The NASM code using the System V AMD64 ABI.
triad_fma_asm.asm:
global triad_fma_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;z[i] = y[i] + 3.14159*x[i]
pi: dd 3.14159
;align 16
section .text
triad_fma_asm_repeat:
shl rcx, 2
add rdi, rcx
add rsi, rcx
add rdx, rcx
vbroadcastss ymm2, [rel pi]
;neg rcx
align 16
.L1:
mov rax, rcx
neg rax
align 16
.L2:
vmovaps ymm1, [rdi+rax]
vfmadd231ps ymm1, ymm2, [rsi+rax]
vmovaps [rdx+rax], ymm1
add rax, 32
jne .L2
sub r8d, 1
jnz .L1
vzeroupper
ret
global triad_fma_asm_repeat_unroll16
section .text
triad_fma_asm_repeat_unroll16:
shl rcx, 2
add rcx, rdi
vbroadcastss ymm2, [rel pi]
.L1:
xor rax, rax
mov r9, rdi
mov r10, rsi
mov r11, rdx
.L2:
%assign unroll 32
%assign i 0
%rep unroll
vmovaps ymm1, [r9 + 32*i]
vfmadd231ps ymm1, ymm2, [r10 + 32*i]
vmovaps [r11 + 32*i], ymm1
%assign i i+1
%endrep
add r9, 32*unroll
add r10, 32*unroll
add r11, 32*unroll
cmp r9, rcx
jne .L2
sub r8d, 1
jnz .L1
vzeroupper
ret
triad_avx_asm.asm:
global triad_avx_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
align 16
section .text
triad_avx_asm_repeat:
shl rcx, 2
add rdi, rcx
add rsi, rcx
add rdx, rcx
vbroadcastss ymm2, [rel pi]
;neg rcx
align 16
.L1:
mov rax, rcx
neg rax
align 16
.L2:
vmulps ymm1, ymm2, [rdi+rax]
vaddps ymm1, ymm1, [rsi+rax]
vmovaps [rdx+rax], ymm1
add rax, 32
jne .L2
sub r8d, 1
jnz .L1
vzeroupper
ret
global triad_avx_asm_repeat2
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;pi: dd 3.14159
align 16
section .text
triad_avx_asm_repeat2:
shl rcx, 2
vbroadcastss ymm2, [rel pi]
align 16
.L1:
xor rax, rax
align 16
.L2:
vmulps ymm1, ymm2, [rdi+rax]
vaddps ymm1, ymm1, [rsi+rax]
vmovaps [rdx+rax], ymm1
add eax, 32
cmp eax, ecx
jne .L2
sub r8d, 1
jnz .L1
vzeroupper
ret
global triad_avx_asm_repeat_unroll16
align 16
section .text
triad_avx_asm_repeat_unroll16:
shl rcx, 2
add rcx, rdi
vbroadcastss ymm2, [rel pi]
align 16
.L1:
xor rax, rax
mov r9, rdi
mov r10, rsi
mov r11, rdx
align 16
.L2:
%assign unroll 16
%assign i 0
%rep unroll
vmulps ymm1, ymm2, [r9 + 32*i]
vaddps ymm1, ymm1, [r10 + 32*i]
vmovaps [r11 + 32*i], ymm1
%assign i i+1
%endrep
add r9, 32*unroll
add r10, 32*unroll
add r11, 32*unroll
cmp r9, rcx
jne .L2
sub r8d, 1
jnz .L1
vzeroupper
ret
triad_sse_asm.asm:
global triad_sse_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
;align 16
section .text
triad_sse_asm_repeat:
shl rcx, 2
add rdi, rcx
add rsi, rcx
add rdx, rcx
movss xmm2, [rel pi]
shufps xmm2, xmm2, 0
;neg rcx
align 16
.L1:
mov rax, rcx
neg rax
align 16
.L2:
movaps xmm1, [rdi+rax]
mulps xmm1, xmm2
addps xmm1, [rsi+rax]
movaps [rdx+rax], xmm1
add rax, 16
jne .L2
sub r8d, 1
jnz .L1
ret
global triad_sse_asm_repeat2
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;pi: dd 3.14159
;align 16
section .text
triad_sse_asm_repeat2:
shl rcx, 2
movss xmm2, [rel pi]
shufps xmm2, xmm2, 0
align 16
.L1:
xor rax, rax
align 16
.L2:
movaps xmm1, [rdi+rax]
mulps xmm1, xmm2
addps xmm1, [rsi+rax]
movaps [rdx+rax], xmm1
add eax, 16
cmp eax, ecx
jne .L2
sub r8d, 1
jnz .L1
ret
global triad_sse_asm_repeat_unroll16
section .text
triad_sse_asm_repeat_unroll16:
shl rcx, 2
add rcx, rdi
movss xmm2, [rel pi]
shufps xmm2, xmm2, 0
.L1:
xor rax, rax
mov r9, rdi
mov r10, rsi
mov r11, rdx
.L2:
%assign unroll 8
%assign i 0
%rep unroll
movaps xmm1, [r9 + 16*i]
mulps xmm1, xmm2
addps xmm1, [r10 + 16*i]
movaps [r11 + 16*i], xmm1
%assign i i+1
%endrep
add r9, 16*unroll
add r10, 16*unroll
add r11, 16*unroll
cmp r9, rcx
jne .L2
sub r8d, 1
jnz .L1
ret
Comments:
… double rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;. The percentage is 100*rate/peak. The peak is frequency*96, which in my case is 1.3*96 = 124.8 billion bytes/sec. The 96 is 32*2 byte reads + 32*1 byte write. – Exterior
… vfmadd231ps ymm1, ymm2, [rsi+rax] with vaddps ymm1, [rsi+rax], and even with vmovaps ymm1, [rsi+rax] (!). The latter proves the issue isn't due to dependencies but due to decoding or quirks of execution. The core isn't overloaded - it has 2 vector reads, 1 vector write, a scalar ALU op and a branch. Perhaps uop fusion doesn't occur. Interestingly, but probably not relevant here, is that a predicted-taken branch takes 1-2 CC whereas a predicted not-taken takes 0.5-1 CC. – Fourteen
… vaddps. That's why I unrolled eight times here stackoverflow.com/questions/25774190/…. It's basically the same thing but using addition instead of fma. You could also try vmulps but it won't make a difference: vfmaddxxxps, vmulps, or vaddps all have the same result in performance. I mean they each get a little less than 64 bytes/cycle instead of 96 bytes/cycle. – Exterior
… add and jne) shows a noticeable performance drop. – Childers
… [off] has almost ideal performance (but is useless), and [r+r] is much slower. I have no explanation for all this, but knowing these facts allows some optimizations: the optimal array size is about 16KB (which is slightly out of L1 for 3 arrays) and the best access mode is [r+off]. This code has ~75% efficiency (if not unrolled). Compile it as in the OP, link with ld. – Childers
… [off] and [r+r]. I tried unrolling using [r+r] with the add but without the jump, but it never breaks 66%. I only get better than 66% with [r+offset] (but not with [base+mult*index+offset]). That's strange because [r+offset] produces longer instructions than [r+r] (7 bytes vs. 6). And thanks for the code. I learned a lot by reading it (I'm new to x86 assembly and NASM). I'll give it a try soon. – Exterior
… rcx, 2539062 ; 1.3GHz -> 1 sec/clock. This finishes in 1.375 seconds. That's about 73%. I like your solution skipping C. How do you time it? Given an executable evgengy I did time ./evgengy. Now I need some time to diagnose your code further. – Exterior
… [off] in your code? If I just read and write to the same offset it takes 5.6 seconds (which I think is due to read-after-write stalls). – Exterior
… time utility to measure execution time. Only before computing the inverse I subtract 0.05 sec for load+warmup+etc. To get the exact value to subtract, just run the program with an empty loop (only add/jne). By [off] I mean offset. Definitely I write to some [off] but then read from [off+64] to avoid stalls. BTW it's possible to improve the unroll method and gain 2-3% more performance. This code gives 88% if unrolled 2*2 times and 93% if unrolled 4*2 times. – Childers
… [offset+64] on the write. Now I see what you mean. Though I get about 90% doing that: 1/(1.173-0.05). Sounds like you're making good progress. I'm not completely opposed to some unrolling. I just want to understand why it's necessary and whether it's possible to do better than 93% or so. – Exterior
… n on AVX. However, last time I tried it on Haswell with FMA it got less than 90%. I infer that means that even Intel is struggling to get high efficiency on Haswell. In other bandwidth tests that I do with reductions (e.g. sum*=x[i]) I get only about 70%. – Exterior
… rdi, rsi, rdx) and use the repeat register rdc. – Exterior
… eff = (3*unroll)/(3*unroll+4). So for unrolling 16 times the efficiency should be 92%. Eight times should be about 86%. Evgeny got 87% unrolling eight times at stackoverflow.com/questions/25774190/…. So I'm not totally sure my formula is correct but it seems about right. – Exterior
… triad_fma_asm_repeat it just sees the instruction for a function call. How did you do this? – Exterior
… db 0Dh, 0Fh, ...... at the relevant spot. – Fourteen
… [r+r] but even when the store is [r+offset] the loads don't do micro-op fusion with [r+r]. The only solution which gets 4 micro-ops per cycle without unrolling is to use static arrays with [r+offset]. – Exterior
… [r+r] for loads as we're doing right now, but make only the destination array static? Also, you're actually sustaining 5.125 uops/CC right now: 164/32 = 5.125. – Fourteen
… [r+r] with the loads and a static array only for the store with [r+offset]. The fma and load still do not fuse. So even after you fix the store there is still the problem of fusing the FMA and load. So I think the only solution is [r+offset] for the loads and the stores. I'll try and look into that today. – Exterior
… void not float – Uke