While I was working on my fast ADD loop (Speed up x64 assembler ADD loop), I was testing memory access with SSE and AVX instructions. To add I have to read two inputs and produce one output. So I wrote a dummy routine that reads two x64 values into registers and write one back to memory without doing any operation. This is of course useless, I only did it for benchmarking.
I use an unrolled loop that handles 64 bytes per loop. It is comprised of 8 blocks like this:
mov rax, QWORD PTR [rdx+r11*8-64]
mov r10, QWORD PTR [r8+r11*8-64]
mov QWORD PTR [rcx+r11*8-64], rax
Then I upgraded it to SSE2. Now I use 4 blocks like this:
movdqa xmm0, XMMWORD PTR [rdx+r11*8-64]
movdqa xmm1, XMMWORD PTR [r8+r11*8-64]
movdqa XMMWORD PTR [rcx+r11*8-64], xmm0
And later on I used AVX (256 bit per register). I have 2 blocks like this:
vmovdqa ymm0, YMMWORD PTR [rdx+r11*8-64]
vmovdqa ymm1, YMMWORD PTR [r8+r11*8-64]
vmovdqa YMMWORD PTR [rcx+r11*8-64], ymm0
So far, so not-so-extremely-spectacular. What is interesting is the benchmarking result: When I run the three different approaches on 1k+1k=1k 64-bit words (i.e. two times 8 kb of input and one time 8kb of output) I get strange results. Each of the following timings is for processing two times 64 bytes input into 64 bytes of output.
- The x64 register method runs at about 15 cycles/64 bytes
- The SSE2 method runs at about 8.5 cycles/64 bytes
- The AVX method runs at about 9 cycles/64 bytes
My question is: how come the AVX method is slower (though not a lot) than the SSE2 method? I expected it to be at least on par. Does using the YMM registers cost so much extra time? The memory was aligned (you get GPF's otherwise).
Does anyone have an explanation for this?