AVX VMOVDQA slower than two SSE MOVDQA?

While I was working on my fast ADD loop (Speed up x64 assembler ADD loop), I was testing memory access with SSE and AVX instructions. To add I have to read two inputs and produce one output. So I wrote a dummy routine that reads two x64 values into registers and write one back to memory without doing any operation. This is of course useless, I only did it for benchmarking.

I use an unrolled loop that handles 64 bytes per loop. It is comprised of 8 blocks like this:

mov rax, QWORD PTR [rdx+r11*8-64]
mov r10, QWORD PTR [r8+r11*8-64]
mov QWORD PTR [rcx+r11*8-64], rax

Then I upgraded it to SSE2. Now I use 4 blocks like this:

movdqa xmm0, XMMWORD PTR [rdx+r11*8-64]
movdqa xmm1, XMMWORD PTR [r8+r11*8-64]
movdqa XMMWORD PTR [rcx+r11*8-64], xmm0

And later on I used AVX (256 bit per register). I have 2 blocks like this:

vmovdqa ymm0, YMMWORD PTR [rdx+r11*8-64]
vmovdqa ymm1, YMMWORD PTR [r8+r11*8-64]
vmovdqa YMMWORD PTR [rcx+r11*8-64], ymm0

So far, so not-so-extremely-spectacular. What is interesting is the benchmarking result: When I run the three different approaches on 1k+1k=1k 64-bit words (i.e. two times 8 kb of input and one time 8kb of output) I get strange results. Each of the following timings is for processing two times 64 bytes input into 64 bytes of output.

The x64 register method runs at about 15 cycles/64 bytes
The SSE2 method runs at about 8.5 cycles/64 bytes
The AVX method runs at about 9 cycles/64 bytes

My question is: how come the AVX method is slower (though not a lot) than the SSE2 method? I expected it to be at least on par. Does using the YMM registers cost so much extra time? The memory was aligned (you get GPF's otherwise).

Does anyone have an explanation for this?

On Sandybridge/Ivybridge, 256b AVX loads and stores are cracked into two 128b ops [as Peter Cordes notes, these aren't quite µops, but it requires two cycles for the operation to clear the port] in the load/store execution units, so there's no reason to expect the version using those instructions to be much faster.

Why is it slower? Two possibilities come to mind:

for base + index + offset addressing, the latency of a 128b load is 6 cycles, whereas the latency of a 256b load is 7 cycles (Table 2-8 in the Intel Optimization Manual). Although your benchmark should be bound by thoughput and not latency, the longer latency means that the processor takes longer to recover from any hiccups (pipeline bubbles or prediction misses or interrupt servicing or ...), which does have some impact.
in 11.6.2 of the same document, Intel suggests that the penalty for cache line and page crossing may be larger for 256b loads than it is for 128b loads. If your loads are not all 32-byte aligned, this may also explain the slowdown you are seeing when using the 256b load/store operations:

Example 11-12 shows two implementations for SAXPY with unaligned addresses. Alternative 1 uses 32 byte loads and alternative 2 uses 16 byte loads. These code samples are executed with two source buffers, src1, src2, at 4 byte offset from 32- Byte alignment, and a destination buffer, DST, that is 32-Byte aligned. Using two 16- byte memory operations in lieu of 32-byte memory access performs faster.

Recommended topics

Hot tags