Is mov r64, m64 one cycle or two cycle latency?

I'm on Ivy Bridge, and I wrote the following simple program to measure the latency of mov:

section .bss
align   64
buf:    resb    64

section .text
global _start
_start:
    mov rcx,    1000000000
    xor rax,    rax
loop:
    mov rax,    [buf+rax]

    dec rcx
    jne loop

    xor rdi,    rdi
    mov rax,    60
    syscall

perf shows this result:

 5,181,691,439      cycles

So each iteration takes about 5 cycles. According to several online sources, the L1 cache latency is 4 cycles. Therefore the latency of mov itself should be 1.

However, Agner Fog's instruction tables show that mov r64, m64 has a 2-cycle latency on Ivy Bridge. I don't know of any other place to find this latency.

Did I make a mistake in the measurement program above? Why does it show a mov latency of 1 rather than 2?

(I got the same result with the L2 cache: if buf+rax misses in L1 but hits in L2, a similar measurement shows that mov rax, [buf+rax] has a 12-cycle latency. Ivy Bridge's L2 cache has an 11-cycle latency, so the mov latency is still 1 cycle.)

Therefore the latency of mov itself should be 1.

No, the mov is the load. There is no separate ALU mov operation that the data also has to go through.

Agner Fog's instruction tables don't contain the load-use latency (which is what you're measuring). That number is in his microarch PDF, in the tables in the "cache and memory access" section for each uarch. For example, SnB/IvB (Section 9.13) has a "Level 1 data" row with "32 kB, 8 way, 64 B line size, latency 4, per core".

This 4-cycle latency is the load-use latency for a chain of dependent instructions like mov rax, [rax]. You're measuring 5 cycles because you're using an addressing mode other than [reg + 0..2047]. With small displacements, the load unit speculates that using the base register directly as the input to the TLB lookup will give the same result as using the adder result. (See the related Q&A "Is there a penalty when base+offset is in a different page than the base?".) Your addressing mode, [disp32 + rax], does not qualify for this fast path, so it uses the normal path, waiting one more cycle for the adder result before starting the TLB lookup in the load port.
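The two kinds of dependency chain can be sketched in C. This is only an illustration under assumptions: the function names chase_pointers and chase_indices are made up here, the timing harness is left out, and a compiler is free to pick a different addressing mode than the one suggested in the comments, so check the generated asm before trusting any measurement made this way.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer-chasing chain: every load address is the result of the previous
   load. A compiler will typically emit something like mov rax, [rax] here,
   but that is not guaranteed. */
void *chase_pointers(void *start, size_t steps) {
    void *p = start;
    assert(p != NULL);
    for (size_t i = 0; i < steps; i++) {
        p = *(void **)p;          /* next address comes from memory */
    }
    return p;
}

/* Index-chasing chain (n = table[n]): the address is base + index*8, so the
   load needs a base plus scaled-index addressing mode instead of a plain
   [reg]. */
size_t chase_indices(const size_t *table, size_t start, size_t steps) {
    size_t n = start;
    assert(table != NULL);
    for (size_t i = 0; i < steps; i++) {
        n = table[n];             /* next index comes from memory */
    }
    return n;
}
```

Timing each loop over a buffer that stays hot in L1d and dividing the cycle count by steps would give a per-load latency figure for that addressing mode.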

For most operations between different domains (like integer registers and XMM registers), you can only really measure a round trip like movd xmm0, eax / movd eax, xmm0, and it's hard to pick that apart and figure out the latency of each instruction separately¹.

For loads, you can chain to another load to measure cache load-use latency, instead of a chain of store/reload.

Agner for some reason decided to only look at store-forwarding latency for his tables, and to make a totally arbitrary choice of how to split up the store-forwarding latency between the store and the reload.

(from the "definition of terms" sheet of his instruction table spreadsheet, way at the left after the Introduction)

It is not possible to measure the latency of a memory read or write instruction with software methods. It is only possible to measure the combined latency of a memory write followed by a memory read from the same address. What is measured here is not actually the cache access time, because in most cases the microprocessor is smart enough to make a "store forwarding" directly from the write unit to the read unit rather than waiting for the data to go to the cache and back again. The latency of this store forwarding process is arbitrarily divided into a write latency and a read latency in the tables. But in fact, the only value that makes sense to performance optimization is the sum of the write time and the read time.

This is obviously incorrect: L1d load-use latency is a thing for pointer-chasing through levels of indirection. You could argue that it's simply variable because some loads can miss in cache, but if you're going to pick something to put in your table you might as well pick the L1d load-use latency. And then calculate the store latency numbers such that store+load latency = store-forwarding latency like now. Intel Atom would then have store latency = -2, because it has 3c L1d load-use latency, but 1c store-forwarding according to Agner's uarch guide.
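The split suggested above is plain arithmetic, sketched here in C. The function name implied_store_latency is hypothetical, and the Atom numbers are the ones quoted above from Agner's uarch guide.

```c
#include <assert.h>

/* If the table's load latency is pinned to the L1d load-use latency, the
   store latency is whatever is left over so that
   store + load = store-forwarding latency. */
int implied_store_latency(int store_forwarding_cycles, int l1d_load_use_cycles) {
    assert(l1d_load_use_cycles > 0);
    return store_forwarding_cycles - l1d_load_use_cycles;
}
```

With the Atom figures (1c store-forwarding, 3c load-use), implied_store_latency(1, 3) gives the -2 mentioned above.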

This is less easy for loads into XMM or YMM registers, for example, but still possible once you work out the latency of movq rax, xmm0. It's harder for x87 registers, because there's no way to directly get the data from st0 into eax/rax through the ALU, instead of a store/reload. But perhaps you could do something with an FP compare like fucomi that sets integer FLAGS directly (on CPUs that have it: P6 and later).

Still, it would have been a lot better for at least the integer load latency to reflect pointer-chasing latency. IDK if anyone's offered to update Agner's tables for him, or if he'd accept such an update. It would take fresh testing on most uarches to be sure you had the right load-use latency for different register sets, though.

footnote 1: For example, http://instlatx64.atw.hu doesn't try, and just says "diff. reg. set" in the latency column, with useful data only in the throughput column. But they have lines for the MOVD r64, xmm+MOVD xmm, r64 round trip, in this case 2 cycles total on IvB so we can be pretty confident they're only 1c each way. Not zero one way. :P

But for loads into integer registers, they do show IvB's 4-cycle load-use latency for MOV r32, [m32], because apparently they test with a [reg + 0..2047] addressing mode.

https://uops.info/ is quite good, but gives pretty loose bounds on latency: IIRC, they construct a loop with a round trip (e.g. store and reload, or xmm->integer and integer->xmm), and then give an upper bound on latency, assuming that every other step took only 1 cycle. See "What do multiple values or ranges means as the latency for a single instruction?" for more.

Other sources of cache-latency info:

https://www.7-cpu.com/ has good details for lots of other uarches, even many non-x86 like ARM, MIPS, PowerPC, and IA-64.

The pages have other details like cache and TLB sizes, TLB timing, branch miss experiment results, and memory bandwidth. The cache latency details look like this:

(from their Skylake page)

L1 Data Cache Latency = 4 cycles for simple access via pointer

L1 Data Cache Latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).

L2 Cache Latency = 12 cycles

L3 Cache Latency = 42 cycles (core 0) (i7-6700 Skylake 4.0 GHz)

L3 Cache Latency = 38 cycles (i7-7700K 4 GHz, Kaby Lake)

RAM Latency = 42 cycles + 51 ns (i7-6700 Skylake)
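A figure like "42 cycles + 51 ns" mixes core clock cycles with wall-clock time, so turning it into one number needs a clock frequency. A minimal C sketch, assuming the 4.0 GHz core clock quoted in the L3 line above also applies to the RAM line (the function name latency_ns is made up for this example):

```c
#include <assert.h>

/* Convert "N cycles + M ns" into nanoseconds at a given core clock.
   cycles / GHz = nanoseconds, because 1 GHz is one cycle per ns. */
double latency_ns(double cycles, double extra_ns, double core_ghz) {
    assert(core_ghz > 0.0);
    return cycles / core_ghz + extra_ns;
}
```

Under that assumption, latency_ns(42.0, 51.0, 4.0) works out to 10.5 ns + 51 ns = 61.5 ns. At a lower core clock the cycle part takes longer while the 51 ns part stays the same.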
