Suppose I want to load four consecutive aarch64 vector registers with values from consecutive memory locations. One way to do this is:
ldp q0, q1, [x0]
ldp q2, q3, [x0, 32]
According to the ARM optimization guide for Cortex A72 (my target processor) each of these two instructions takes 6 cycles of execution time on the L-pipeline, for a total of 12 cycles.
But I can also use a load with interleaving, which allows me to load all 4 registers at once:
ld4 {v0.2d, v1.2d, v2.2d, v3.2d}, [x0]
This also saves code size and should need only 8 cycles of execution time in total, according to the above guide.
I know that interleaving means that the data is stored differently in my registers, but it should be assumed that my later use can handle both interleaved and non-interleaved data. (For example, summing an array.)
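To make the layout difference concrete, here is a small scalar model in Python. It is not NEON code, only a sketch of where each 64-bit element ends up. The function names (`load_consecutive`, `load_ld4_2d`, `total`) are made up for illustration, and the model assumes the `.2d` arrangement used above, where `ld4` places element `4*lane + r` of memory into lane `lane` of register `r`.

```python
# Scalar model of how the two load styles place eight consecutive
# 64-bit elements into four 2-lane registers (v0..v3).
# All names here are hypothetical; this is an illustration, not NEON code.

def load_consecutive(mem):
    """Model of two ldp instructions: register r holds elements 2r and 2r+1."""
    return [[mem[2 * r], mem[2 * r + 1]] for r in range(4)]

def load_ld4_2d(mem):
    """Model of ld4 with .2d lanes: register r, lane j holds element 4*j + r."""
    return [[mem[4 * lane + r] for lane in range(2)] for r in range(4)]

def total(regs):
    """Sum every lane of every register (an order-insensitive reduction)."""
    return sum(sum(reg) for reg in regs)

mem = [10, 11, 12, 13, 14, 15, 16, 17]
consecutive = load_consecutive(mem)
deinterleaved = load_ld4_2d(mem)

print(consecutive)    # [[10, 11], [12, 13], [14, 15], [16, 17]]
print(deinterleaved)  # [[10, 14], [11, 15], [12, 16], [13, 17]]
print(total(consecutive) == total(deinterleaved))  # True
```

The registers hold different elements, but a reduction that does not care about element order, such as a sum, gives the same result either way.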
Is LD4 really faster than two LDPs here, as the theoretical execution timings suggest? The same question applies to STP and ST4. Perhaps someone here has already benchmarked this.
(And do I even interpret the timings correctly?)
Why not use ld1 {v0.2d, v1.2d, v2.2d, v3.2d}, [x0] instead? That does the same as your ldp pair but doesn't interleave. – Izabel
The ld1 instruction takes a total of 8 cycles, which is 4 cycles less than two ldp. Should be good. – Izabel
ld1 seems to work too (for all the 16b/8h/4s/2d variants). – Engird
[...] ldp. But the two loads are independent, so their latencies don't add. The manual says the throughput is 1 per 2 cycles, so it could be that the second ldp can begin executing 2 cycles after the first one, for a total latency of 8 cycles, matching ld1. Still, there doesn't seem to be any way for ld1 to be higher latency, and they are the same in terms of throughput (ldp = 1 per 2 cycles, ld1 = 1 per 4 cycles), and ld1 is shorter code, so it still wins. – Aplanatic
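The arithmetic in the last comment can be written out as a rough sketch. It assumes a very simple pipelined model in which independent loads issue back to back at the quoted throughput and each completes one latency after it issues. The real Cortex-A72 scheduler may behave differently, so this is not a measurement. The numbers are the ones quoted in this thread, and the helper name `completion_cycle` is made up for illustration.

```python
# Simplified pipelined-issue model: independent instructions start every
# `issue_interval` cycles and each finishes `latency` cycles after it starts.
# This is an assumption for illustration, not a description of the real core.

def completion_cycle(count, latency, issue_interval):
    """Cycle at which the last of `count` independent loads completes."""
    return (count - 1) * issue_interval + latency

# Adding the latencies, as in the question (treats the loads as dependent).
two_ldp_serial = 2 * 6

# Overlapping the loads: ldp latency 6, one issued every 2 cycles.
two_ldp = completion_cycle(2, latency=6, issue_interval=2)

# A single four-register ld1: latency 8, one issued every 4 cycles.
one_ld1 = completion_cycle(1, latency=8, issue_interval=4)

# Sustained cost per 64-byte block when many blocks are loaded in a row.
ldp_cycles_per_block = 2 * 2  # two ldp, each occupying 2 issue cycles
ld1_cycles_per_block = 1 * 4  # one ld1 occupying 4 issue cycles

print(two_ldp_serial, two_ldp, one_ld1)            # 12 8 8
print(ldp_cycles_per_block, ld1_cycles_per_block)  # 4 4
```

Under these assumptions the ldp pair and the single ld1 tie on both latency and throughput, so the remaining difference is code size. Only a benchmark on real hardware can confirm this.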