arm64 assembly: LDP vs. LD4 execution time

Suppose I want to load four consecutive aarch64 vector registers with values from consecutive memory locations. One way to do this is

ldp   q0, q1, [x0]
ldp   q2, q3, [x0, 32]

According to the ARM optimization guide for the Cortex-A72 (my target processor), each of these two instructions takes 6 cycles of execution time on the L pipeline, for a total of 12 cycles.

But I can also use a load with interleaving, which allows me to load all 4 registers at once:

ld4   {v0.2d, v1.2d, v2.2d, v3.2d}, [x0]

This also saves code size and should need only 8 cycles of execution time in total, according to the above guide.

I know that interleaving means that the data is stored differently in my registers, but it should be assumed that my later use can handle both interleaved and non-interleaved data. (For example, summing an array.)

Is LD4 really faster than two LDPs here, as the theoretical execution timings suggest? The same question could of course be asked about STP and ST4. Perhaps someone here has already run benchmarks on this.

(And do I even interpret the timings correctly?)

Engird answered 4/7, 2020 at 22:7 Comment(5)
As a rule of thumb: if you can do the same thing with fewer instructions, the code with fewer instructions is likely to be faster. - Izabel
Visiting this question again, why don't you use ld1 {v0.2d, v1.2d, v2.2d, v3.2d}, [x0] instead? That does the same as your ldp pair but doesn't interleave. - Izabel
According to the manual you linked, this ld1 instruction takes a total of 8 cycles, which is 4 cycles fewer than two ldp. Should be good. - Izabel
@Izabel Thanks for the comment, ld1 seems to work too (for all the 16b/8h/4s/2d variants). - Engird
Wait a second, 6 cycles is the latency for ldp. But the two loads are independent, so their latencies don't add. The manual says the throughput is 1 per 2 cycles, so it could be that the second ldp can begin executing 2 cycles after the first one, for a total latency of 8 cycles, matching ld1. Still, there doesn't seem to be any way for ld1 to have higher latency, and the two are the same in terms of throughput (ldp = 1 per 2 cycles, ld1 = 1 per 4 cycles, so 64 bytes per 4 cycles either way), and ld1 is shorter code, so it still wins. - Aplanatic
