arm64 assembly: LDP vs. LD4 execution time - McMap

Asked 4/7, 2020 at 22:7 Answered 4/7, 2020 at 22:7

performance assembly arm simd arm64

Suppose I want to load four consecutive aarch64 vector registers with values from consecutive memory locations. One way to do this is

ldp   q0, q1, [x0]
ldp   q2, q3, [x0, 32]

According to the ARM optimization guide for the Cortex-A72 (my target processor), each of these two instructions takes 6 cycles of execution time on the L-pipeline, for a total of 12 cycles.

But I can also use a load with interleaving, which allows me to load all 4 registers at once:

ld4   {v0.2d, v1.2d, v2.2d, v3.2d}, [x0]

This also saves code size and should need only 8 cycles of execution time in total, according to the above guide.

I know that interleaving means the data ends up arranged differently in my registers, but assume that my later use can handle both interleaved and non-interleaved data (for example, summing an array).
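To make the layout difference concrete, here is a minimal sketch in Python (not assembly) that models where eight consecutive 64-bit elements land for the `.2d` arrangement. The function names `load_consecutive` and `load_ld4` are hypothetical, chosen only for this illustration. The model covers lane placement only and says nothing about timing.

```python
def load_consecutive(mem):
    """Model two 'ldp q, q' loads: register r gets elements 2r and 2r+1."""
    assert len(mem) == 8
    return [mem[2 * r: 2 * r + 2] for r in range(4)]


def load_ld4(mem):
    """Model 'ld4 {v0.2d-v3.2d}': element i of each 4-element structure
    goes to register i, so register r holds mem[r] and mem[r + 4]."""
    assert len(mem) == 8
    return [[mem[r], mem[r + 4]] for r in range(4)]


def total(regs):
    """Order-independent reduction, e.g. summing an array."""
    return sum(sum(reg) for reg in regs)


mem = [10, 11, 12, 13, 14, 15, 16, 17]
plain = load_consecutive(mem)
deinterleaved = load_ld4(mem)

print(plain)          # [[10, 11], [12, 13], [14, 15], [16, 17]]
print(deinterleaved)  # [[10, 14], [11, 15], [12, 16], [13, 17]]
print(total(plain) == total(deinterleaved))  # True
```

The two layouts hold the same elements in a different order, so a reduction such as a sum is unaffected, while anything that depends on element position would not be.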

Is LD4 really faster than two LDPs here, as the theoretical execution timings suggest? The same question could of course be asked about STP and ST4. Perhaps someone here has already benchmarked this.

(And do I even interpret the timings correctly?)

Engird answered 4/7, 2020 at 22:7 Comment(5)

As a rule of thumb: if you can do the same thing with fewer instructions, the code with fewer instructions is likely to be faster. – Izabel 4/7, 2020 at 23:53

Visiting this question again, why don't you use ld1 {v0.2d, v1.2d, v2.2d, v3.2d}, [x0] instead? That does the same as your ldp pair but doesn't interleave. – Izabel 13/8, 2020 at 12:58

According to the manual you linked, this ld1 instruction takes a total of 8 cycles which is 4 cycles less than two ldp. Should be good. – Izabel 13/8, 2020 at 13:7

@Izabel Thanks for the comment, ld1 seems to work too (for all the 16b/8h/4s/2d variants). – Engird 13/8, 2020 at 14:40

Wait a second, 6 cycles is the latency for ldp. But the two loads are independent, so their latencies don't add. The manual says the throughput is 1 per 2 cycles, so it could be that the second ldp can begin executing 2 cycles after the first one, for a total latency of 8 cycles, matching ld1. Still, there doesn't seem to be any way for ld1 to be higher latency, and they are the same in terms of throughput (ldp = 1 per 2 cycles, ld1 = 1 per 4 cycles), and ld1 is shorter code, so it still wins. – Aplanatic 20/2, 2022 at 0:52
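The arithmetic in the last comment can be sketched as a small Python model. It assumes the figures quoted in this thread (ldp: latency 6, one issued every 2 cycles; 4-register ld1: latency 8, one issued every 4 cycles) and a deliberately simple pipeline in which independent loads issue back to back. The function names are hypothetical, and none of this is a measurement.

```python
def completion_cycle(n_loads, latency, issue_interval):
    """Cycle at which the last of n independent loads delivers its data,
    assuming the first issues at cycle 0 and the rest follow back to back."""
    last_issue = (n_loads - 1) * issue_interval
    return last_issue + latency


def pipe_cycles_per_block(n_loads, issue_interval):
    """Cycles of load-pipeline occupancy needed per 64-byte block."""
    return n_loads * issue_interval


# Figures as quoted in the comments above (assumed, not re-verified here).
two_ldp_latency = completion_cycle(2, latency=6, issue_interval=2)
one_ld1_latency = completion_cycle(1, latency=8, issue_interval=4)

print(two_ldp_latency, one_ld1_latency)  # 8 8
print(pipe_cycles_per_block(2, 2), pipe_cycles_per_block(1, 4))  # 4 4
```

Under these assumptions the two sequences tie on both latency and throughput, which leaves code size as the remaining difference. Only a benchmark on real hardware can confirm this.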
