Loop takes more cycles to execute than expected in an ARM Cortex-A72 CPU
Consider the following code, running on an ARM Cortex-A72 processor (optimization guide here). I have included what I expect are resource pressures for each execution port:

Instruction                         Port pressure
.LBB0_1:
ldr    q3, [x1], #16                I0: 0.5, I1: 0.5, L: 1
ldr    q4, [x2], #16                I0: 0.5, I1: 0.5, L: 1
add    x8, x8, #4                   I0: 0.5, I1: 0.5
cmp    x8, #508                     I0: 0.5, I1: 0.5
mul    v5.4s, v3.4s, v4.4s          F0: 2
mul    v5.4s, v5.4s, v0.4s          F0: 2
smull  v6.2d, v5.2s, v1.2s          F0: 1
smull2 v5.2d, v5.4s, v2.4s          F0: 1
smlal  v6.2d, v3.2s, v4.2s          F0: 1
smlal2 v5.2d, v3.4s, v4.4s          F0: 1
uzp2   v3.4s, v6.4s, v5.4s          F1: 1
str    q3, [x0], #16                I0: 0.5, I1: 0.5, S: 1
b.lo   .LBB0_1                      B: 1
Total port pressure                 B: 1, I0: 2.5, I1: 2.5, M: 0, L: 2, S: 1, F0: 8, F1: 1

Although uzp2 could run on either the F0 or F1 ports, I chose to attribute it entirely to F1 due to high pressure on F0 and zero pressure on F1 other than this instruction.

There are no dependencies between loop iterations, other than the loop counter and array pointers; and these should be resolved very quickly, compared to the time taken for the rest of the loop body.
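
For context, the loop body corresponds to something like the following NEON intrinsics. This is a hedged reconstruction, not the actual source (which isn't shown here); the roles of the constant vectors v0/v1/v2 (k0/k1/k2 below) and the function signature are assumptions.

    #include <arm_neon.h>
    #include <stdint.h>

    void kernel(int32_t *dst, const int32_t *src1, const int32_t *src2,
                intptr_t n, int32x4_t k0, int32x4_t k1, int32x4_t k2)
    {
        for (intptr_t i = 0; i < n; i += 4) {
            int32x4_t a = vld1q_s32(src1 + i);                     // ldr q3
            int32x4_t b = vld1q_s32(src2 + i);                     // ldr q4
            int32x4_t t = vmulq_s32(a, b);                         // mul v5, v3, v4
            t = vmulq_s32(t, k0);                                  // mul v5, v5, v0
            int64x2_t lo = vmull_s32(vget_low_s32(t),
                                     vget_low_s32(k1));            // smull
            int64x2_t hi = vmull_high_s32(t, k2);                  // smull2
            lo = vmlal_s32(lo, vget_low_s32(a), vget_low_s32(b));  // smlal
            hi = vmlal_high_s32(hi, a, b);                         // smlal2
            int32x4_t r = vuzp2q_s32(vreinterpretq_s32_s64(lo),
                                     vreinterpretq_s32_s64(hi));   // uzp2: keep the high 32 bits
            vst1q_s32(dst + i, r);                                 // str q3
        }
    }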

Thus, my intuition is that this code should be throughput-limited and, since the worst pressure is on F0, run in 8 cycles per iteration (unless it hits a decoding bottleneck or cache misses). The latter is unlikely given the streaming access pattern and the fact that the arrays comfortably fit in the L1 cache. As for the former, considering the constraints listed in section 4.1 of the optimization manual, I project that the loop body is decodable in only 8 cycles.

Yet microbenchmarking indicates that each iteration of the loop body takes 12.5 cycles on average. If no other plausible explanation exists, I may edit the question to include further details about how I benchmarked this code, but I'm fairly certain the difference can't be attributed to benchmarking artifacts alone. I have also tried increasing the number of iterations to see if performance improved towards an asymptotic limit due to startup/cool-down effects, but it appears to have converged already at the 128 iterations used above.

Manually unrolling the loop to include two calculations per iteration decreased performance to 13 cycles; note, however, that this also doubles the number of load and store instructions. Interestingly, if the doubled loads and stores are instead replaced by single LD1/ST1 instructions (two-register format) (e.g. ld1 { v3.4s, v4.4s }, [x1], #32), then performance improves to 11.75 cycles per iteration. Further unrolling the loop to four calculations per iteration, while using the four-register format of LD1/ST1, improves performance to 11.25 cycles per iteration.

In spite of the improvements, the performance is still far away from the 8 cycles per iteration that I expected from looking at resource pressures alone. Even if the CPU made a bad scheduling call and issued uzp2 to F0, revising the resource pressure table would indicate 9 cycles per iteration, still far from actual measurements. So, what's causing this code to run so much slower than expected? What kind of effects am I missing in my analysis?

EDIT: As promised, some more benchmarking details. I run the loop 3 times for warmup, 10 times for say n = 512, and then 10 times for n = 256. I take the minimum cycle count for the n = 512 runs and subtract from the minimum for n = 256. The difference should give me how many cycles it takes to run for n = 256, while canceling out the fixed setup cost (code not shown). In addition, this should ensure all data is in the L1 I and D cache. Measurements are taken by reading the cycle counter (pmccntr_el0) directly. Any overhead should be canceled out by the measurement strategy above.
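
To make the methodology concrete, here is a minimal sketch of the measurement scheme described above. The helper and harness names are hypothetical, the warmup runs are omitted, and user-space access to PMCCNTR_EL0 is assumed to have been enabled (e.g. via a kernel module); otherwise the MRS instruction traps. kernel() refers to the reconstructed loop shown earlier.

    #include <arm_neon.h>
    #include <stdint.h>

    // Reconstructed loop from earlier (prototype only).
    void kernel(int32_t *dst, const int32_t *src1, const int32_t *src2,
                intptr_t n, int32x4_t k0, int32x4_t k1, int32x4_t k2);

    static inline uint64_t read_cycles(void)
    {
        uint64_t c;
        __asm__ volatile("mrs %0, pmccntr_el0" : "=r"(c));  // read the cycle counter directly
        return c;
    }

    // Minimum cycle count over 10 runs of the kernel for a given n.
    static uint64_t min_cycles(int32_t *dst, const int32_t *s1, const int32_t *s2,
                               intptr_t n, int32x4_t k0, int32x4_t k1, int32x4_t k2)
    {
        uint64_t best = UINT64_MAX;
        for (int r = 0; r < 10; r++) {
            uint64_t t0 = read_cycles();
            kernel(dst, s1, s2, n, k0, k1, k2);
            uint64_t t = read_cycles() - t0;
            if (t < best) best = t;
        }
        return best;
    }

    // After warmup, the cost of processing 256 elements is estimated as
    // min_cycles(..., 512, ...) - min_cycles(..., 256, ...), which cancels the
    // fixed setup and measurement overhead.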

Graphics answered 5/11, 2021 at 15:31 Comment(16)
ARM performance numbers cannot include the chip (not made by arm) or system (not made by arm) issues. And it is pipelined so performance is not expected to be predictable. Have you taken all of these factors into account?Charlotte
Did you try different alignments of the code relative to memory space to account for fetch lines? (cached or not)Charlotte
@Charlotte thanks for reminding me of the alignment issues. By aligning the function to a 16-byte boundary, the performance of the original code (the one in the table) improved to 12.25 cycles/iteration. Still over 50% above the 8 cycles/iteration that I expected.Graphics
But also, if you change the alignment further: what matters is where the loop (if any) lands within fetch lines, which determines whether there is a bus transaction/clock saving per fetch, not necessarily whether the start of the function is aligned. Interesting to see that it was a contributor to your benchmark results.Charlotte
@Charlotte you're right, and indeed I also ensured this by passing the -mllvm -align-all-blocks=4 option to clang when compiling. Thus, any branch targets (not only function boundaries) should be aligned to 16 bytes.Graphics
Is your data hot in L1d cache when this runs? If not, memory or some outer-level cache bandwidth could be a concern. (I guess you're running repeatedly over the same small buffer, so that should be fine if there's no huge page-fault penalty on the first iteration).Spartan
A72 is an out-of-order design so it should be able to hide the load and ALU latency, but maybe its out-of-order capability isn't "deep" enough to see across enough iterations to hide that much latency? If unrolling helps by having two dependency chains interleaved, that might be a sign that OoO exec isn't hiding all the latency.Spartan
@PeterCordes Even the old A15's reorder buffer is 128 entries long. I don't think the A72's is any smaller.Surrealism
You already mentioned that there is no inter-iteration dependency. But could you double-check that x0's memory doesn't overlap with the others? I can't think of any other reason than "A72 just sucks" (and it does). How about benchmarking on other chips? (Different A72 or A75)Surrealism
And besides, the loop consists of 13 instructions, and you could reduce it to 12 instructions by counting down - zero terminate. Could you check if this affects the performance?Surrealism
@Jake'Alquimista'LEE Googling seems to confirm a 128-entry reorder buffer for A72. There is no aliasing between buffers -- I declare three different arrays for each.Graphics
@Jake'Alquimista'LEE: ROB size is relevant for hiding one slow instruction like a cache-miss. To overlap long dep chains, you also need a large-enough scheduler (RS) for not-yet-executed instructions. (See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths). A large scheduler costs a lot more power than a large ROB, because it's not allocated / reclaimed in order and has to be scanned for the oldest-ready instruction each cycle. Some CPUs use a unified scheduler, others have per port scheduling queues.Spartan
So I'd say we can't rule out OoO exec limitations just based on the ROB size. The scheduler might be large enough, but maybe not. (Also, the physical register file size can be another limitation (e.g. x86 experiments), but it's usually close to the ROB size, and this loop has a mix of integer and vector regs so is probably distributing its PRF demands over both register files, assuming Cortex-A72 is like most designs with separate int vs. FP/SIMD register files.)Spartan
@PeterCordes according to this article, the reservation stations in the Cortex-A72 are per-port rather than global. Guess I'll have to tear down and rebuild my mental model for writing code for this core, seeing as it should only be able to look e.g. at most 8 instructions ahead in the F0 port.Graphics
For reference, here is a fairly complete block diagram of A72, also confirming the per-port schedulers.Graphics
I think per-port scheduling queues take less power for the same amount of entries, so I'm not surprised A72 does that. AMD does that on their high-end CPUs (Zen), and I think Intel has some specialization of the entries in their "unified" scheduler since at least Skylake, like not all of them can hold ALU uops. Intel's low-power Silvermont or 3-wide Goldmont is maybe closer to A72 in terms of power vs. speed tradeoffs than most x86, and yeah it also uses per-port queues. realworldtech.com/silvermont/4Spartan

First off, you can further reduce the theoretical cycle count to 6 by replacing the first mul with uzp1 and doing the following smull and smlal the other way around: mul, mul, smull, smlal => smull, uzp1, mul, smlal. This also heavily reduces register pressure, so we can unroll even deeper (up to 32 elements per iteration).

And you don't need the v2 coefficients: you can pack them into the upper half of v1.
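
In intrinsics terms, the reordered body looks roughly like this. As before, this is a sketch under the same assumed constants; k12 is a single register holding the old v1 constant in its lower half and the old v2 constant in its upper half, since each half is only read by one instruction.

    #include <arm_neon.h>
    #include <stdint.h>

    void kernel_reordered(int32_t *dst, const int32_t *src1, const int32_t *src2,
                          intptr_t n, int32x4_t k0, int32x4_t k12)
    {
        for (intptr_t i = 0; i < n; i += 4) {
            int32x4_t a = vld1q_s32(src1 + i);
            int32x4_t b = vld1q_s32(src2 + i);
            int64x2_t lo = vmull_s32(vget_low_s32(b), vget_low_s32(a));   // smull
            int64x2_t hi = vmull_high_s32(b, a);                          // smull2
            // uzp1 recovers the low 32 bits of each 64-bit product, i.e. exactly
            // what the first mul used to compute, so one mul per block is saved.
            int32x4_t t = vuzp1q_s32(vreinterpretq_s32_s64(lo),
                                     vreinterpretq_s32_s64(hi));          // uzp1
            t = vmulq_s32(t, k0);                                         // mul
            lo = vmlal_s32(lo, vget_low_s32(t), vget_low_s32(k12));       // smlal
            hi = vmlal_high_s32(hi, t, k12);                              // smlal2
            int32x4_t r = vuzp2q_s32(vreinterpretq_s32_s64(lo),
                                     vreinterpretq_s32_s64(hi));          // uzp2
            vst1q_s32(dst + i, r);
        }
    }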

Let's rule out everything by unrolling this deep and writing it in assembly:

    .arch armv8-a
    .global foo
    .text


.balign 64
.func

// void foo(int32_t *pDst, int32_t *pSrc1, int32_t *pSrc2, intptr_t count);
pDst    .req    x0
pSrc1   .req    x1
pSrc2   .req    x2
count   .req    x3

foo:

// initialize coefficients v0 ~ v1

    stp     d8, d9, [sp, #-16]!

.balign 64
1:
    ldp     q16, q18, [pSrc1], #32
    ldp     q17, q19, [pSrc2], #32
    ldp     q20, q22, [pSrc1], #32
    ldp     q21, q23, [pSrc2], #32
    ldp     q24, q26, [pSrc1], #32
    ldp     q25, q27, [pSrc2], #32
    ldp     q28, q30, [pSrc1], #32
    ldp     q29, q31, [pSrc2], #32

    smull   v2.2d, v17.2s, v16.2s
    smull2  v3.2d, v17.4s, v16.4s
    smull   v4.2d, v19.2s, v18.2s
    smull2  v5.2d, v19.4s, v18.4s
    smull   v6.2d, v21.2s, v20.2s
    smull2  v7.2d, v21.4s, v20.4s
    smull   v8.2d, v23.2s, v22.2s
    smull2  v9.2d, v23.4s, v22.4s
    smull   v16.2d, v25.2s, v24.2s
    smull2  v17.2d, v25.4s, v24.4s
    smull   v18.2d, v27.2s, v26.2s
    smull2  v19.2d, v27.4s, v26.4s
    smull   v20.2d, v29.2s, v28.2s
    smull2  v21.2d, v29.4s, v28.4s
    smull   v22.2d, v31.2s, v30.2s
    smull2  v23.2d, v31.4s, v30.4s

    uzp1    v24.4s, v2.4s, v3.4s
    uzp1    v25.4s, v4.4s, v5.4s
    uzp1    v26.4s, v6.4s, v7.4s
    uzp1    v27.4s, v8.4s, v9.4s
    uzp1    v28.4s, v16.4s, v17.4s
    uzp1    v29.4s, v18.4s, v19.4s
    uzp1    v30.4s, v20.4s, v21.4s
    uzp1    v31.4s, v22.4s, v23.4s

    mul     v24.4s, v24.4s, v0.4s
    mul     v25.4s, v25.4s, v0.4s
    mul     v26.4s, v26.4s, v0.4s
    mul     v27.4s, v27.4s, v0.4s
    mul     v28.4s, v28.4s, v0.4s
    mul     v29.4s, v29.4s, v0.4s
    mul     v30.4s, v30.4s, v0.4s
    mul     v31.4s, v31.4s, v0.4s

    smlal   v2.2d, v24.2s, v1.2s
    smlal2  v3.2d, v24.4s, v1.4s
    smlal   v4.2d, v25.2s, v1.2s
    smlal2  v5.2d, v25.4s, v1.4s
    smlal   v6.2d, v26.2s, v1.2s
    smlal2  v7.2d, v26.4s, v1.4s
    smlal   v8.2d, v27.2s, v1.2s
    smlal2  v9.2d, v27.4s, v1.4s
    smlal   v16.2d, v28.2s, v1.2s
    smlal2  v17.2d, v28.4s, v1.4s
    smlal   v18.2d, v29.2s, v1.2s
    smlal2  v19.2d, v29.4s, v1.4s
    smlal   v20.2d, v30.2s, v1.2s
    smlal2  v21.2d, v30.4s, v1.4s
    smlal   v22.2d, v31.2s, v1.2s
    smlal2  v23.2d, v31.4s, v1.4s

    uzp2    v24.4s, v2.4s, v3.4s
    uzp2    v25.4s, v4.4s, v5.4s
    uzp2    v26.4s, v6.4s, v7.4s
    uzp2    v27.4s, v8.4s, v9.4s
    uzp2    v28.4s, v16.4s, v17.4s
    uzp2    v29.4s, v18.4s, v19.4s
    uzp2    v30.4s, v20.4s, v21.4s
    uzp2    v31.4s, v22.4s, v23.4s

    subs    count, count, #32

    stp     q24, q25, [pDst], #32
    stp     q26, q27, [pDst], #32
    stp     q28, q29, [pDst], #32
    stp     q30, q31, [pDst], #32

    b.gt    1b
.balign 16
    ldp     d8, d9, [sp], #16
    ret

.endfunc
.end

The code above has no latency stalls, even when executed in order. The only thing that could affect the performance would be the cache miss penalty.

You can measure the cycles, and if it far exceeds 48 cycles per iteration, there must be something wrong with the chip or the documentation.
Otherwise, the OoO engine of the A72 might be lackluster, as Peter pointed out.

PS: Or the load/store ports might not be issuing in parallel with the other ports on the A72. That would make sense given your unrolling experiment.

Blakeslee answered 6/11, 2021 at 5:15 Comment(8)
After an archaeological dig through my code, I confirmed that I had tried a similar approach earlier, but threw it out because it supposedly performed worse. It turns out that clang introduces an extra smull/smull2 pair that simply wasn't there in my intrinsics code ):Graphics
I was able to coerce a trunk version of gcc to generate code that's similar to what's in your answer, although with some differences in instruction scheduling (which I believe the OoO engine should work around). This resulted in an improvement to 8.75 cycles/iteration, but still that's far from 6 cycles/iteration ): Guess I'll try some microbenchmarks to see if I understand exactly where actual core behavior departs from what's described in the optimization manual.Graphics
@swineone: probably worth benchmarking this exact hand-written asm to see whether it's hitting some limitation Jake wasn't expecting. If it's also 8.75 c/iter then it's not GCC's fault. But if the GCC version is slower, then it could be down to instruction scheduling or some other effect of instruction order.Spartan
@Graphics Are you sure the L/S ports issue in parallel with other ones? It would make perfect sense if they don't.Surrealism
@Jake'Alquimista'LEE you may be onto something. I ran some microbenchmarks: 128-bit LDP alone has a reciprocal throughput of 2 cycles/instruction, as does 128-bit SIMD MUL. However, interleaving LDP and MUL results in a reciprocal throughput of about 2.4 cycles/instruction, whereas I would expect 2. In principle this shouldn't be an issue, as it would suggest a limit of 384 bits written every 2 cycles, which is exactly the amount written by the LDP/MUL pair.Graphics
@Jake'Alquimista'LEE also, I just microbenchmarked your exact code above, and it worked out to 74 cycles for one run of the loop body, instead of the 48 you were expecting. Normalizing to the measurements I used before (i.e. for processing 4 array elements at a time), it works out to 9.25 cycles/iteration. So indeed there's something fishy. I will keep investigating.Graphics
@Jake'Alquimista'LEE interleaving the uzp1 instructions with the preceding smull/smull2 instructions, and the uzp2 instructions with the preceding smlal/smlal2 instructions, improved the timings from 9.25 cycles/iteration to 8.75 cycles/iteration. Commenting out the stp instructions improved the timings to 8 cycles/iteration. By also commenting out the ldp instructions, the timings improved to 6.5 cycles/iteration. So you may be correct: A72 doesn't reasonably handle loads/stores in parallel with other ops.Graphics
@Graphics Thanks for the follow up. An interesting case indeed. And your experiments make me think that the OoO engine isn't that reliable, at least on the A72. Could you please try replacing the ldp/stp instructions with ld1/st1 handling four registers? That could also bring some performance gain.Surrealism

Starting from Jake's code, reducing the unrolling factor by half, changing some of the register allocation, and trying many different variations of load/store instructions (as well as different addressing modes) and instruction scheduling, I finally arrived at the following solution:

    ld1     {v16.4s, v17.4s, v18.4s, v19.4s}, [pSrc1], #64
    ld1     {v20.4s, v21.4s, v22.4s, v23.4s}, [pSrc2], #64

    add     count, pDst, count, lsl #2

    // initialize v0/v1

loop:
    smull   v24.2d, v20.2s, v16.2s
    smull2  v25.2d, v20.4s, v16.4s
    uzp1    v2.4s, v24.4s, v25.4s

    smull   v26.2d, v21.2s, v17.2s
    smull2  v27.2d, v21.4s, v17.4s
    uzp1    v3.4s, v26.4s, v27.4s

    smull   v28.2d, v22.2s, v18.2s
    smull2  v29.2d, v22.4s, v18.4s
    uzp1    v4.4s, v28.4s, v29.4s

    smull   v30.2d, v23.2s, v19.2s
    smull2  v31.2d, v23.4s, v19.4s
    uzp1    v5.4s, v30.4s, v31.4s

    mul     v2.4s, v2.4s, v0.4s
    ldp     q16, q17, [pSrc1]
    mul     v3.4s, v3.4s, v0.4s
    ldp     q18, q19, [pSrc1, #32]
    add     pSrc1, pSrc1, #64

    mul     v4.4s, v4.4s, v0.4s
    ldp     q20, q21, [pSrc2]
    mul     v5.4s, v5.4s, v0.4s
    ldp     q22, q23, [pSrc2, #32]
    add     pSrc2, pSrc2, #64

    smlal   v24.2d, v2.2s, v1.2s
    smlal2  v25.2d, v2.4s, v1.4s
    uzp2    v2.4s, v24.4s, v25.4s
    str     q2, [pDst], #16

    smlal   v26.2d, v3.2s, v1.2s
    smlal2  v27.2d, v3.4s, v1.4s
    uzp2    v3.4s, v26.4s, v27.4s
    str     q3, [pDst], #16

    smlal   v28.2d, v4.2s, v1.2s
    smlal2  v29.2d, v4.4s, v1.4s
    uzp2    v4.4s, v28.4s, v29.4s
    str     q4, [pDst], #16

    smlal   v30.2d, v5.2s, v1.2s
    smlal2  v31.2d, v5.4s, v1.4s
    uzp2    v5.4s, v30.4s, v31.4s
    str     q5, [pDst], #16

    cmp     count, pDst
    b.ne    loop

Note that, although I have carefully reviewed the code, I haven't tested whether it actually works, so there may be something missing that would impact performance. A final iteration of the loop, with the load instructions removed, is required to prevent an out-of-bounds memory access; I omitted this to save some space.

Performing an analysis similar to that of the original question, and assuming the code is fully throughput-limited, suggests that this loop should take 24 cycles. Normalizing to the same metric used elsewhere (i.e. cycles per 4-element iteration), this works out to 6 cycles/iteration. Benchmarking the code resulted in 26 cycles per loop execution, or 6.5 cycles/iteration in the normalized metric. While not the bare minimum supposedly achievable, it comes very close.

Some notes for anyone else who stumbles across this question, after scratching their heads about Cortex-A72 performance:

  1. The schedulers (reservation stations) are per-port rather than global (see this article and this block diagram). Unless your code has a very balanced instruction mix among loads, stores, scalar, Neon, branches, etc., the OoO window will be smaller than you would expect, sometimes very much so. This code in particular is a pathological case for per-port schedulers, since 70% of all instructions are Neon and 50% of all instructions are multiplications (which only run on the F0 port). For those multiplications, the OoO window is a very anemic 8 instructions, so don't expect the CPU to be looking at the next loop iteration's instructions while executing the current one.

  2. Attempting to further reduce the unrolling factor by half results in a large (23%) slowdown. My guess for the cause is the shallow OoO window, due to the per-port schedulers and the high prevalence of instructions bound to port F0, as explained in point 1 above. Without being able to look at the next iteration, there is less parallelism to be extracted, so the code becomes latency- rather than throughput-bound. Thus, it appears that interleaving multiple iterations of a loop is an important optimization strategy to consider for this core.

  3. One must pay attention to the specific addressing mode used for loads. Replacing the immediate post-index addressing mode used in the original code with immediate offset, and manually incrementing the pointers elsewhere, resulted in performance gains, but only for the loads (stores were unaffected). Section 4.5 ("Load/Store Throughput") of the optimization manual hints at this in the context of a memory copy routine, but gives no rationale. However, I believe it is explained by point 4 below.

  4. Apparently the main bottleneck of this code is writing to the register file: according to this answer to another SO question, the register file only supports writing 192 bits per cycle. This may explain why loads should avoid addressing modes with writeback (pre- and post-index), as the writeback consumes an extra 64 bits of register-file write bandwidth. It's all too easy to exceed this limit with Neon instructions and vector loads (even more so with LDP and the 2/3/4-register forms of LD1), even without the added pressure of writing back the incremented address. Knowing this, I also decided to replace the original subs in Jake's code with a comparison against pDst, since comparisons don't write to the register file -- and this actually improved performance by 1/4 of a cycle.

Interestingly, adding up the number of bits written to the register file during one execution of the loop results in 4992 bits (I have no idea whether writes to PC, specifically by the b.ne instruction, should be included in the tally or not; I arbitrarily chose not to). Given the 192-bit/cycle limit, this works out to a minimum of 26 cycles to write all these results to the register file across the loop. So it appears that the code above can't be made faster by rescheduling instructions alone.
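
For reference, one way the 4992-bit figure can be broken down, counting 128 bits per vector destination and 64 bits per general-purpose destination (address writebacks included; store data and flags excluded):

  - 4 × ldp, each writing two q registers: 8 × 128 = 1024 bits
  - 2 × add (pointer increments): 2 × 64 = 128 bits
  - 8 × smull/smull2: 8 × 128 = 1024 bits
  - 4 × uzp1: 4 × 128 = 512 bits
  - 4 × mul: 4 × 128 = 512 bits
  - 8 × smlal/smlal2: 8 × 128 = 1024 bits
  - 4 × uzp2: 4 × 128 = 512 bits
  - 4 × str with post-index writeback: 4 × 64 = 256 bits

Total: 4992 bits, and 4992 / 192 = 26 cycles.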

Theoretically it might be possible to shed 1 cycle by switching the addressing mode of the stores to immediate offset, and then including an extra instruction to explicitly increment pDst. For the original code, each of the 4 stores would write 64 bits to pDst, for a total of 256 bits, compared to a single 64-bit write if pDst were explicitly incremented once. Thus, this change would result in saving 192 bits, i.e., 1 cycle's worth of register file writes. I attempted this change, trying to schedule the increments of pSrc1/pSrc2/pDst across many different points of the code, but unfortunately I was only able to slow down rather than speed up the code. Perhaps I am hitting a different bottleneck such as instruction decoding.

Graphics answered 9/11, 2021 at 17:27 Comment(3)
The program counter is special, and isn't renamed onto physical registers in the register file. There's a PC value associated with each instruction in the ROB (somehow) in case OoO exec needs to roll back to that point, or something like that. It's known early, in the front-end during branch-prediction and fetch/decode, unlike data register values. There's no reason a CPU would even want to keep PC in the register file along with data registers. (32-bit ARM having PC as one of the 16 r0..r15 registers is unusual, but I'd guess modern implementations just special-case every r15 reference.)Spartan
Your link for the 192-bit limit is for the Cortex A57. Is there reason to think the A72 didn't improve it?Nebo
@NateEldredge you’re right, I’m working with both A57 and A72 and this detail escaped me. However, I’m inclined to believe nothing has changed, from benchmarks I’ve performed earlier. For instance, I just benchmarked a loop containing a sequence of 16 add v0.4s, v1.4s, v2.4s. Given that both F0 and F1 can execute vector additions, each iteration of the loop should run in 8 cycles, but in reality it runs in 11 cycles. The 192-bit limit suggests it should take 10.67 cycles — add the 1/3 cycle for the 64-bit subtraction performed by the loop and it’s exactly 11 cycles.Graphics

I revisited the problem recently, and although Jake's approach of replacing mul with uzp1 is a definite improvement, I was still curious as to whether I could get closer to the expected 8 cycles/iteration using the original approach.

I implemented a 6-stage software pipeline using C and intrinsics:

load → mul/smull/smull2 → mul → smlal/smlal2 → uzp2 → store

Compiling with clang and using #pragma clang loop unroll_count(6) for the main loop, I was able to achieve ~8.9 cycles per iteration. It is possible that experimenting with ldp/stp and ld1/st1 might yield even better results.
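
For illustration, a heavily simplified sketch of the pipelining idea (not the actual code, which isn't shown; function and variable names are made up, the constants follow the hypothetical reconstruction given earlier, and the prologue/epilogue are handled crudely by padding the trip count and predicating the loads and stores):

    #include <arm_neon.h>
    #include <stdint.h>

    // Each pass runs stage 1 for block i, stage 2 for block i-1, ..., stage 6 for
    // block i-5, so work from six consecutive 4-element blocks is in flight even
    // if the hardware's per-port OoO window is short.
    void kernel_swp(int32_t *dst, const int32_t *src1, const int32_t *src2,
                    intptr_t n, int32x4_t k0, int32x4_t k1, int32x4_t k2)
    {
        intptr_t blocks = n / 4;

        // Pipeline registers: stage s writes them, a later stage reads them on a
        // subsequent pass.
        int32x4_t a12 = vdupq_n_s32(0), b12 = vdupq_n_s32(0);
        int32x4_t t23 = vdupq_n_s32(0), t34 = vdupq_n_s32(0);
        int64x2_t lo23 = vdupq_n_s64(0), hi23 = vdupq_n_s64(0);
        int64x2_t lo34 = vdupq_n_s64(0), hi34 = vdupq_n_s64(0);
        int64x2_t lo45 = vdupq_n_s64(0), hi45 = vdupq_n_s64(0);
        int32x4_t r56 = vdupq_n_s32(0);

        for (intptr_t i = 0; i < blocks + 5; i++) {
            // Stage 6: store block i-5.
            if (i >= 5)
                vst1q_s32(dst + 4 * (i - 5), r56);

            // Stage 5: uzp2 for block i-4.
            int32x4_t r5 = vuzp2q_s32(vreinterpretq_s32_s64(lo45),
                                      vreinterpretq_s32_s64(hi45));

            // Stage 4: smlal/smlal2 for block i-3.
            int64x2_t lo4 = vmlal_s32(lo34, vget_low_s32(t34), vget_low_s32(k1));
            int64x2_t hi4 = vmlal_high_s32(hi34, t34, k2);

            // Stage 3: second mul for block i-2.
            int32x4_t t3 = vmulq_s32(t23, k0);

            // Stage 2: mul + smull/smull2 for block i-1.
            int32x4_t t2 = vmulq_s32(a12, b12);
            int64x2_t lo2 = vmull_s32(vget_low_s32(a12), vget_low_s32(b12));
            int64x2_t hi2 = vmull_high_s32(a12, b12);

            // Stage 1: load block i (dummy data past the end of the arrays).
            int32x4_t a1 = (i < blocks) ? vld1q_s32(src1 + 4 * i) : vdupq_n_s32(0);
            int32x4_t b1 = (i < blocks) ? vld1q_s32(src2 + 4 * i) : vdupq_n_s32(0);

            // Advance the pipeline by one stage.
            r56 = r5;
            lo45 = lo4; hi45 = hi4;
            lo34 = lo23; hi34 = hi23;   // smull/smull2 results skip stage 3
            lo23 = lo2;  hi23 = hi2;
            t34 = t3;
            t23 = t2;
            a12 = a1; b12 = b1;
        }
    }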

Thus, software pipelining might be a profitable avenue to explore in the Cortex-A72, in order to overcome the short OOO window due to per-port scheduling queues, if most instructions execute on a limited number of ports (or even a single one).

Graphics answered 3/12, 2021 at 6:2 Comment(0)
