ARM Cortex A8 Benchmarks: can someone help me make sense of these numbers?

I'm working on writing several real-time DSP algorithms on Android, so I decided to program the ARM directly in assembly to optimize everything as much as possible and keep the math as lightweight as it can be. At first I was getting speed benchmarks that didn't make a whole lot of sense, so I started reading about pipeline hazards, dual-issue capabilities and so on. I'm still puzzled by some of the numbers I'm getting, so I'm posting them here in the hope that someone can shed some light on why I get what I get. In particular, I'm interested in why NEON takes different amounts of time to run calculations on different datatypes even though it claims to do each operation in exactly one cycle. My findings are as follows.

I'm using a very simple loop for benchmarking, and I run it for 2,000,000 iterations. Here's my function:

hzrd_test:

    @use the received argument as the number of loop iterations
    mov r3 , r0

    @come up with some simple values
    mov r0, #1
    mov r1, #2

    @Initialize some NEON registers (Q0-Q11)
    vmov.32 d0, r0, r1
    vmov.32 d1, r0, r1
    vmov.32 d2, r0, r1

    ...

    vmov.32 d21, r0, r1
    vmov.32 d22, r0, r1
    vmov.32 d23, r0, r1

hzrd_loop:

    @do some math
    vadd.s32 q0, q0, q1
    vadd.s32 q1, q0, q1
    vadd.s32 q2, q0, q1
    vadd.s32 q3, q0, q1
    vadd.s32 q4, q0, q1
    vadd.s32 q5, q0, q1
    vadd.s32 q6, q0, q1
    vadd.s32 q7, q0, q1
    vadd.s32 q8, q0, q1
    vadd.s32 q9, q0, q1
    vadd.s32 q10, q0, q1
    vadd.s32 q11, q0, q1

    @decrement loop counter, branch to loop again or return
    subs r3, r3, #1
    bne hzrd_loop

    @return
    mov r0, r3
    mov pc, lr

Note the operation and datatype: vector add (vadd) on signed 32-bit integers (s32). This operation completes within a certain amount of time (see the results below). According to this ARM Cortex-A8 document and the following pages, almost all elementary arithmetic operations in NEON should complete in one cycle, but here's what I'm getting:

vmul.f32 ~62ms
vmul.u32 ~125ms
vmul.s32 ~125ms

vadd.f32 ~63ms
vadd.u32 ~29ms
vadd.s32 ~30ms

I get these by simply replacing the operation and datatype of every instruction in the loop above (one line of each variant is shown below). Is there a reason vadd.u32 is twice as fast as vadd.f32, and vmul.f32 is twice as fast as vmul.u32?
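
For reference, here's what one line of the loop body looks like for each variant I timed (only the mnemonic and datatype change; the registers stay the same):

    vadd.s32 q2, q0, q1    @ signed 32-bit integer add (original)
    vadd.u32 q2, q0, q1    @ unsigned 32-bit integer add
    vadd.f32 q2, q0, q1    @ single-precision float add
    vmul.s32 q2, q0, q1    @ signed 32-bit integer multiply
    vmul.u32 q2, q0, q1    @ unsigned 32-bit integer multiply
    vmul.f32 q2, q0, q1    @ single-precision float multiply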

Cheers! = )

Trooper answered 8/11, 2011 at 17:4 Comment(0)

Wow, your results are VERY accurate:

  • 32-bit integer Q-register multiply costs 4 cycles, while float takes 2.
  • 32-bit integer Q-register add costs 1 cycle, while float takes 2.

Nice experiment.
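
Just to connect those cycle counts with your milliseconds - assuming a core clock somewhere around 800 MHz (a guess on my part, since you didn't say which device) and counting only the 12 NEON instructions per iteration:

    2,000,000 iterations x 12 instructions x 1 cycle  = 24M cycles ~ 30 ms   (vadd.u32 / vadd.s32)
    2,000,000 iterations x 12 instructions x 2 cycles = 48M cycles ~ 60 ms   (vadd.f32 / vmul.f32)
    2,000,000 iterations x 12 instructions x 4 cycles = 96M cycles ~ 120 ms  (vmul.u32 / vmul.s32)

That lines up pretty well with the table in your question.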

Maybe you already know, but be careful while coding for NEON:

  • do not access memory with ARM while NEON is doing a heavy job
  • do not mix VFP instructions with NEON's (except for the shared ones)
  • do not access S registers
  • do not transfer from NEON registers to ARM's

All of those above will cause HUGE hiccups.
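
For the last point, a hypothetical little sketch of what I mean (r2 here is just some scratch pointer I made up for the example):

    @ BAD: reading a NEON result straight into an ARM register makes the
    @ ARM side wait for the NEON pipeline to drain on the A8
    vadd.s32  q0, q0, q1
    vmov.32   r0, d0[0]        @ NEON -> ARM transfer: big stall

    @ usually better: park the result in memory and read it back later
    vadd.s32  q0, q0, q1
    vst1.32   {d0}, [r2]       @ store from the NEON side (r2 = scratch buffer)
    @ ...do unrelated ARM work here...
    ldr       r0, [r2]         @ pick it up on the ARM side when you need it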

Good Luck!

PS: I'd rather optimize for the A9 instead (slightly different cycle timings) since pretty much all new devices are coming with the A9. And the A9 timing chart from ARM is much more readable. :-)

Kamerman answered 8/11, 2011 at 21:21 Comment(4)
Well, from the way you count cycles, my data definitely makes sense, but how do you count them? I'm obviously not reading the documentation correctly, or we're reading different documents.Trooper
We are actually reading the same one. You don't have to count; it's right there in the chart under "Cycles". ARM is trying to show us too much and succeeded in just confusing us. Look at VMUL (integer, normal). To the right of "Qd, Qm, Qn" you see 1 to 4. This means it takes 4 cycles, and to the right of those you can see what's happening at THAT stage of the execution cycle (which pipeline stage the source operands are expected in, and which stage the destination operand is written in).Gan
I'd have put it this way : 4, Qd(5,3), Qn(4,2), Qm(3,1). But it's ARM.Gan
Thanks! = ) I was indeed confused by the document.Trooper

I'm going to guess (as I don't have my doc links handy) that you're running into pipeline issues. I know that the FPU - err, now the VFP - has a different pipeline length than the CPU does for the integer math portion of your loop. I also see that the 2nd arithmetic operation is dependent upon the first, which will stall either pipeline and possibly expose the differences that you are seeing.
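
If you want to separate that effect out, here's a hypothetical variant of the loop body where q0 and q1 are only ever read, so no instruction has to wait on the result of the one before it (the label and the register numbering are just made up for illustration):

    hzrd_loop_nodep:

        @ q0/q1 are read-only sources; results go to q2-q13, so there is
        @ no result-to-source dependency between instructions in the loop
        vadd.s32 q2,  q0, q1
        vadd.s32 q3,  q0, q1
        vadd.s32 q4,  q0, q1
        vadd.s32 q5,  q0, q1
        vadd.s32 q6,  q0, q1
        vadd.s32 q7,  q0, q1
        vadd.s32 q8,  q0, q1
        vadd.s32 q9,  q0, q1
        vadd.s32 q10, q0, q1
        vadd.s32 q11, q0, q1
        vadd.s32 q12, q0, q1
        vadd.s32 q13, q0, q1

        @ decrement loop counter and branch, same as the original
        subs r3, r3, #1
        bne hzrd_loop_nodep

If the timing changes noticeably, you were measuring the dependency rather than the per-instruction cost.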

Also, I believe multiply is not a 1-cycle instruction for ints, but a 2-5 cycle one depending on the MSB of the 2nd value - 2 cycles here due to the small operand values, which would explain that difference. To verify that, start with larger multiply operands and see if the loop slows down.
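
For example, sticking with the structure of the original init code (the constants are just arbitrary large values I picked for this sketch):

    @ feed the multiplies large operands instead of #1/#2,
    @ then run the same vmul.u32 loop and compare the timings
    ldr r0, =0x7FFFFFFF
    ldr r1, =0x7FFFFFFE
    vmov.32 d0, r0, r1
    vmov.32 d1, r0, r1
    @ ...and so on for the rest of d2-d23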

I'd also verify that your code all fits in 1 cache page just to eliminate that possibility as well.

I'd also check out the section on Dual execution just above, as there are all sorts of pipeline stalls that happen there as well when things are cross-dependent.

Dallis answered 8/11, 2011 at 18:38 Comment(3)
I'm not 100% sure I'm reading section 16.6 of the Cortex-A8 TRM (PDF) correctly, but I think you're right about some of the instructions taking more than 1 cycle.Margalo
Michael, how large is one page of the instruction cache? I never paid attention to that so far. It would be good to know.Gan
Honestly, I'm not sure. It varies per implementation - something you'd have to go look up for your particular device. Nice answer BTW - Your answer explains the large differences while mine might explain the smaller differences.Dallis