I'm writing several real-time DSP algorithms on Android, so I decided to program the ARM directly in assembly to optimize everything as much as possible and keep the math as lightweight as I can. At first I was getting benchmark timings that didn't make much sense, so I started reading about pipeline hazards, dual-issue capabilities and so on. I'm still puzzled by some of the numbers, so I'm posting them here in the hope that someone can shed some light on why I get what I get. In particular, I'd like to know why NEON takes different amounts of time for the same calculation on different datatypes, even though the documentation claims each operation takes exactly one cycle. My findings are as follows.
I'm using a very simple loop for benchmarking, and I run it for 2,000,000 iterations. Here's my function:
hzrd_test:
@use the received argument as the number of loop iterations
mov r3 , r0
@come up with some simple values
mov r0, #1
mov r1, #2
@Initialize some NEON registers (Q0-Q11)
vmov.32 d0, r0, r1
vmov.32 d1, r0, r1
vmov.32 d2, r0, r1
...
vmov.32 d21, r0, r1
vmov.32 d22, r0, r1
vmov.32 d23, r0, r1
hzrd_loop:
@do some math
vadd.s32 q0, q0, q1
vadd.s32 q1, q0, q1
vadd.s32 q2, q0, q1
vadd.s32 q3, q0, q1
vadd.s32 q4, q0, q1
vadd.s32 q5, q0, q1
vadd.s32 q6, q0, q1
vadd.s32 q7, q0, q1
vadd.s32 q8, q0, q1
vadd.s32 q9, q0, q1
vadd.s32 q10, q0, q1
vadd.s32 q11, q0, q1
@decrement loop counter, branch to loop again or return
subs r3, r3, #1
bne hzrd_loop
@return
mov r0, r3
mov pc, lr
Notice that the operation and datatype are specified as vector add (vadd) and signed 32-bit int (s32). The loop completes within a certain time (see the results table below). According to this ARM Cortex-A8 document and the following pages, almost all elementary arithmetic operations in NEON should complete in one cycle, but here's what I'm getting:
Operation   Time for 2,000,000 iterations
vmul.f32    ~62 ms
vmul.u32    ~125 ms
vmul.s32    ~125 ms
vadd.f32    ~63 ms
vadd.u32    ~29 ms
vadd.s32    ~30 ms
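To make these timings easier to compare against a one-cycle-per-instruction expectation, here is a minimal sketch in C that converts a total run time into time per NEON instruction. The function names and the clock_mhz parameter are hypothetical helpers for illustration only; I haven't stated the core clock anywhere, so any clock value passed in is an assumption. The result also includes the overhead of the subs/bne pair in each iteration.

```c
#include <assert.h>

/* Hypothetical helper: average time per NEON instruction, in nanoseconds.
 * total_ms       - measured wall time for the whole run, in milliseconds
 * iterations     - loop iterations (2,000,000 in the benchmark above)
 * instrs_per_iter- NEON instructions per iteration (12 in the loop above)
 * Loop overhead (subs/bne) is folded into the result. */
double ns_per_instruction(double total_ms, long iterations, int instrs_per_iter)
{
    assert(iterations > 0 && instrs_per_iter > 0);
    double total_ns = total_ms * 1e6;                    /* ms -> ns */
    double instr_count = (double)iterations * instrs_per_iter;
    return total_ns / instr_count;
}

/* Hypothetical helper: same figure expressed in core clock cycles.
 * clock_mhz is an ASSUMED core frequency supplied by the caller. */
double cycles_per_instruction(double total_ms, long iterations,
                              int instrs_per_iter, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    /* 1 cycle lasts 1000 / clock_mhz nanoseconds */
    return ns_per_instruction(total_ms, iterations, instrs_per_iter)
           * clock_mhz / 1000.0;
}
```

For example, ns_per_instruction(30.0, 2000000, 12) works out to 1.25 ns per vadd.s32, while the ~125 ms integer multiply run comes to roughly 5.2 ns per instruction. Whether that corresponds to one cycle or several depends entirely on the actual clock rate of the device.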
I get each measurement by simply replacing the operation and datatype of every instruction in the loop above. Is there a reason vadd.u32 is twice as fast as vadd.f32, and vmul.f32 is twice as fast as vmul.u32?
Cheers! = )