Does ARM sit idle while NEON is doing its operations?

Asked 19/10, 2012 at 6:54 Answered 19/10, 2012 at 7:7

Solved linux embedded arm neon cortex-a8

This might look similar to "ARM and NEON can work in parallel?", but it's not; I have a different issue (maybe a problem with my understanding):

In the protocol stack, the checksum is currently computed on the GPP. I'm now handing that task over to NEON as part of a function:

Here is the checksum function that I have written for NEON, posted on Stack Overflow: Checksum code implementation for Neon in Intrinsics

Now, suppose this function is called from Linux:

ip_csum(){
  …
  …
  csum = do_csum(); //function call from arm
  …
  …
}


do_csum(){
  …
  …
  //NEON optimised code
  …
  …
  returns the final checksum to ip_csum/linux/ARM
}

In this case, what happens to ARM when NEON is doing the calculations? Does ARM sit idle, or does it move on with other operations?

As you can see, do_csum is called and we are waiting on that result (or that is what it looks like).

NOTE:

Speaking in terms of cortex-a8
do_csum as you can see from the link is coded with intrinsics
compilation using gnu tool-chain
It would be good if you also covered multi-threading or any other concept that comes into the picture when these inter-operations happen.

Questions:

Does ARM sit idle while NEON is doing its operations (in this particular case)?
Or does it shelve the current ip_csum-related code and take up another process/thread until NEON is done? (I'm almost dumb as to what happens here.)
If it's sitting idle, how can we make ARM work on something else until NEON is done?

Kamilah answered 19/10, 2012 at 6:54 Comment(5)

ARM units can do useful work in parallel with NEON. I suggest to have a look at this paper as an example of such optimization: cr.yp.to/highspeed/neoncrypto-20120320.pdf – Cachepot 19/10, 2012 at 7:22

We need to see the actual instructions to be able to say what is executing in parallel. Your "NEON optimised" code is probably a mix of NEON and non-NEON instructions. – Steamboat 19/10, 2012 at 7:30

See my answer but you have to look at things at the instruction level, not the thread or function level. – Steamboat 19/10, 2012 at 7:53

Thanks @Maratyszcza, that link was indeed helpful. Here's a question that might be silly, but does the concept of multi-threading/semaphores/spinlocks come into the picture while ARM invokes code for NEON? – Kamilah 22/10, 2012 at 6:22

Running instructions on ARM and NEON units in parallel has nothing to do with multi-threading. This is an example of instruction-level parallelism: running several independent instructions in the same cycle. Nearly all modern CPUs are able to run several instructions per cycle. However, the instructions running in the same cycle are not as independent for the CPU as instructions from different threads, and in many situations the processor fails to recognize that they are independent. – Cachepot 22/10, 2012 at 8:24

[Figure: Cortex-A8 pipeline diagram (image from TI Wiki Cortex A8)]

The ARM (or rather the integer pipeline) does not sit idle while NEON instructions are processing. In the Cortex-A8, NEON is at the "end" of the processor pipeline. Instructions flow through the pipeline: ARM instructions are executed at the "beginning" of the pipeline and NEON instructions at the end. Every clock pushes the instructions further down the pipeline.

Here are some hints on how to read the diagram above:

Every cycle, if possible, the processor fetches an instruction pair (two instructions).
Fetching is pipelined so it takes 3 cycles for the instructions to propagate into the decode unit.
It takes 5 cycles (D0-D4) for the instruction to be decoded. Again, this is all pipelined, so it affects the latency but not the throughput. More instructions keep flowing through the pipeline where possible.
Now we reach the execute/load store portion. NEON instructions flow through this stage (but they do that while other instructions are possibly executing).
We get to the NEON portion. If the instruction fetched 13 cycles ago was a NEON instruction, it is now decoded and executed in the NEON pipeline.
While this is happening, integer instructions that followed that instruction can execute at the same time in the integer pipeline.
The pipeline is a fairly complex beast, some instructions are multi-cycle, some have dependencies and will stall if those dependencies are not met. Other events such as branches will flush the pipeline.

If you are executing a sequence that is 100% NEON instructions (which is pretty rare, since there are usually some ARM registers involved, control flow, etc.) then there is some period where the integer pipeline isn't doing anything useful. Most code will have the two executing concurrently for at least some of the time, while cleverly engineered code can maximize performance with the right instruction mix.
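As a rough illustration of why a "NEON" routine is rarely 100% NEON, here is a minimal sketch of a word-summing loop written with intrinsics. It is not the asker's do_csum; the name sum_words16 and the 262144-word limit are assumptions made for this example, and a scalar fallback is included so it also builds where NEON is unavailable. The loop counter, the address arithmetic and the branch are ordinary ARM instructions, while the load and the accumulate are NEON instructions in the same instruction stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper (not from the question): sums `count` 16-bit words. */
uint64_t sum_words16(const uint16_t *data, size_t count)
{
    uint64_t total = 0;
    size_t i = 0;

    assert(data != NULL || count == 0);
    /* Assumed limit: keeps the 32-bit NEON lanes below from overflowing. */
    assert(count <= 262144u);

#if HAVE_NEON
    uint32x4_t acc = vdupq_n_u32(0);
    for (; i + 8 <= count; i += 8) {         /* ARM: compare, add, branch      */
        uint16x8_t v = vld1q_u16(data + i);  /* NEON load, address from ARM    */
        acc = vpadalq_u16(acc, v);           /* NEON: pairwise add, accumulate */
    }
    /* Reduce the four 32-bit lanes to a single scalar. */
    uint64x2_t wide = vpaddlq_u32(acc);
    total = vgetq_lane_u64(wide, 0) + vgetq_lane_u64(wide, 1);
#endif

    /* Scalar tail (or the whole buffer when NEON is not available). */
    for (; i < count; ++i)
        total += data[i];

    return total;
}
```

How the compiler actually schedules these instructions depends on the toolchain and flags, so the generated assembly is what should be inspected, not the C source.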

This online tool Cycle Counter for Cortex A8 is great for analyzing the performance of your assembly code and gives information about what is executing in what units and what is stalling.

Steamboat answered 19/10, 2012 at 7:7 Comment(5)

NEON ALUs are not necessarily at the back of the pipeline, nor do they necessarily have separate decode/dispatch. They are on the Cortex-A8, but on many other implementations they are in the same location in the pipeline as other ALUs, and share decode and dispatch. (Upvoted anyway, since the question specifies Cortex-A8). – Blende 21/10, 2012 at 0:43

@StephenCanon: Thanks... I clarified this in my answer. – Steamboat 21/10, 2012 at 1:22

Thanks. Here's a question that might be silly, but does the concept of multi-threading/semaphores/spinlocks come into the picture (or can we drag it into the picture) while ARM invokes code for NEON? – Kamilah 22/10, 2012 at 6:25

@Asif No. See my answer below. – Liebowitz 22/10, 2012 at 8:24

@Asif: things like context switching take hundreds of cycles so it doesn't really come into the picture. In theory you could wait on a spinlock while processing NEON instructions, you would need to structure your code that way though. It's all about the instruction mix read by the processor at the lowest level. – Steamboat 22/10, 2012 at 17:27

In the Application Level Programmers' Model, you can't really distinguish between the ARM and NEON units.

While NEON is a separate hardware unit (available as an option on Cortex-A series processors), it is the ARM core that drives it in a tight fashion. It is not a separate DSP that you can communicate with in an asynchronous fashion.

You can write better code by fully utilizing the pipelines of both units, but this is not the same as having a separate core.

The NEON unit is there because it can do some operations (SIMD) much faster than the ARM unit, even at a low clock frequency.

This is like having a friend who is good at math: whenever you have a hard question, you can ask him. While waiting for an answer you can do some small things, like "if the answer is this I should do this, or if not, do that instead", but if you depend on that answer to go on, you need to wait for him to answer before going further. You could calculate the answer yourself, but asking will be much faster, even including the communication time between the two of you, compared to doing all the math yourself. I think you can even extend this analogy: "you also need to buy that friend some lunch (energy consumption), but in many cases it is worth it".

Anyone who says the ARM core can do other things while the NEON core is working on its stuff is talking about instruction-level parallelism, not anything like task-level parallelism.
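To make the "no asynchronous hand-off" point concrete, here is a minimal sketch of the call structure from the question. The names demo_do_csum and demo_ip_csum are hypothetical stand-ins, and the body is plain scalar C (a standard 16-bit one's-complement sum) rather than the asker's NEON code. The point is only the shape of the call: there is no submit or wait step, and the caller continues when the callee's last instruction has gone through the pipeline, whether those instructions were ARM or NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for do_csum(): folded 16-bit one's-complement sum.
 * Bytes are read in network (big-endian) order. */
uint16_t demo_do_csum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(buf != NULL || len == 0);
    assert(len <= 65535u);  /* assumed limit; keeps the 32-bit sum from overflowing */

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)                      /* odd trailing byte, padded with zero */
        sum += (uint32_t)buf[i] << 8;

    while (sum >> 16)                 /* fold the carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)sum;
}

/* Hypothetical stand-in for ip_csum(): an ordinary, synchronous call. */
uint16_t demo_ip_csum(const uint8_t *hdr, size_t len)
{
    /* No thread, semaphore or spinlock is involved: the result is simply
     * the return value of a normal function call. */
    uint16_t folded = demo_do_csum(hdr, len);
    return (uint16_t)~folded;
}
```

Replacing the loop in demo_do_csum with NEON intrinsics would not change this calling pattern; it would only change which execution units the instructions inside the function use.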

Liebowitz answered 19/10, 2012 at 7:3 Comment(4)

technically the ARM core does not "drive" the NEON unit, NEON instructions are just processed later down the pipeline. – Steamboat 19/10, 2012 at 7:13

Yes, but in one sense, you need to set values for the NEON unit in VFP registers, right? And you need to get them back too. Even when NEON does the loading, you call NEON instructions in a loop, for example; from the application-level programmer's point of view, all of that is quite coupled. – Liebowitz 19/10, 2012 at 7:20

yes, the NEON unit has no control flow instructions so it's limited in that way. I guess it's more accurate to say the ARM controls the execution flow... – Steamboat 19/10, 2012 at 7:27

Thanks @auselen, How I wish I could accept multiple answers on SO..!!! Until then +1 ;) – Kamilah 22/10, 2012 at 8:53

ARM is not "idle" while NEON operations are executed, but controls them.
To fully use the power of both units, one can carefully plan an interleaved sequence of operations:

loop:
    subs      r0, r0, r1              @ ARM operation (sets the flags tested by bne)
    vadd.i16  q0, q0, q1              @ NEON operation
    ldr       r3, [r1, r2, lsl #2]    @ ARM operation
    vld1.32   {d4}, [r1]!             @ NEON operation using an ARM register
    bne       loop                    @ ARM operation controlling the flow of both units...

The ARM Cortex-A8 can execute up to 2 instructions in each clock cycle. If both of them are independent NEON operations, it's no use putting an ARM instruction in between. OTOH, if one knows that the latency of a VLD (load) is large, one can place many ARM instructions between the load and the first use of the loaded value. But in each case the combined usage must be planned in advance and interleaved.
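The idea of separating a load from its first use can be sketched in C with intrinsics as a "load-ahead" loop: the next block is loaded before the current one is consumed. This is only an illustration under assumptions. The name sum_bytes_load_ahead is hypothetical, the compiler is free to reorder these statements, and whether it helps on a given Cortex-A8 build would have to be measured. A scalar path is included so the sketch also compiles without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: adds up `len` bytes (result is modulo 2^32). */
uint32_t sum_bytes_load_ahead(const uint8_t *buf, size_t len)
{
    uint32_t total = 0;
    size_t i = 0;

    assert(buf != NULL || len == 0);

#if HAVE_NEON
    if (len >= 16) {
        uint32x4_t acc = vdupq_n_u32(0);
        uint8x16_t cur = vld1q_u8(buf);           /* first load, before the loop */

        for (i = 16; i + 16 <= len; i += 16) {
            uint8x16_t next = vld1q_u8(buf + i);  /* issue the next load early   */
            /* Consume the block loaded one iteration ago, so the use is
             * kept apart from the load that produced it. */
            acc = vpadalq_u16(acc, vpaddlq_u8(cur));
            cur = next;
        }
        acc = vpadalq_u16(acc, vpaddlq_u8(cur));  /* drain the last loaded block */

        total = vgetq_lane_u32(acc, 0) + vgetq_lane_u32(acc, 1)
              + vgetq_lane_u32(acc, 2) + vgetq_lane_u32(acc, 3);
    }
#endif

    /* Scalar tail (or the whole buffer when NEON is not available). */
    for (; i < len; ++i)
        total += buf[i];

    return total;
}
```

In hand-written assembly the same separation is done explicitly, by placing independent ARM or NEON instructions between the vld1 and the instruction that reads the loaded register.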

Comate answered 19/10, 2012 at 7:3 Comment(3)

Don't remember if Cortex-A8 specifically refers to Texas implementation or not -- but in Texas chips there is also an independent DSP unit that runs fully parallel to arm/neon. – Comate 19/10, 2012 at 7:11

Cortex A8 is the IP from ARM. TI chips such as the OMAP3730 and various others include a C64x DSP. – Steamboat 19/10, 2012 at 7:15

The Cortex-A8 can execute two integer instructions + two NEON instructions (e.g. one load/store and one arithmetic) during the same cycle. However, it can only fetch two instructions per cycle, but a single instruction can take more than one cycle. – Steamboat 19/10, 2012 at 7:16
