Can ARM and NEON work in parallel?

This is with reference to question: Checksum code implementation for Neon in Intrinsics

I am opening the sub-questions listed in the link as separate individual questions, since multiple questions shouldn't be asked in a single thread.

Anyway coming to the question:

Can ARM and NEON (speaking in terms of the ARM Cortex-A8 architecture) actually work in parallel? How can I achieve this?

Could someone point me to, or share, some sample implementations (pseudo-code, algorithms, or code, not theoretical papers or talks) that use ARM and NEON operations together? Implementations with either intrinsics or inline asm will do.

Pyrites answered 5/9, 2012 at 8:37 Comment(3)
Short answer: Yes, as long as they do not share the same memory. Syncing memory access is very slow, and moving bytes from NEON registers to ARM registers is again very slow. What is reasonably fast is moving ARM regs into NEON regs. - Gilstrap
Note that it's only instruction-level parallelism. - Kinny
Also be aware that the NEON unit has a fairly long pipeline, and control or register transfers between the vector unit and the integer unit carry a fairly large penalty. - Idioplasm

The answer depends on the ARM CPU. The Cortex-A8, for example, uses a coprocessor to implement the NEON and VFP instructions, which is connected to the ARM core via a FIFO. When the instruction decoder detects a NEON or VFP instruction, it simply places it into the FIFO. The NEON coprocessor fetches instructions from the FIFO and executes them. The NEON/VFP coprocessor thus lags behind a bit: on the Cortex-A8, by up to 20 cycles or so.

Usually, you don't need to care about that delay, unless you attempt to transfer data back from the NEON/VFP coprocessor to the main ARM core. (It doesn't matter much whether you do that by moving from a NEON/VFP register into an ARM register, or by using ARM instructions to read memory that has recently been written by NEON instructions.) In that case, the main ARM core is stalled until the NEON core has emptied the FIFO, i.e. up to 20 cycles or so.

The ARM core can usually enqueue NEON/VFP instructions faster than the NEON/VFP coprocessor can execute them. You can exploit that to have both cores work in parallel by suitably interleaving your instructions. E.g., insert one ARM instruction after every block of two or three NEON instructions, or maybe two ARM instructions if you also want to exploit ARM's dual-issue capability. You will have to use inline assembly to do this: if you use intrinsics, the exact scheduling of the instructions is up to the compiler, and whether it has the smarts to interleave them suitably is anybody's guess. Your code will look something like

<neon instruction>
<neon instruction>
<neon instruction>
<arm instruction>
<arm instruction>
<neon instruction>
...

I don't have a code sample at hand, but if you're somewhat familiar with ARM assembly, interleaving the instructions shouldn't be much of a challenge. After you're done, be sure to use an instruction-level profiler to check that things actually work as intended. You should see virtually no time spent on the ARM instructions.
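
For illustration, here is a minimal, untested sketch of that pattern in C with GCC-style inline assembly. The function name `sum_and_xor`, its parameters, and the particular ordering of NEON and ARM instructions are assumptions made for this example, not a tuned schedule. The NEON side sums a byte buffer while the ARM side XORs a separate word buffer, and the NEON result is moved back to an ARM register only once, after the loop, so the FIFO-drain stall is paid a single time. On targets without 32-bit ARM NEON, a plain C fallback computes the same results.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum n bytes (n must be a multiple of 16) and
 * XOR n/16 words taken from a separate buffer. */
void sum_and_xor(const uint8_t *bytes, const uint32_t *words, size_t n,
                 uint32_t *sum_out, uint32_t *xor_out)
{
    uint32_t sum = 0, x = 0;
    size_t blocks = n / 16;
    assert(n % 16 == 0);

#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    if (blocks != 0) {
        uint32_t tmp;
        __asm__ volatile(
            "vmov.i32   q0, #0              \n\t" /* NEON: clear 4 x u32 accumulators */
            "1:                             \n\t"
            "vld1.8     {d2, d3}, [%[b]]!   \n\t" /* NEON: load 16 bytes */
            "vpaddl.u8  q1, q1              \n\t" /* NEON: pairwise add to 8 x u16 */
            "ldr        %[t], [%[w]], #4    \n\t" /* ARM:  load one word */
            "vpadal.u16 q0, q1              \n\t" /* NEON: accumulate into 4 x u32 */
            "eor        %[x], %[x], %[t]    \n\t" /* ARM:  fold word into the XOR */
            "subs       %[cnt], %[cnt], #1  \n\t" /* ARM:  decrement block counter */
            "bne        1b                  \n\t" /* ARM:  loop while blocks remain */
            "vadd.i32   d0, d0, d1          \n\t" /* NEON: reduce 4 lanes to 2 */
            "vpadd.i32  d0, d0, d0          \n\t" /* NEON: reduce 2 lanes to 1 */
            "vmov.32    %[s], d0[0]         \n\t" /* the only NEON-to-ARM transfer */
            : [b] "+r"(bytes), [w] "+r"(words), [cnt] "+r"(blocks),
              [x] "+r"(x), [t] "=&r"(tmp), [s] "=&r"(sum)
            :
            : "q0", "q1", "cc", "memory");
    }
#else
    /* Portable fallback with the same results, no interleaving. */
    for (size_t i = 0; i < n; i++)
        sum += bytes[i];
    for (size_t i = 0; i < blocks; i++)
        x ^= words[i];
#endif

    *sum_out = sum;
    *xor_out = x;
}
```

The two buffers are deliberately separate, so the ARM loads never touch memory that NEON is working on. Whether this ordering actually keeps both units busy on a given core is something only a profiler run can confirm.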

Remember that other ARMv7 implementations might implement NEON completely differently. It seems, for example, that the Cortex-A9 has moved NEON closer to the ARM core, and has a much lower penalty on data movements from NEON/VFP back to ARM. Whether or not this affects parallel scheduling of instructions I do not know, but it's definitely something to watch out for.

Microsporophyll answered 5/9, 2012 at 15:52 Comment(1)
Thanks @fgp, that was indeed a very good explanation. It answers a lot of my other questions. I'd be grateful if anyone could point me to some implementations exploiting the features of ARM-NEON using inline assembly. I couldn't find much by googling. - Pyrites
