ARM Cortex-A8: What's the difference between VFP and NEON

In the ARM Cortex-A8 processor, I understand what NEON is: it is a SIMD co-processor.

But does the VFP (Vector Floating Point) unit, which is also a co-processor, work as a SIMD processor? If so, which one is better to use?

I read a few links, such as:

  1. Link1

  2. Link2.

But it is not really clear what they mean. They say that VFP was never intended to be used for SIMD, but on Wikipedia I read the following: "The VFP architecture also supports execution of short vector instructions but these operate on each vector element sequentially and thus do not offer the performance of true SIMD (Single Instruction Multiple Data) parallelism."

It is not so clear what to believe; can anyone elaborate more on this topic?

Blackstock answered 4/11, 2010 at 13:16 Comment(0)

There are quite a few differences between the two. NEON is a SIMD (Single Instruction Multiple Data) accelerator that is part of the ARM core. It means that during the execution of one instruction, the same operation occurs on up to 16 data elements in parallel. Since there is parallelism inside NEON, you can get more MIPS or FLOPS out of it than out of a standard SISD processor running at the same clock rate.

The biggest benefit of NEON comes when you want to execute operations on vectors, e.g. in video encoding/decoding. It can also perform single-precision floating-point (float) operations in parallel.

VFP is a classic floating-point hardware accelerator. It is not a parallel architecture like NEON. Basically it performs one operation on one set of inputs and returns one output. Its purpose is to speed up floating-point calculations. It supports single- and double-precision floating point.

You have three ways to use NEON:

  • use intrinsic functions (#include "arm_neon.h"), as in the sketch below
  • write inline assembly
  • let gcc do the optimizations for you by passing -mfpu=neon as an argument (gcc 4.5 is good at this)
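
As a rough sketch of the intrinsics route (the function and variable names here are made up for illustration), adding two arrays of floats four elements at a time looks like this:

    #include <arm_neon.h>

    /* Add two float arrays with NEON intrinsics, four elements per iteration.
       For brevity this sketch assumes n is a multiple of 4. */
    void add_floats(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);   /* load 4 floats from a */
            float32x4_t vb = vld1q_f32(b + i);   /* load 4 floats from b */
            float32x4_t vr = vaddq_f32(va, vb);  /* 4 additions in one instruction */
            vst1q_f32(out + i, vr);              /* store 4 results */
        }
    }

For ARMv7 this would be built with something like gcc -O2 -mfpu=neon plus whichever float ABI your toolchain expects (e.g. -mfloat-abi=softfp or -mfloat-abi=hard).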
Overplay answered 4/11, 2010 at 13:53 Comment(0)

For the ARMv7 ISA (and variants)

NEON is a SIMD, parallel data-processing unit for integer and floating-point data, and VFP is a fully IEEE-754 compliant floating-point unit. In particular, on the A8 the NEON unit is much faster for just about everything, even if you don't have highly parallel data, since the VFP is non-pipelined.

So why would you ever use the VFP?!

The biggest difference is that the VFP provides double-precision floating point.

Secondly, there are some specialized instructions that the VFP offers for which there are no equivalents in the NEON unit. SQRT comes to mind, and perhaps some type conversions.
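
For instance, double-precision code like the sketch below (names made up for illustration) has to be executed by the VFP on ARMv7, because the NEON unit there only handles single precision; whether the compiler inlines a VSQRT.F64 instruction or calls the library sqrt depends on flags such as -fno-math-errno:

    #include <math.h>

    /* Double-precision work like this runs on the VFP on ARMv7;
       NEON on that architecture has no double-precision support. */
    double vector_length(double x, double y, double z)
    {
        return sqrt(x * x + y * y + z * z);
    }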

But the most important difference not mentioned in Cosmin's answer is that the NEON floating-point pipeline is not entirely IEEE-754 compliant. The best description of the differences is in the FPSCR Register Description.

Because it is not IEEE-754 compliant, a compiler cannot generate these instructions unless you tell it that you are not interested in full compliance. This can be done in several ways.

  1. Using an intrinsic function to force NEON usage; for example, see the GCC Neon Intrinsic Function List.
  2. Asking the compiler, very nicely. Even newer GCC versions with -mfpu=neon will not generate floating-point NEON instructions unless you also specify -funsafe-math-optimizations, as in the sketch below.
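
To illustrate the second point, a plain C loop such as the following (the function name is made up) will normally stay on the VFP; GCC may turn it into NEON single-precision code only once you allow non-IEEE behaviour:

    /* Built with e.g.:
         gcc -O3 -mfpu=neon -mfloat-abi=softfp -funsafe-math-optimizations
       GCC may vectorize this loop with NEON single-precision instructions.
       Without -funsafe-math-optimizations it stays on the IEEE-754 compliant VFP. */
    void scale(float *data, float factor, int n)
    {
        for (int i = 0; i < n; i++)
            data[i] *= factor;
    }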

For the ARMv8+ ISA (and variants) [Update]

NEON is now fully IEEE-754 compliant, and from a programmer's (and compiler's) point of view, there is actually not too much difference. Double precision has been vectorized. From a micro-architecture point of view I kind of doubt they are even different hardware units. ARM does document scalar and vector instructions separately, but both are part of "Advanced SIMD."
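
For example, on AArch64 the same arm_neon.h header exposes double-precision vectors directly (again just a sketch with made-up names):

    #include <arm_neon.h>

    /* On AArch64, Advanced SIMD operates on two doubles per 128-bit register.
       For brevity this sketch assumes n is even. */
    void add_doubles(const double *a, const double *b, double *out, int n)
    {
        for (int i = 0; i < n; i += 2) {
            float64x2_t va = vld1q_f64(a + i);
            float64x2_t vb = vld1q_f64(b + i);
            vst1q_f64(out + i, vaddq_f64(va, vb));
        }
    }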

Guardhouse answered 7/2, 2013 at 22:46 Comment(3)
Another reason to use the VFP is when you need double precision, since NEON does not support double precision. Even when the VFP is not pipelined (e.g. in the Cortex-A8) it will be faster than implementing double in software using NEON (I don't even think double-float using NEON would beat the VFP).Hospitaler
I can't believe I forgot that in my answer. Thanks!Guardhouse
I just learned that ARM64 NEON does support double. I guess it's basically like SSE2 for x86 then.Hospitaler

Architecturally, VFP (it wasn't called Vector Floating Point for nothing) indeed has a provision for operating on a floating-point vector in a single instruction. I don't think it ever actually executes multiple operations simultaneously (like true SIMD), but it could save some code size. However, if you read the ARM Architecture Reference Manual in the Shark help (as I describe in my introduction to NEON, link 1 in the question), you'll see at section A2.6 that the vector feature of VFP is deprecated in ARMv7 (which is what the Cortex-A8 implements), and software should use Advanced SIMD for floating-point vector operations.

Worse yet, in the Cortex-A8 implementation, VFP is implemented with a VFP Lite execution unit (read lite as occupying a smaller silicon surface, not as having fewer features), which means that it's actually slower than on the ARM11, for instance! Fortunately, most single-precision VFP instructions get executed by the NEON unit, but I'm not sure vector VFP operations do; and even if they do, they certainly execute more slowly than with NEON instructions.

Hope that clears things up!

Epimenides answered 5/11, 2010 at 9:33 Comment(2)
Hey Pierre, eye-opening! But I could not get what you mean by Shark help; can you kindly post the link?Blackstock
For obtuse reasons, there is no direct link to the ARM architecture documentation. Instead, I point iOS developers to the local copy they already have, at /Library/Application\ Support/Shark/Helpers/ARM\ Help.app/Contents/Resources/ARMISA.pdf (better yet, that document omits information that's obsolete or irrelevant for iOS development, such as system-level information). If you are not an iOS developer, then go to infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html , sign up for an account, accept the conditions, and download the document.Epimenides

IIRC, the VFP is a floating point coprocessor which works sequentially.

This means that you can use an instruction on a vector of floats for SIMD-like behaviour, but internally the instruction is performed on each element of the vector in sequence.

While this reduces the overall time for the operation, because only a single instruction needs to be loaded, the VFP still needs time to process all the elements of the vector.

True SIMD will gain more net floating-point performance, but using the VFP with vectors is still faster than using it purely sequentially.

Rive answered 4/11, 2010 at 13:47 Comment(0)
