The answer depends on the version of gcc, and it may change in the future. The current code in cortex-a9.md describes the NEON/VFP as a combined unit. The relevant line is,
(define_cpu_unit "ca9_issue_vfp_neon, cortex_a9_ls" "cortex_a9")
With comments,
;; The Cortex-A9 core is modelled as a dual issue pipeline that has
;; the following components.
;; 1. 1 Load Store Pipeline.
;; 2. P0 / main pipeline for data processing instructions.
;; 3. P1 / Dual pipeline for Data processing instructions.
;; 4. MAC pipeline for multiply as well as multiply
;; and accumulate instructions.
;; 5. 1 VFP and an optional Neon unit.
;; The Load/Store, VFP and Neon issue pipeline are multiplexed.
;; The P0 / main pipeline and M1 stage of the MAC pipeline are
;; multiplexed.
;; The P1 / dual pipeline and M2 stage of the MAC pipeline are
;; multiplexed.
;; There are only 4 integer register read ports and hence at any point of
;; time we can't have issue down the E1 and the E2 ports unless
;; of course there are bypass paths that get exercised.
;; Both P0 and P1 have 2 stages E1 and E2.
;; Data processing instructions issue to E1 or E2 depending on
;; whether they have an early shift or not.
The ca9_issue_vfp_neon unit is used to describe both NEON and VFP instructions, so the scheduler will not know that the instructions can be pipelined when costing them. However, it may still emit both, and if you are fortunate they will get pipelined.
In 'arm.c', there are many instances where NEON is used to transfer data. If your code has floating point with many structures, the compiler may intermix NEON and VFP code where the NEON is used to move data.
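As a rough illustration of the kind of code where this mixing could show up, here is a minimal sketch. The struct and function names are hypothetical, and whether the copy is actually done with NEON depends on the GCC version and tuning; this only shows the shape of the source, not a guaranteed result.

```c
/* Hypothetical structure; the names are illustrative only. */
struct sample {
    float gain;
    float offset;
    float data[6];
};

/* The struct assignment is a plain block copy, which the compiler is free
   to implement with NEON loads/stores. The arithmetic is scalar float,
   which would normally be issued as VFP instructions. */
float scale_and_copy(struct sample *dst, const struct sample *src)
{
    *dst = *src;
    return dst->gain * dst->data[0] + dst->offset;
}
```

Compiling this with something like `gcc -O2 -mcpu=cortex-a9 -mfpu=neon-vfpv3 -S` (using whatever ARM cross compiler you have) and reading the generated assembly is the only reliable way to see what your particular GCC does.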
Machines like the Exynos have some custom tuning, such as using NEON for string operations, that your Zynq CPU will not get, as it doesn't have a tuning description in arm.c.
Also, if you don't specify -mfpu=neon-vfpv3, any in-line assembler with 'vfpv3' instructions will be invalid.
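A small sketch of what that means in practice, assuming GCC-style inline assembler and the ARM "t" constraint (a single-precision VFP register). The function name is hypothetical. The immediate form of vmov.f32 is a VFPv3 instruction, so the assembler should only accept it when the selected -mfpu includes VFPv3. The non-ARM branch is there purely so the file also builds on a host machine.

```c
/* Returns 1.0f. On ARM this uses a VFPv3-only instruction
   (vmov.f32 with a floating-point immediate) via inline assembler. */
float vfp_one(void)
{
#if defined(__arm__)
    float x;
    /* "=t" asks GCC for a single-precision VFP register (s0-s31). */
    __asm__("vmov.f32 %0, #1.0" : "=t"(x));
    return x;
#else
    /* Host fallback so the example compiles off-target. */
    return 1.0f;
#endif
}
```

Building this for ARM with and without -mfpu=neon-vfpv3 should show whether your toolchain accepts the instruction.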
Things will change depending on the GCC version. However, you can look for the CPU description in 'cortex-a9.md' to see if the compiler can possibly schedule instructions differently. Also, the 'arm.c' file performs the costing for instructions; if a NEON cost is not implemented there, then the compiler will never emit the instructions.
Having struggled with the simpler ARMv5 DSP instructions, I expect that even if this were to work, only 1-2% of instructions would change. In multi-megabyte images, an option like this will only change a few hundred op-codes for the reasons that others have given (shared registers, 'C' semantics on floating point, etc).
However, if -mfpu=neon-vfpv3 does describe your CPU, why would you not use it for an embedded application? The generic options are meant to generate code that can run on more than one type of device.
-ffast-math for auto-vectorization, because ARMv7 NEON doesn't support denormals, and thus is not fully IEEE compliant. (Or something like that; you need -ffast-math for auto-vectorization, but you can use NEON intrinsics without -ffast-math). – Editorial
-funsafe-math-optimizations. But I wonder whether for non-vectorized code, or code that contains both vectorized and non-vectorized variants, performance could be improved by utilizing both pipelines. – Deceptive
-march=sse+387 which can generate some clunky code, but those use different architectural registers. VFP and NEON use the same registers, so there's much more hope for good code-gen. – Editorial
armv8 64-bit too? – Havard
-ffast-math not needed for armv8. – Havard
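To illustrate the point in the comments about NEON intrinsics not needing -ffast-math, here is a minimal sketch. The function name is hypothetical. It uses the standard arm_neon.h intrinsics when NEON is enabled and falls back to a scalar loop otherwise, so it also builds on a non-ARM host.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Adds two float arrays element-wise: out[i] = a[i] + b[i]. */
void add_floats(float *out, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Explicit NEON: four floats per iteration, no -ffast-math required. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail (or the whole loop when NEON is not enabled). */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Because the intrinsics ask for NEON explicitly, the compiler does not have to prove that vectorizing is safe under IEEE rules, which is the decision the auto-vectorizer would otherwise have to make.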