Is there an advantage of specifying "-mfpu=neon-vfpv3" over "-mfpu=neon" for ARMs with separate pipelines?

My Zynq-7000 ARM Cortex-A9 Processor has both the NEON and the VFPv3 extension and the Zynq-7000-TRM says that the processor is configured to have "Independent pipelines for VFPv3 and advanced SIMD instructions".

So far I compiled my programs with Linaro GCC 6.3-2017.05 and the -mfpu=neon option, to make use of SIMD instructions. But in the case that the compiler also has non-SIMD operations to be issued, will it make a difference to use -mfpu=neon-vfpv3? Will GCC's instruction selection and scheduler emit instructions for both versions, so that it could then make use of both pipelines, to increase utilization of the CPU?

Deceptive answered 12/12, 2017 at 8:54 Comment(6)
You also need -ffast-math for auto-vectorization, because ARMv7 NEON doesn't support denormals, and thus is not fully IEEE compliant. (Or something like that; you need -ffast-math for auto-vectorization, but you can use NEON intrinsics without -ffast-math).Editorial
@PeterCordes yeah, it even vectorized with -funsafe-math-optimizations. But I wonder whether for non-vectorized code, or code that contains both vectorized and non-vectorized variants, performance could be improved by utilizing both pipelines.Deceptive
I only know that kind of microarchitectural detail for x86 CPUs, not ARM. So good question, that's why I upvoted. Hopefully someone will answer. It's slightly similar to x86 -march=sse+387 which can generate some clunky code, but those use different architectural registers. VFP and NEON use the same registers, so there's much more hope for good code-gen.Editorial
@PeterCordes Does this "need -ffast-math for auto-vectorization, because ARMv7 NEON doesn't support denormals, and thus is not fully IEEE compliant" also hold today? Does it hold for armv8 64-bit too?Havard
@Danijel: IIRC, AArch64 SIMD supports denormals aka subnormals. I forget if ARMv7 ever got a control bit for that, but probably ARMv8 in 32-bit mode has IEEE compliant SIMD.Editorial
OK, so -ffast-math not needed for armv8.Havard

Technically, yes.

In reality, no.

NEON has been optional on ARMv7.

The licensees can choose one configuration from below:

  • none
  • VFP only
  • NEON plus VFP

Unlike NEON, there have been different VFP versions on ARMv7, the VFP-lite on the Cortex-A8 being the most notorious one for not being pipelined, and thus extremely slow.

Therefore, it technically makes sense to specify the CPU configuration and the architecture version via compiler options, so that the compiler can generate the most optimized machine code for that particular architecture/configuration.

In reality however, compilers these days ignore most of these build options, and even many directives on top of that.

And the fact that VFP and NEON instructions are assigned to different pipelines won't help much, if at all, since they share the same register bank.

Boosting NEON's performance by utilizing as many registers as possible would bring much more than letting the VFP run in parallel instead.

It puzzles me why and how so many people put so much trust in free compilers these days.

The best ARM compiler available is hands down ARM's that comes with the $6k+ DS-5 Ultimate Edition. Their support is excellent, but I'm not sure if it justifies the price tag.

Ignominy answered 12/12, 2017 at 11:2 Comment(5)
According to this, Cortex-A9 is "partial out-of-order". Does that still mean it needs all of its architectural registers for software pipelining of NEON instructions? Because if not, then depending on the problem, there might be some spare bandwidth (and physical register-file space) to mix in some VFP, if VFP is not horrible on that CPU. Or does that not make sense? (Agreed that I wouldn't be too optimistic about gcc doing this well, though. I'm more interested in whether optimal asm could take advantage.)Editorial
@PeterCordes I don't think that the A9 executes NEON instructions out of order, given the test results I performed myself. Even the Exynos 8890 on the Galaxy S7 doesn't seem to properly utilize the out-of-order engine. I'll be posting these tests on my blog soon, and you will be the first one to be notified. That will be one hell of a myth-buster.Brabazon
Does it at least do register-renaming so you can reuse a register without false dependencies? (i.e. avoid WAW and WAR hazards). Just guessing here, but that could be what they mean by "partial out-of-order". In-order with renaming (Tomasulo's algorithm) wouldn't need an expensive scheduler, just a RAT (register allocation table) and a rename stage in the pipeline. (There are also simpler but less powerful renaming schemes, like scoreboarding).Editorial
On the integer core, most probably yes, but from what I've been seeing so far, I even doubt that the big cluster NEON executes out-of-order, at least on the particular Samsung custom design.Brabazon
Hmm, I didn't think through my previous comment. In-order execution doesn't benefit from register renaming enough to justify it, I don't think. If instructions start in-order, they can still write-back out-of-order if some are higher latency, so WAW is a possible hazard, but not WAR. (I think we can assume every instruction reads its operands before the next one writes back). WAW alone is unlikely: most code doesn't write registers they never read, and in-order exec prevents the write from starting before an instruction stuck waiting for its input. And that's all reg renaming would do.Editorial

ARM's Cortex-A9 NEON/VFP manual (Cortex™-A9 NEON™ Media Processing Engine) says, in section 3.2 Writing optimal VFP and Advanced SIMD code:

The following guidelines can provide significant performance increases for VFP and Advanced SIMD code: Where possible avoid:

  • ...

  • mixing Advanced SIMD only instructions with VFP only instructions.

It says it can execute NEON and VFP instructions in parallel with ARM or Thumb instructions (i.e. scalar integer code), "with the exception of simultaneous loads and stores".

It's not 100% clear if they mean avoid having them in flight at once at all, or if they mean avoid having data dependencies between VFP and NEON instructions. It's easy to imagine the latter being bad for reasons that don't apply to the former (e.g. maybe no bypass forwarding between execution units in different domains).


The cycle timings in the same document indicate that VFP scalar instructions take longer in the pipeline than NEON instructions (even if the latency appears to be the same), so using NEON (which requires -ffast-math) is probably a win even for code that doesn't vectorize. Or if I'm reading this right, NEON has lower-latency MUL, so it may be a win for long dependency chains.

Cortex-A9, if it has VFP, has fully-pipelined VFP FPUs. e.g.

  • VADD/VSUB .F (Sn) or .D (Dn) (VFP): 1c throughput. Inputs needed on cycle 1, results ready on cycle 4. (So 4c latency?)

  • VADD/VSUB Dn (NEON): 1c throughput. Inputs needed on cycle 2, results ready on cycle 5 (write-back on cycle 6). (So 4c or 5c latency?, depending on what consumes the result).

  • VADD/VSUB Qn (NEON): (1 per) 2c throughput. Inputs needed on cycle 2 then 3, results ready on cycle 5 then 6. (Write-back 1c later than that) (So 4c or 5c latency?).

  • VMUL .F Sd,Sn,Sm (VFP): 1c throughput, Inputs needed on cycle 1, results ready on cycle 5. (So 5c latency?)

  • VMUL (VFP) with double-precision isn't listed, only VNMUL (2c throughput).

  • VMUL (NEON): same timings as VADD/VSUB. Maybe not handling denormals allows a shortcut? If I'm reading this right, it's actually lower latency than VFP, except for the instruction needing to issue earlier.

There's also special result-forwarding for multiply-accumulate. See the PDF.

Editorial answered 12/12, 2017 at 11:37 Comment(1)
vmul and vmla/vmls instructions seem to have sort of switching overheads. There was a question on this a few weeks ago. I performed some tests myself, and it seems to be the case. I never read something like this in any document though.Brabazon

The answer will depend on the version of gcc, which may change in the future. The current code in cortex-a9.md describes the NEON/VFP as being a combined unit. The line is,

(define_cpu_unit "ca9_issue_vfp_neon, cortex_a9_ls" "cortex_a9")

With comments,

;; The Cortex-A9 core is modelled as a dual issue pipeline that has
;; the following components.
;; 1. 1 Load Store Pipeline.
;; 2. P0 / main pipeline for data processing instructions.
;; 3. P1 / Dual pipeline for Data processing instructions.
;; 4. MAC pipeline for multiply as well as multiply
;;    and accumulate instructions.
;; 5. 1 VFP and an optional Neon unit.
;; The Load/Store, VFP and Neon issue pipeline are multiplexed.
;; The P0 / main pipeline and M1 stage of the MAC pipeline are
;;   multiplexed.
;; The P1 / dual pipeline and M2 stage of the MAC pipeline are
;;   multiplexed.
;; There are only 4 integer register read ports and hence at any point of
;; time we can't have issue down the E1 and the E2 ports unless
;; of course there are bypass paths that get exercised.
;; Both P0 and P1 have 2 stages E1 and E2.
;; Data processing instructions issue to E1 or E2 depending on
;; whether they have an early shift or not.

And the ca9_issue_vfp_neon unit is used to describe both NEON and VFP instructions. So the scheduler will not know that the instructions can be pipelined when costing them. However, it may still emit both, and you could be fortunate and have them get pipelined.
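
For illustration only, a purely hypothetical sketch of what the machine description might look like if GCC modelled the two issue paths separately (this is not the actual contents of cortex-a9.md, and the unit names are made up):

```lisp
;; Hypothetical: two separate issue units would let the DFA scheduler
;; overlap a VFP instruction with a NEON instruction in the same cycle.
(define_cpu_unit "ca9_issue_vfp, ca9_issue_neon" "cortex_a9")
```

With the single multiplexed ca9_issue_vfp_neon unit that actually exists, the scheduler instead treats every VFP or NEON issue as occupying the same resource.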

In 'arm.c', there are many instances where NEON is used to transfer data. If your code has floating point with many structures, the compiler may intermix NEON and VFP code where the NEON is used to move data.

Machines like the Exynos have some custom tuning, like using NEON for string operations, that your Zynq CPU will not get, as it doesn't have a tuning description in arm.c.

Also, if you don't specify -mfpu=neon-vfpv3, any in-line assembler with 'vfpv3' instructions will be invalid.


Things will change depending on the GCC version. However, you can look for the CPU description in 'cortex-a9.md' to see if the compiler can possibly schedule instructions differently. Also, the 'arm.c' file performs the costing for instructions; if a NEON cost is not implemented there, then the compiler will never emit the instructions.

Having struggled with the simpler ARMv5 DSP instructions, even if this were to work, you would find that only 1-2% of instructions would change. In multi-megabyte images, an option like this will only change a few hundred op-codes, for the reasons that others have given (shared registers, 'C' semantics on floating point, etc).

However, if -mfpu=neon-vfpv3 does describe your CPU why would you not use it for an embedded application? The generic options are meant to generate code that can run on more than one type of device.

Rattlebrain answered 12/12, 2017 at 15:32 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.