ARM NEON intrinsics vs hand-written assembly

https://web.archive.org/web/20170227190422/http://hilbert-space.de/?p=22

On this site, which is quite dated, it is shown that hand-written asm gave a much greater improvement than the intrinsics. I am wondering whether this is still the case now in 2012.

So has the optimization of intrinsics improved in the GNU cross-compiler?

Betseybetsy answered 22/3, 2012 at 18:48 Comment(2)
Hey, my site is not dated. I just have other work to do at the moment. :-)Eyespot
Your site is awesome. I spent a lot of time there when I was trying to figure this stuff out.Plumcot

My experience is that the intrinsics haven't really been worth the trouble. It's too easy for the compiler to inject extra register unload/load steps between your intrinsics. The effort to get it to stop doing that is more complicated than just writing the stuff in raw NEON. I've seen this kind of stuff in pretty recent compilers (including clang 3.1).

At this level, I find you really need to control exactly what's happening. You can have all kinds of stalls if you do things in just barely the wrong order. Doing it in intrinsics feels like surgery with welder's gloves on. If the code is so performance-critical that I need intrinsics at all, then intrinsics aren't good enough. Maybe others have different experiences here.
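
For context, here is a minimal illustrative sketch (not code from this answer) of the kind of intrinsics routine being discussed; in the problem cases the compiler would shuffle values such as va and vb through the stack between intrinsic calls instead of keeping them in q registers:

    /* Minimal illustrative sketch: add two float arrays with NEON intrinsics.
       Assumes n is a positive multiple of 4. */
    #include <arm_neon.h>

    void add_floats(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);      /* load 4 floats of a */
            float32x4_t vb = vld1q_f32(b + i);      /* load 4 floats of b */
            vst1q_f32(dst + i, vaddq_f32(va, vb));  /* add and store */
        }
    }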

Plumcot answered 22/3, 2012 at 19:36 Comment(6)
This matches my experience with ARM/Neon. For x86/SSE and PowerPC/AltiVec the compilers are good enough that SIMD code written with intrinsics is pretty hard to beat with assembler, but the Neon code generation (with gcc at least) does not seem to be anywhere near as good, and it's not hard to beat Neon intrinsics SIMD code by a factor of 2x if you are prepared to hand-code assembler.Hermetic
2x matches my experience, too. We're not talking little tweaks here, and I'm not even that good at it.Plumcot
Ditto - I noticed that a lot of things you can do in assembler to help performance cannot be expressed via intrinsics, so unless the compiler is smart enough to do these things (e.g. address register updates), you're out of luck.Hermetic
Yeah, it seems like the compiler isn't smart enough. I guess I will stick to ARM assembly. Thanks, guys.Betseybetsy
One approach might be to code up in intrinsics initially, measure performance, then go to assembler for any routines that still need a further speed boost.Hermetic
I agree completely. Intrinsics aren't worth the effort at all. It's true that you can force-improve intrinsics' code generation if you know very well how Neon works, but then, you don't need intrinsics at all.Revetment

I've had to use NEON intrinsics in several projects for portability. The truth is that GCC doesn't generate good code from NEON intrinsics. This is not a weakness of using intrinsics, but of the GCC tools. The ARM compiler from Microsoft produces great code from NEON intrinsics and there is no need to use assembly language in that case. Portability and practicality will dictate which you should use. If you can handle writing assembly language then write asm. For my personal projects I prefer to write time-critical code in ASM so that I don't have to worry about a buggy/inferior compiler messing up my code.
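
As an illustration (not the answerer's code), here is how such a time-critical loop might look with the NEON work written as GCC extended inline assembly, keeping instruction order and register usage under the programmer's control. ARMv7 + NEON is assumed, and n must be a positive multiple of 4:

    /* Minimal illustrative sketch: the float-add loop written as GCC
       extended inline assembly instead of intrinsics. */
    void add_floats_asm(float *dst, const float *a, const float *b, int n)
    {
        __asm__ volatile(
            "1:                              \n\t"
            "vld1.32  {d0, d1}, [%[a]]!      \n\t"  /* load 4 floats of a  */
            "vld1.32  {d2, d3}, [%[b]]!      \n\t"  /* load 4 floats of b  */
            "vadd.f32 q0, q0, q1             \n\t"  /* q0 += q1            */
            "vst1.32  {d0, d1}, [%[dst]]!    \n\t"  /* store 4 floats      */
            "subs     %[n], %[n], #4         \n\t"
            "bgt      1b                     \n\t"
            : [dst] "+r"(dst), [a] "+r"(a), [b] "+r"(b), [n] "+r"(n)
            :
            : "d0", "d1", "d2", "d3", "memory", "cc");
    }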

Update: The Apple LLVM compiler falls in between GCC (worst) and Microsoft (best). It doesn't do great with instruction interleaving nor optimal register usage, but at least it generates reasonable code (unlike GCC in some situations).

Update2: The Apple LLVM compiler for ARMv8 has been improved dramatically. It now does a great job generating ARMv8 code from C and intrinsics.

Phlegmy answered 28/3, 2012 at 23:39 Comment(3)
Any reason not to name the compiler that you've found works well? RVDS? Or something else?Plumcot
The other company is Microsoft. Their ARM compiler is top notch. GNU people don't like to hear how MS tools are superior, but it's the truth.Phlegmy
I used to work with GCC, and its optimization of intrinsics is pretty bad. :( I never knew that Microsoft's compiler is so good at it. Let me test my code and see how it does.Substituent

So this question is four years old, now, and still shows up in search results...

In 2016 things are much better.

A lot of simple code that I've transcribed from assembly to intrinsics is now optimised better by the compilers than by me, because I'm too lazy to do the pipeline work (for how many different pipelines now?), while the compilers just need me to pass the right -mtune=.

For complex code where register allocation can get tight, GCC and Clang can both still produce code that is slower than handwritten assembly by a factor of two... or three(ish). It's mostly down to register spills, so you should know from the structure of your code whether that's a risk.

But they both sometimes have disappointing accidents. I'd say that right now that's worth the risk (although I'm paid to take risk), and if you do get hit by something then file a bug. That way things will keep on getting better.

Schoenberg answered 26/7, 2016 at 6:29 Comment(2)
Maybe you are right, the compilers are better these days. But it's still not good enough, and it never will be. As I mentioned above, you can write decently performing routines in intrinsics, provided you know NEON. Unfortunately, the web is flooded with lackluster NEON examples written in intrinsics; AOSP's NEON implementations in particular are a bad joke. It's certainly because they were written carelessly without reading ARM's technical reference manual.Revetment
Status update 2017: my asm 4x4 float matrix multiplication runs almost three times as fast as the intrinsics version, also written by me (Clang, Android Studio 3.01 built-in, build tools version 27.0.1, ARM mode). Intrinsics are still a pure waste of time.Revetment
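
For reference, a 4x4 NEON matrix multiply in intrinsics typically looks something like this minimal sketch (illustrative, not the commenter's code; column-major float[16] matrices assumed):

    /* Minimal illustrative sketch: 4x4 column-major float matrix multiply
       (result = a * b) with NEON intrinsics. */
    #include <arm_neon.h>

    void mat4_mul(float *result, const float *a, const float *b)
    {
        float32x4_t a0 = vld1q_f32(a + 0);    /* columns of a */
        float32x4_t a1 = vld1q_f32(a + 4);
        float32x4_t a2 = vld1q_f32(a + 8);
        float32x4_t a3 = vld1q_f32(a + 12);

        for (int i = 0; i < 4; i++) {
            float32x4_t bi = vld1q_f32(b + 4 * i);            /* column i of b */
            float32x4_t ci = vmulq_n_f32(a0, vgetq_lane_f32(bi, 0));
            ci = vmlaq_n_f32(ci, a1, vgetq_lane_f32(bi, 1));  /* multiply-accumulate */
            ci = vmlaq_n_f32(ci, a2, vgetq_lane_f32(bi, 2));
            ci = vmlaq_n_f32(ci, a3, vgetq_lane_f32(bi, 3));
            vst1q_f32(result + 4 * i, ci);                    /* column i of result */
        }
    }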

By now you even get auto-vectorization for plain C code, and the intrinsics are handled properly: https://godbolt.org/z/AGHupq
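
For example, a plain scalar loop along these lines (illustrative; the godbolt link above has its own example) now gets vectorized to NEON by recent GCC and Clang at -O3:

    /* Minimal illustrative sketch: plain C that recent GCC/Clang can
       auto-vectorize to NEON at -O3 (with e.g. -mfpu=neon on 32-bit ARM). */
    void add_floats_plain(float *restrict dst, const float *restrict a,
                          const float *restrict b, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }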

Unfathomable answered 14/11, 2018 at 10:11 Comment(0)
