Referring to @auselen's answer here: Using ARM NEON intrinsics to add alpha and permute, it looks like the armcc compiler is far better than the gcc compiler for NEON optimizations. Is this really true? I haven't actually tried the armcc compiler, but I got pretty well-optimized code using gcc with the -O3 optimization flag. Now I'm wondering whether armcc is really that good. So which of the two compilers is better, considering all the factors?
Compilers are software as well; they tend to improve over time. Any generic claim like "armcc is better than GCC at NEON" (or, more precisely, at vectorization) can't hold true forever, since one developer group can close the gap with enough attention. Initially, however, it is reasonable to expect compilers developed by hardware companies to be superior, because they need to demonstrate/market these features.
One recent example I saw was here on Stack Overflow, in an answer about branch prediction. Quoting from the last line of the updated section: "This goes to show that even mature modern compilers can vary wildly in their ability to optimize code...".
I am a big fan of GCC, but I wouldn't bet on the quality of its output against compilers from Intel or ARM. I expect any mainstream commercial compiler to produce code at least as good as GCC's.
One empirical answer to this question could be to use hilbert-space's NEON optimization example and see how different compilers optimize it.
void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
    int i;
    uint8x8_t rfac = vdup_n_u8 (77);
    uint8x8_t gfac = vdup_n_u8 (151);
    uint8x8_t bfac = vdup_n_u8 (28);
    n /= 8;

    for (i = 0; i < n; i++)
    {
        uint16x8_t temp;
        uint8x8x3_t rgb = vld3_u8 (src);
        uint8x8_t result;

        temp = vmull_u8 (rgb.val[0], rfac);
        temp = vmlal_u8 (temp, rgb.val[1], gfac);
        temp = vmlal_u8 (temp, rgb.val[2], bfac);
        result = vshrn_n_u16 (temp, 8);
        vst1_u8 (dest, result);
        src += 8*3;
        dest += 8;
    }
}
This is armcc 5.01
20: f421140d vld3.8 {d1-d3}, [r1]!
24: e2822001 add r2, r2, #1
28: f3810c04 vmull.u8 q0, d1, d4
2c: f3820805 vmlal.u8 q0, d2, d5
30: f3830806 vmlal.u8 q0, d3, d6
34: f2880810 vshrn.i16 d0, q0, #8
38: f400070d vst1.8 {d0}, [r0]!
3c: e1520003 cmp r2, r3
40: bafffff6 blt 20 <neon_convert+0x20>
This is GCC 4.4.3-4.7.1
1e: f961 040d vld3.8 {d16-d18}, [r1]!
22: 3301 adds r3, #1
24: 4293 cmp r3, r2
26: ffc0 4ca3 vmull.u8 q10, d16, d19
2a: ffc1 48a6 vmlal.u8 q10, d17, d22
2e: ffc2 48a7 vmlal.u8 q10, d18, d23
32: efc8 4834 vshrn.i16 d20, q10, #8
36: f940 470d vst1.8 {d20}, [r0]!
3a: d1f0 bne.n 1e <neon_convert+0x1e>
These look extremely similar, so we have a draw. After seeing this, I tried the add-alpha-and-permute example mentioned in the question again.
void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix)
{
    numPix /= 8; //process 8 pixels at a time
    uint8x8_t alpha = vdup_n_u8 (0xff);

    for (int i = 0; i < numPix; i++)
    {
        uint8x8x3_t rgb = vld3_u8 (src);
        uint8x8x4_t bgra;

        bgra.val[0] = rgb.val[2]; //these lines are slow
        bgra.val[1] = rgb.val[1]; //these lines are slow
        bgra.val[2] = rgb.val[0]; //these lines are slow
        bgra.val[3] = alpha;
        vst4_u8(dst, bgra);
        src += 8*3;
        dst += 8*4;
    }
}
Compiling with gcc...
$ arm-linux-gnueabihf-gcc --version
arm-linux-gnueabihf-gcc (crosstool-NG linaro-1.13.1-2012.05-20120523 - Linaro GCC 2012.05) 4.7.1 20120514 (prerelease)
$ arm-linux-gnueabihf-gcc -std=c99 -O3 -c ~/temp/permute.c -marm -mfpu=neon-vfpv4 -mcpu=cortex-a9 -o ~/temp/permute_gcc.o
00000000 <neonPermuteRGBtoBGRA>:
0: e3520000 cmp r2, #0
4: e2823007 add r3, r2, #7
8: b1a02003 movlt r2, r3
c: e92d01f0 push {r4, r5, r6, r7, r8}
10: e1a021c2 asr r2, r2, #3
14: e24dd01c sub sp, sp, #28
18: e3520000 cmp r2, #0
1c: da000019 ble 88 <neonPermuteRGBtoBGRA+0x88>
20: e3a03000 mov r3, #0
24: f460040d vld3.8 {d16-d18}, [r0]!
28: eccd0b06 vstmia sp, {d16-d18}
2c: e59dc014 ldr ip, [sp, #20]
30: e2833001 add r3, r3, #1
34: e59d6010 ldr r6, [sp, #16]
38: e1530002 cmp r3, r2
3c: e59d8008 ldr r8, [sp, #8]
40: e1a0500c mov r5, ip
44: e59dc00c ldr ip, [sp, #12]
48: e1a04006 mov r4, r6
4c: f3c73e1f vmov.i8 d19, #255 ; 0xff
50: e1a06008 mov r6, r8
54: e59d8000 ldr r8, [sp]
58: e1a0700c mov r7, ip
5c: e59dc004 ldr ip, [sp, #4]
60: ec454b34 vmov d20, r4, r5
64: e1a04008 mov r4, r8
68: f26401b4 vorr d16, d20, d20
6c: e1a0500c mov r5, ip
70: ec476b35 vmov d21, r6, r7
74: f26511b5 vorr d17, d21, d21
78: ec454b34 vmov d20, r4, r5
7c: f26421b4 vorr d18, d20, d20
80: f441000d vst4.8 {d16-d19}, [r1]!
84: 1affffe6 bne 24 <neonPermuteRGBtoBGRA+0x24>
88: e28dd01c add sp, sp, #28
8c: e8bd01f0 pop {r4, r5, r6, r7, r8}
90: e12fff1e bx lr
Compiling with armcc...
$ armcc
ARM C/C++ Compiler, 5.01 [Build 113]
$ armcc --C99 --cpu=Cortex-A9 -O3 -c permute.c -o permute_arm.o
00000000 <neonPermuteRGBtoBGRA>:
0: e1a03fc2 asr r3, r2, #31
4: f3870e1f vmov.i8 d0, #255 ; 0xff
8: e0822ea3 add r2, r2, r3, lsr #29
c: e1a031c2 asr r3, r2, #3
10: e3a02000 mov r2, #0
14: ea000006 b 34 <neonPermuteRGBtoBGRA+0x34>
18: f420440d vld3.8 {d4-d6}, [r0]!
1c: e2822001 add r2, r2, #1
20: eeb01b45 vmov.f64 d1, d5
24: eeb02b46 vmov.f64 d2, d6
28: eeb05b40 vmov.f64 d5, d0
2c: eeb03b41 vmov.f64 d3, d1
30: f401200d vst4.8 {d2-d5}, [r1]!
34: e1520003 cmp r2, r3
38: bafffff6 blt 18 <neonPermuteRGBtoBGRA+0x18>
3c: e12fff1e bx lr
In this case armcc produces much better code. I think this justifies fgp's answer. Most of the time GCC will produce good-enough code, but you should keep an eye on the critical parts, and most importantly you must measure / profile first.
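If profiling confirms that the compiler-generated permute really is the bottleneck and reorganizing the intrinsics doesn't help, one escape hatch is a small piece of GCC inline assembly that pins down a sequence close to what armcc produced above. The following is an untested sketch of that idea (the function name and the vswp-based register shuffle are mine, not from the original post); it trades portability and readability for control over code generation:

void neonPermuteRGBtoBGRA_asm(unsigned char* src, unsigned char* dst, int numPix)
{
    for (int i = 0; i < numPix / 8; i++)
    {
        __asm__ volatile (
            "vmov.i8 d6, #0xff            \n\t"  /* alpha (hoist out of the loop in real code) */
            "vld3.8  {d3-d5}, [%[src]]!   \n\t"  /* d3=R, d4=G, d5=B; src += 24 */
            "vswp    d3, d5               \n\t"  /* d3=B, d5=R */
            "vst4.8  {d3-d6}, [%[dst]]!   \n\t"  /* interleave B,G,R,A; dst += 32 */
            : [src] "+r" (src), [dst] "+r" (dst)
            :
            : "d3", "d4", "d5", "d6", "memory");
    }
}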
If you use NEON intrinsics, the compiler shouldn't matter that much. Most (if not all) NEON intrinsics translate to a single NEON instruction, so the only thing left to the compiler is register allocation and instruction scheduling. In my experience, both GCC 4.2 and Clang 3.1 do reasonably well at those tasks.
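For instance, a trivial wrapper like the following (a minimal sketch, names are mine) should compile down to a single vadd.i32 instruction, leaving the compiler nothing to decide beyond which q registers to use:

#include <arm_neon.h>

uint32x4_t add4(uint32x4_t a, uint32x4_t b)
{
    return vaddq_u32(a, b);   /* expected: one vadd.i32 qD, qN, qM */
}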
Note, however, that the NEON instructions are a bit more expressive than the NEON intrinsics. For example, NEON load/store instructions have post-increment addressing modes which combine a load or store with an increment of the address register, thus saving you one instruction. The NEON intrinsics don't provide an explicit way to do that; instead they rely on the compiler to combine a regular NEON load/store intrinsic and an address increment into a load/store instruction with post-increment. Similarly, some load/store instructions allow you to specify the alignment of the memory address, and execute faster if you specify stricter alignment guarantees. The NEON intrinsics, again, don't allow you to specify alignment explicitly, but instead rely on the compiler to deduce the correct alignment specifier. In theory, you can use "align" attributes on your pointers to provide suitable hints to the compiler, but at least Clang seems to ignore those...
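With newer GCC and Clang you can at least try __builtin_assume_aligned to communicate the alignment. Whether the backend actually turns the hint into an aligned specifier such as [rN:128] is not guaranteed, so treat the following as a sketch to verify against the disassembly, not a recipe (the function name is mine; passing a misaligned pointer here is undefined behaviour):

#include <arm_neon.h>

void copy16(uint8_t *dst, const uint8_t *src, int n)
{
    uint8_t *d = (uint8_t *) __builtin_assume_aligned(dst, 16);
    const uint8_t *s = (const uint8_t *) __builtin_assume_aligned(src, 16);

    for (int i = 0; i < n / 16; i++)
    {
        uint8x16_t v = vld1q_u8(s);   /* ideally vld1.8 {..}, [rN:128] */
        vst1q_u8(d, v);               /* ideally vst1.8 {..}, [rN:128] */
        s += 16;
        d += 16;
    }
}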
In my experience, neither Clang nor GCC is very bright when it comes to those kinds of optimizations. Fortunately, the additional performance benefit of these optimizations usually isn't all that high - more like 10% than 100%.
Another area where those two compilers aren't particularly smart is avoiding stack spills. If your code uses more vector-valued variables than there are NEON registers, I've seen both compilers produce horrible code. Basically, what they seem to do is schedule instructions on the assumption that there are enough registers available. Register allocation seems to come afterwards, and seems to simply spill values to the stack once it runs out of registers. So make sure your code has a working set of fewer than 16 128-bit vectors or 32 64-bit vectors at any time!
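A hypothetical illustration of that rule (not from the original post): structure the loop so that only a couple of vector variables are live per iteration, for example by accumulating one row at a time instead of keeping one accumulator per row alive across the whole loop body:

#include <arm_neon.h>

void sum16rows(uint16_t *out, const uint8_t *rows, int width)
{
    for (int x = 0; x + 8 <= width; x += 8)
    {
        uint16x8_t acc = vdupq_n_u16(0);
        for (int r = 0; r < 16; r++)
        {
            /* only 'acc' and 'v' are live here - no spilling needed */
            uint8x8_t v = vld1_u8(rows + r * width + x);
            acc = vaddw_u8(acc, v);
        }
        vst1q_u16(out + x, acc);
    }
}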
Overall, I've gotten pretty good results from both GCC and Clang, but I regularly had to reorganize the code a bit to avoid compiler idiosyncrasies. My advice would be to stick with GCC or Clang, but check the generated code regularly with the disassembler of your choice.
So, overall, I'd say sticking with GCC is fine. You might want to look at the disassembly of the performance-critical parts, though, and check whether it looks reasonable.