Why is ARM NEON not faster than plain C++?

Here is the C++ code:

#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )

void cpp_tst_add( unsigned* x, unsigned* y )
{
    for ( register int i = 0; i < ARR_SIZE_TEST; ++i )
    {
        x[ i ] = x[ i ] + y[ i ];
    }
}

Here is the NEON version:

void neon_assm_tst_add( unsigned* x, unsigned* y )
{
    register unsigned i = ARR_SIZE_TEST >> 2;

    __asm__ __volatile__
    (
        ".loop1:                            \n\t"

        "vld1.32   {q0}, [%[x]]             \n\t"
        "vld1.32   {q1}, [%[y]]!            \n\t"

        "vadd.i32  q0 ,q0, q1               \n\t"
        "vst1.32   {q0}, [%[x]]!            \n\t"

        "subs     %[i], %[i], $1            \n\t"
        "bne      .loop1                    \n\t"

        : [x]"+r"(x), [y]"+r"(y), [i]"+r"(i)
        :
        : "memory"
    );
}

Test function:

void bench_simple_types_test( )
{
    unsigned* a = new unsigned [ ARR_SIZE_TEST ];
    unsigned* b = new unsigned [ ARR_SIZE_TEST ];

    cpp_tst_add( a, b );
    neon_assm_tst_add( a, b );
}

I have tested both variants and here is the report:

add, unsigned, C++       : 176 ms
add, unsigned, neon asm  : 185 ms // SLOW!!!

I also tested other types:

add, float,    C++       : 571 ms
add, float,    neon asm  : 184 ms // FASTER X3!

THE QUESTION: Why is NEON slower with 32-bit integer types?

I used the latest version of GCC from the Android NDK. NEON optimization flags were turned on. Here is the disassembled C++ version:

                 MOVS            R3, #0
                 PUSH            {R4}

 loc_8
                 LDR             R4, [R0,R3]
                 LDR             R2, [R1,R3]
                 ADDS            R2, R4, R2
                 STR             R2, [R0,R3]
                 ADDS            R3, #4
                 CMP.W           R3, #0x2000000
                 BNE             loc_8
                 POP             {R4}
                 BX              LR

Here is the disassembled NEON version:

                 MOV.W           R3, #0x200000
.loop1
                 VLD1.32         {D0-D1}, [R0]
                 VLD1.32         {D2-D3}, [R1]!
                 VADD.I32        Q0, Q0, Q1
                 VST1.32         {D0-D1}, [R0]!
                 SUBS            R3, #1
                 BNE             .loop1
                 BX              LR

Here are all the benchmark results:

add, char,     C++       : 83  ms
add, char,     neon asm  : 46  ms FASTER x2

add, short,    C++       : 114 ms
add, short,    neon asm  : 92  ms FASTER x1.25

add, unsigned, C++       : 176 ms
add, unsigned, neon asm  : 184 ms SLOWER!!!

add, float,    C++       : 571 ms
add, float,    neon asm  : 184 ms FASTER x3

add, double,   C++       : 533 ms
add, double,   neon asm  : 420 ms FASTER x1.25

THE QUESTION: Why is NEON slower with 32-bit integer types?

Dud answered 20/4, 2011 at 12:7 Comment(9)
@Cody there's a question in the subject, maybe that?Damalus
Is the C++ faster for all integer types? I think your assembly just isn't as optimal as you'd hoped for integer types.Hohenstaufen
The question is: why is NEON slower with 32-bit integer types?Dud
@Hohenstaufen I have updated bench report for all types.Dud
I don't know ARM assembler at all but it appears to me as though you're running through the loop 4 times more often than the C++ version. Yes?Undershot
@Undershot No. Exactly the other way around.Dud
This is indeed strange. But something's not quite right -- in your source code, ARR_SIZE_TEST is a hefty 8 million, but in the assembly output, it seems to be 0x12C0/4 = 1200. Why is that? Your timings would carry more weight for the larger value.Ruin
For those who are confused: NEON is a SIMD extension for ARM that allows 128-bit operations, i.e. 4 32-bit operations at a time. One would expect it to be faster than non-SIMD instructions in all cases. arm.com/products/processors/technologies/neon.phpMasked
@Ruin I'm sorry, those were old disassembled samples. I've fixed the asm listings.Dud

The NEON pipeline on Cortex-A8 executes in-order and has limited hit-under-miss (no renaming), so you're limited by memory latency (you're touching far more data than the L1/L2 cache size). Your code has immediate dependencies on the values loaded from memory, so it will stall constantly waiting for memory. That explains why the NEON code is slightly slower than the non-NEON code.

You need to unroll the assembly loops and increase the distance between load and use, e.g.:

vld1.32   {q0}, [%[x]]!
vld1.32   {q1}, [%[y]]!
vld1.32   {q2}, [%[x]]!
vld1.32   {q3}, [%[y]]!
vadd.i32  q0, q0, q1
vadd.i32  q2, q2, q3
...

There are plenty of NEON registers, so you can unroll it a lot. Integer code suffers from the same issue, but to a lesser extent, because A8 integer has better hit-under-miss instead of stalling. The bottleneck is going to be memory bandwidth/latency for benchmarks this large compared to the L1/L2 caches. You might also want to run the benchmark at smaller sizes (4 KB..256 KB) to see the effects when data is cached entirely in L1 and/or L2.
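
Putting that together, a minimal sketch of an x2-unrolled version as a complete function, assuming the same GCC extended-asm setup as in your question. The separate write pointer and the numeric label are my additions (so all three pointers can post-increment), and the clobber list is filled in:

void neon_assm_tst_add_unrolled( unsigned* x, unsigned* y )
{
    unsigned* xw = x;                    // trailing write pointer (assumed)
    unsigned  i  = ARR_SIZE_TEST >> 3;   // 8 unsigneds per iteration now

    __asm__ __volatile__
    (
        "1:                                 \n\t"
        "vld1.32   {d0-d3}, [%[x]]!         \n\t"   // q0,q1 <- 8 ints of x
        "vld1.32   {d4-d7}, [%[y]]!         \n\t"   // q2,q3 <- 8 ints of y
        "vadd.i32  q0, q0, q2               \n\t"
        "vadd.i32  q1, q1, q3               \n\t"
        "vst1.32   {d0-d3}, [%[xw]]!        \n\t"
        "subs      %[i], %[i], #1           \n\t"
        "bne       1b                       \n\t"

        : [x]"+r"(x), [xw]"+r"(xw), [y]"+r"(y), [i]"+r"(i)
        :
        : "q0", "q1", "q2", "q3", "cc", "memory"
    );
}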

Yeomanly answered 20/4, 2011 at 17:7 Comment(1)
Thanks for the reply. I have unrolled the loop to use 16 128-bit registers in one iteration. It sped up the 32-bit integer case. The times are now: add, unsigned, C++ : 180 ms; add, unsigned, neon asm : 117 ms.Dud

Although you're limited by latency to main memory in this case, it's not exactly obvious that the NEON version would be slower than the integer version.

Using the cycle calculator here:

http://pulsar.webshaker.net/ccc/result.php?lng=en

Your code should take 7 cycles before the cache-miss penalties. It's slower than you might expect because you're using unaligned loads and because of the latency between the add and the store.

Meanwhile, the compiler-generated loop takes 6 cycles (and it's not very well scheduled or optimized in general, either). But it's doing one fourth as much work.

The cycle counts from the script might not be perfect, but I don't see anything that looks blatantly wrong, so I think they'd at least be close. There's potential for taking an extra cycle on the branch if you max out fetch bandwidth (also if the loops aren't 64-bit aligned), but in this case there are plenty of stalls to hide that.

The answer isn't that integer on Cortex-A8 has more opportunities to hide latency. In fact, it normally has fewer, because of NEON's staggered pipeline and issue queue. Of course, this is only true on Cortex-A8 - on Cortex-A9 the situation may well be reversed (NEON is dispatched in-order and in parallel with integer, while integer has out-of-order capabilities). Since you tagged this Cortex-A8, I'm assuming that's what you're using.

This calls for more investigation. Here are some ideas why this could be happening:

  • You're not specifying any kind of alignment on your arrays, and while I expect new to align to 8 bytes, it might not align to 16 bytes. Let's say you really are getting arrays that aren't 16-byte aligned. Then your 16-byte accesses would split across cache lines, which could carry an additional penalty (especially on misses).
  • A cache miss happens right after a store; I don't believe Cortex-A8 has any memory disambiguation, and it therefore must assume that the load could be from the same line as the store, requiring the write buffer to drain before the L2-missing load can happen. Because there's a much bigger pipeline distance between NEON loads (which are initiated in the integer pipeline) and stores (initiated at the end of the NEON pipeline) than between integer ones, there'd potentially be a longer stall.
  • Because you're loading 16 bytes per access instead of 4, the critical-word size is larger, and therefore the effective latency for a critical-word-first line fill from main memory is going to be higher (L2 to L1 is supposed to be on a 128-bit bus, so it shouldn't have the same problem).

You asked what good NEON is in cases like this - in reality, NEON is especially good for these cases where you're streaming to/from memory. The trick is to use preloading to hide as much of the main-memory latency as possible. Preload gets memory into L2 (not L1) cache ahead of time. Here NEON has a big advantage over integer because it can hide a lot of the L2 cache latency, due to its staggered pipeline and issue queue, but also because it has a direct path to L2. I'd expect you to see effective L2 latency down to 0-6 cycles, and less if you have fewer dependencies and don't exhaust the load queue, while on integer you can be stuck with a good ~16 cycles that you can't avoid (probably depends on the Cortex-A8, though).

So I would recommend that you align your arrays to the cache-line size (64 bytes), unroll your loops to do at least one cache line at a time, use aligned loads/stores (put :128 after the address), and add a pld instruction that prefetches several cache lines ahead. As for how many lines ahead: start small and keep increasing until you no longer see any benefit.
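
A minimal sketch of all of that combined (the posix_memalign allocation, the one-line-per-iteration unroll, and the #192 prefetch distance are illustrative starting points to tune, not measured choices):

#include <cstdlib>   // posix_memalign

// 64-byte-aligned buffers so the :128 hints below are legal.
unsigned* alloc_aligned( size_t count )
{
    void* p = 0;
    if ( posix_memalign( &p, 64, count * sizeof( unsigned ) ) != 0 )
        return 0;
    return static_cast<unsigned*>( p );
}

// One 64-byte cache line of x (16 unsigneds) per iteration, aligned
// accesses, and a pld three lines ahead as a tuning starting point.
void neon_add_lines( unsigned* x, unsigned* y )
{
    unsigned* xw = x;                   // trailing write pointer
    unsigned  i  = ARR_SIZE_TEST >> 4;  // 16 unsigneds per iteration

    __asm__ __volatile__
    (
        "1:                                  \n\t"
        "pld       [%[x], #192]              \n\t"
        "pld       [%[y], #192]              \n\t"
        "vld1.32   {d0-d3},  [%[x]:128]!     \n\t"  // q0,q1 <- x
        "vld1.32   {d4-d7},  [%[x]:128]!     \n\t"  // q2,q3 <- x
        "vld1.32   {d8-d11}, [%[y]:128]!     \n\t"  // q4,q5 <- y
        "vld1.32   {d12-d15}, [%[y]:128]!    \n\t"  // q6,q7 <- y
        "vadd.i32  q0, q0, q4                \n\t"
        "vadd.i32  q1, q1, q5                \n\t"
        "vadd.i32  q2, q2, q6                \n\t"
        "vadd.i32  q3, q3, q7                \n\t"
        "vst1.32   {d0-d3},  [%[xw]:128]!    \n\t"
        "vst1.32   {d4-d7},  [%[xw]:128]!    \n\t"
        "subs      %[i], %[i], #1            \n\t"
        "bne       1b                        \n\t"

        : [x]"+r"(x), [xw]"+r"(xw), [y]"+r"(y), [i]"+r"(i)
        :
        : "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "cc", "memory"
    );
}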

Chartography answered 30/5, 2011 at 17:36 Comment(4)
This isn't due to unaligned loads - that wouldn't explain the huge difference, especially as the integer is unaligned too. Cortex-A8 does have disambiguation and will allow several load/store misses. The root cause is that the A8 NEON pipeline doesn't have hit-under-miss, so you need to unroll loops.Yeomanly
The integer pipeline doesn't have hit under miss either. NEON, on the other hand, can fill its load queue out of order (before the NEON pipeline begins), which allows it to hit L1 while an L2 miss is being serviced. The integer stores wouldn't be unaligned because malloc won't return memory not aligned by 4 bytes. Therefore no integer stores will cross cache line boundaries. But the root cause of this being slower than the integer version isn't due to lack of unrolling, because the integer version isn't unrolled either.Chartography
One other reasonable question is whether the source and destination overlap (particularly whether they're the same). I doubt NEON has any kind of store-to-load forwarding, which would mean a big round trip, bigger than it is for integer.Chartography
I think this has nothing to do with alignment. The :128 qualifier on the NEON instruction automatically helps align the data in the cache. Correct me if I am wrong. :)Monarchism

Your C++ code isn't optimized either.

#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )

void cpp_tst_add( unsigned* x, unsigned* y )
{
    unsigned int i = ARR_SIZE_TEST;
    do
    {
        *x++ += *y++;
    } while ( --i );
}

This version consumes two fewer cycles per iteration.

Besides, your benchmark results don't surprise me at all.

32bit :

This function is too simple for NEON. There aren't enough arithmetic operations to leave any room for optimization.

Yes, it's so simple that both the C++ and NEON versions suffer from pipeline hazards almost every time, without any real chance of benefitting from the dual-issue capability.

While the NEON version benefits from processing 4 integers at once, it also suffers much more from every hazard. That's all.

8bit :

ARM is VERY slow at reading individual bytes from memory. That means that while NEON shows the same characteristics as with 32-bit, ARM lags heavily.

16bit : The same here, except that ARM's 16-bit reads aren't THAT bad.

float : The C++ version will compile into VFP code. And there isn't a full VFP on Cortex-A8, but VFP Lite, which doesn't pipeline anything, which sucks.

It's not that NEON behaves strangely when processing 32-bit. It's just that ARM meets the ideal condition there. Your function is very inappropriate for benchmarking purposes due to its simplicity. Try something more complex, like YUV-to-RGB conversion.

FYI, my fully optimized NEON version runs roughly 20 times as fast as my fully optimized C version, and 8 times as fast as my fully optimized ARM assembly version. I hope that gives you some idea of how powerful NEON can be.

Last but not least, the ARM instruction PLD is NEON's best friend. Placed properly, it will bring at least a 40% performance boost.
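
A sketch of typical placement: the prefetch sits at the top of the loop, reading ahead of the current position, so the next line arrives while the present iteration computes (the register assignments and the #64 distance are illustrative only):

.loop:
    pld       [r0, #64]          @ r0 = source pointer; prefetch the next line
    vld1.32   {d0-d3}, [r0]!     @ load 8 words while the prefetch is in flight
    @ ... NEON arithmetic on q0, q1 goes here ...
    subs      r2, r2, #1         @ r2 = remaining iterations
    bne       .loop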

Goraud answered 2/11, 2011 at 13:2 Comment(6)
Your benchmark values seem interesting! Are those the numbers for YUV-RGB conversion? 7-8 times faster is what I get. 20 times is pretty interesting!Monarchism
@Anoop : Maybe my C version wasn't good enough? :) I forgot to mention that it was YUV420, planar Y and packed UV. On packed YUV422, I might not have got that performance boost. Converting a VGA image takes less than 1 ms on my iPhone4.Cockfight
I have been learning about NEON for the last couple of months but had never used the PLD instruction. Your benchmarks were pretty interesting; I will update here about the performance boost I get. Btw, I am working on a BeagleBoard.Monarchism
PLD, when placed appropriately, will single-handedly bring about a 40% speed boost, assuming you are dealing with large enough blocks of data. Just read far ahead. pld [pSrc, #64] at the start of the loop is most common.Cockfight
Thanks for the help. Will be looking forward to it. :)Monarchism
Since that is faster than OP's, I suggest the compiler needs (needed!) some work ;-). All the changes you made should be candidates arising from basic loop analysis of the original, more explicit form. I used to write code like that [esp. do{}while(--n)] because in the 80s, 90's it made a big difference. Then there was a period of years when loops that didn't fit the 'normal' for idiom (as used by OP) would be penalized because the compilers didn't bother to analyze them as much. So I generally stopped doing that stuff. Ques is > 4 years old now, so things are likely different again now.Hotshot

You can try some modifications to improve the code.

If you can:

  • use a third buffer to store the results
  • try to align the data on 8 bytes

The code should be something like this (sorry, I don't know the gcc inline syntax by heart):

.loop1:
 vld1.32   {q0}, [%[x]:128]!
 vld1.32   {q1}, [%[y]:128]!
 vadd.i32  q0, q0, q1
 vst1.32   {q0}, [%[z]:128]!
 subs     %[i], %[i], $1
bne      .loop1

As Exophase says, you have some pipeline latency. Maybe you can try:

vld1.32   {q0}, [%[x]:128]!
vld1.32   {q1}, [%[y]:128]!

sub     %[i], %[i], $1

.loop1:
vadd.i32  q2, q0, q1

vld1.32   {q0}, [%[x]:128]!
vld1.32   {q1}, [%[y]:128]!

vst1.32   {q2}, [%[z]:128]!
subs     %[i], %[i], $1
bne      .loop1

vadd.i32  q2, q0, q1
vst1.32   {q2}, [%[z]:128]!

Finally, it is clear that you'll saturate the memory bandwidth.

You can try to add a small

PLD [%[x], #192]

into your loop.

Tell us if it's better...

Enuresis answered 7/6, 2011 at 7:19 Comment(0)

8 ms of difference is SO small that you are probably measuring artifacts of the caches or pipelines.

EDIT: Did you try comparing with something like this for types such as float, short, etc.? I'd expect the compiler to optimize it even better and narrow the gap. Also, in your test you run the C++ version first and then the ASM version; this can have an impact on performance, so I'd write two different programs to be fairer.

for ( register int i = 0; i < ARR_SIZE_TEST; i += 4 )
{
    x[ i ] = x[ i ] + y[ i ];
    x[ i+1 ] = x[ i+1 ] + y[ i+1 ];
    x[ i+2 ] = x[ i+2 ] + y[ i+2 ];
    x[ i+3 ] = x[ i+3 ] + y[ i+3 ];
}

Last thing: in the signature of your function, you use unsigned* instead of unsigned[]. The latter is preferred because the compiler supposes that the arrays do not overlap and is allowed to reorder accesses. Also try the restrict keyword for even better protection against aliasing.
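
A sketch of the restrict variant (using __restrict, the common GCC/Clang spelling of the keyword in C++):

#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )

// __restrict promises the compiler that x and y never alias, which
// licenses it to reorder and vectorize the loads and stores.
void cpp_tst_add( unsigned* __restrict x, unsigned* __restrict y )
{
    for ( int i = 0; i < ARR_SIZE_TEST; ++i )
    {
        x[ i ] = x[ i ] + y[ i ];
    }
}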

Souvaine answered 20/4, 2011 at 16:52 Comment(7)
Yes, but why isn't it 2 or 3 times faster?Ruin
Because of memory bandwidth. You are probably going as fast as you can in terms of bus transfers.Souvaine
I'm not an expert, but I'd say you need more complex examples to actually see an advantage, both in terms of the amount of work you do with the data (a simple + is not CPU intensive) and the number of operations (several billion instead of several million). And I'd expect a 10-30% improvement, not 200%.Souvaine
200% is realistic for some workloads. The examples are just pathological cases: poor load-use separation, and 100% cache miss.Yeomanly
I don't think it is a matter of workload; it's more of a "what you do with the data is not CPU intensive" problem.Souvaine
@Darhuuk And it isn't even optimized. I'd say 1500~2000% is quite possible through loop unrolling, dual-issuing, and scheduling, in addition to the cache preload.Cockfight
unsigned arr[] does not imply no-aliasing; compilers treat it identically to unsigned *arr, as required by ISO C++. Only __restrict makes a difference (as an extension in most C++ compilers like gcc and clang).Clayberg
