Learning about ARM NEON intrinsics, I was timing a function that I wrote to double the elements in an array.The version that used the intrinsics takes more time than a plain C version of the function.
Without NEON :
void double_elements(unsigned int *ptr, unsigned int size)
{
unsigned int loop;
for( loop= 0; loop<size; loop++)
ptr[loop]<<=1;
return;
}
With NEON:
void double_elements(unsigned int *ptr, unsigned int size)
{
unsigned int i;
uint32x4_t Q0,vector128Output;
for( i=0;i<(SIZE/4);i++)
{
Q0=vld1q_u32(ptr);
Q0=vaddq_u32(Q0,Q0);
vst1q_u32(ptr,Q0);
ptr+=4;
}
return;
}
Wondering if the load/store operations between the array and vector is consuming more time which offsets the benefit of the parallel addition.
UPDATE:More Info in response to Igor's reply.
1.The code is posted here:
plain.c
plain.s
neon.c
neon.s
From the section(label) L7 in both the assembly listings,I see that the neon version has more number of assembly instructions.(hence more time taken?)
2.I compiled using -mfpu=neon on arm-gcc, no other flags or optimizations.For the plain version, no compiler flags at all.
3.That was a typo, SIZE was meant to be size;both are same.
4,5.Tried on an array of 4000 elements. I timed using gettimeofday() before and after the function call.NEON=230us,ordinary=155us.
6.Yes I printed the elements in each case.
7.Did this, no improvement whatsoever.