Checksum code implementation for Neon in Intrinsics
Asked Answered
M

1

1

I'm trying to implement the checksum computation code(2's complement addition) for NEON, using intrinsic. The current checksum computation is being carried out on ARM.

My implementation fetches 128-bits at once from the memory into NEON registers and does SIMD (addition), and result is folded to a 16-bit number from a 128-bit number.

Everything looks to be working fine, but my NEON implementation is consuming more time that of the ARM version.

ARM version takes: 0.860000 s NEON version takes: 1.260000 s

Note:

  1. Profiled using utilities from "time.h"
  2. The checksum function called 10,000 times from a sample application, and time computed after complete run of all the functions

Other details:

  1. Used GNU tool-chain(arm-none-linux-gnueabi-gcc) for compiling the intrinsic code and not arm tool-chain.
  2. Linux platform.
  3. C-intrinsic code.

Questions:

  1. Why does NEON version take more time than that of the ARM version? (Although I have taken care that intrinsic with minimum cycles in the batch is used)

  2. How do achieve what I want to achieve? (efficiency with NEON)

  3. Could someone point to me or share some sample implementations(pseudo-code/algorithms/code, not the theoretical implementation papers or talks) which uses the inter-operations of ARM-NEON together?

Any help would be much appreciated.

Here's my code:

uint16_t do_csum(const unsigned char * buff, int len)
{
int odd, count, i;

uint32x4_t result = veorq_u32( result, result), sum = veorq_u32( sum, sum); 
uint16x4_t data, data_hi, data_low, data8;
uint16x8_t dataq;
uint16_t result16, disp[20] = {0,0,0,0,0,0,0,0,0,0};

if (len <= 0)
    goto out;
odd = 1 & (unsigned long) buff;
if (odd) {
    uint8x8_t data1 = veor_u8( data1, data1); 
    data1 = (uint16x4_t)vld1_lane_u8((uint8_t *)buff, data1, 0); //result = *buff << 8;
    data1 = (uint16x4_t)vshl_n_u16( data1, 8);

    len--;
    buff++;
    result = vaddw_u16(result, data1);
}
count = len >> 1;       /* nr of 16-bit words.. */
if (count) {
    if (2 & (unsigned long) buff) {
        uint16x4_t data2 = veor_u16( data2, data2); 
        data2 = (uint16x4_t) vld1_lane_u16((uint16_t *)buff, data2, 0); //result += *(unsigned short *) buff;
        count--;
        len -= 2;
        buff += 2;
        result = vaddw_u16( result, data2);
    }
    count >>= 1;        /* nr of 32-bit words.. */
    if (count) {
        if (4 & (unsigned long) buff) {
            uint32x2_t data4 = (uint16x4_t) vld1_lane_u32((uint32_t *) buff, data4, 0);
            count--;
            len -= 4;
            buff += 4;
            result = vaddw_u16( result, data4);
        }
        count >>= 1;    /* nr of 64-bit words.. */
        if (count) {
            if (8 & (unsigned long) buff) {
                uint64x1_t data8 = vld1_u64((uint64_t *) buff); 
                count--;
                len -= 8;
                buff += 8;
                result = vaddw_u16( result,(uint16x4_t)data8);
            }
            count >>= 1;    /* nr of 128-bit words.. */
            if (count) {
                do {
                    dataq = (uint16x8_t)vld1q_u64((uint64_t *) buff); // VLD1.64 {d0, d1}, [r0]
                    count--;
                    buff += 16;

                    sum = vpaddlq_u16(dataq);   
                    vst1q_u16( disp, dataq); // VST1.16 {d0, d1}, [r0]

                    result = vaddq_u32( sum, result);
                } while (count);
            }
            if (len & 8) {
                uint64x1_t data8 =  vld1_u64((uint64_t *) buff); 
                buff += 8;
                result = vaddw_u16( result, (uint16x4_t)data8);
            }
        }
        if (len & 4) {
            uint32x2_t data4 = veor_u32( data4, data4); 

            data4 = (uint16x4_t)vld1_lane_u32((uint32_t *) buff, data4, 0);//result += *(unsigned int *) buff;
            buff += 4;
            result = vaddw_u16( result,(uint16x4_t) data4);
        }
    }
    if (len & 2) {
        uint16x4_t data2 = veor_u16( data2, data2); 
        data2 = (uint16x4_t) vld1_lane_u16((uint16_t *)buff, data2, 0); //result += *(unsigned short *) buff;
        buff += 2;
        result = vaddw_u16( result, data2);
    }
}
if (len & 1){
    uint8x8_t data1 = veor_u8( data1, data1); 
    data1 = (uint16x4_t) vld1_lane_u8((uint8_t *)buff, data1, 0); //result = *buff << 8;
    result = vaddw_u8( result, data1);
}


result16 = from128to16(result);

if (odd)
    result16 = ((result16 >> 8) & 0xff) | ((result16 & 0xff) << 8);

out:
    return result16;
}
Midrib answered 22/8, 2012 at 5:46 Comment(5)
Show your code and I'll be able to tell you what's wrong with it. Are you using GCC? If so, I would recommend writing assembly language in a separate file or use inline asm since GCC doesn't do well with intrinsics.Warbeck
@BitBank: Thanks, have edited my question to include the code, yes I'm using the cross compiler gcc. Using intrinsic, as I'm little unprepared to get into the shallow waters of assembly.Midrib
What value of len are you using for testing ? Also, are you compiling with -O3 ?Mediate
Thanks for the edit @Paul R, 1. The length is 2k bytes (data read from a file into an array into application then passed on to the do_sum function). 2. I'm using the following command for compiling: arm-none-linux-gnueabi-gcc -mfpu=neon -mfloat-abi=softfp -flax-vector-conversions -c <programName.c> not using any levels(honestly, dint know anything related to levels).Midrib
You really need gcc -O3 ... to enable compiler optimisations.Mediate
M
6

A few things you can improve:

  • Get rid of the stores to disp - this looks like debug code that got left in ?
  • Don't do horizontal addition within your main loop - just do partial (vertical) sums in the loop and do one final horizontal addition after the loop (see this answer for an example of how to do this - it's for SSE but the principle is the same)
  • Make sure you use gcc -O3 ... to get maximum benefit from compiler optimisation
  • Don't use goto ! (Doesn't affect performance but is evil.)
Mediate answered 22/8, 2012 at 6:16 Comment(8)
1. The disp code is indeed debug code, I'm commenting that out, got left out here, sorry about that. 2. Could you enlighten more on this? 3. Consider it done.Midrib
used the option suggested by you, :arm-none-linux-gnueabi-gcc -03 -mfpu=neon -mfloat-abi=softfp -flax-vector-conversions -c neonChecksum.c, but the compiler throws an error :arm-none-linux-gnueabi-gcc: unrecognized option '-03'Midrib
I'm sorry, did you mean O3(Alphabet O, numeric 3), it looks like 03(numeric 0, numeric 3), sorry for the above comment, now it compiled fine, will soon update you with my findings.Midrib
Whoa...!!! It works amazingly fast..!! The time now taken is: 0.050000 s..!! 16X better than ARM and 24X better than the NEON code that wasn't optimised using the option -O3..!! Thanks @Paul R. I'm all set to accept this answer, also if you could answer my other questions listed in my main question.Midrib
Just as a NOTE: if I also optimise ARM code while compiling with -O3 option, then its 0.200000 s, which means NEON code(optimised with -O3) is only 4X better the ARM code??Midrib
4X is not bad - you can do better if you are prepared to spend a lot of time writing and hand optimsiing NEON asm, but if you can meet your performance goals with a fairly simple implementation using intrinsics as above then be happy with that.Mediate
Note that on StackOverflow you should only ask one question per question - if you still have further questions then please post them as new questions. See the SO FAQ for more details on proper etiquette: stackoverflow.com/faqMediate
those questions are sub questions, related to this specific question. If I individually post them, they will be closed as open ended questions.Midrib

© 2022 - 2024 — McMap. All rights reserved.