c raspberry-pi simd neon armv8

Handling leftover elements when the array size is not a multiple of the vector width, using NEON intrinsics


I am new to NEON intrinsics. I have two arrays of 99 elements each, which I am trying to add element-wise using NEON intrinsics. Since 99 is not a multiple of 8, 16, or 32, only 96 elements can be handled by the vector loop. How do I handle the remaining 3 elements? Here is the code that I have written:

    #include <arm_neon.h>
    #define SIZE 99
    void addition(unsigned char A[], unsigned char B[], unsigned short int *addres)
    {
        uint8x8_t v, v1;
        int i = 0;
        for (i = 0; i < SIZE; i = i + 8) {
            v = vld1_u8(&A[i]);              // load 8 bytes of A into a vector
            v1 = vld1_u8(&B[i]);             // load 8 bytes of B into a vector
            uint16x8_t t = vaddl_u8(v, v1);  // widening add: 8 x u8 -> 8 x u16
            vst1q_u16(addres + i, t);        // store the vector back to memory
            // note: the last iteration (i = 96) touches elements 96..103, past the end
        }
    }

Solution

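  • By far the most efficient way of dealing with residuals in SIMD that I have come up with so far is what I call the "withhold and rewind" method (my own invention, I suppose). The idea: hold back one block of 8, process full blocks as usual, then step the pointers back so that one final, overlapping block ends exactly at the end of the arrays.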

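    #include <arm_neon.h>
    #include <stdint.h>

    void addl(uint16_t *pDst, uint8_t *pA, uint8_t *pB, intptr_t size)
    {
        // precondition: size >= 8
        uint8x8_t a, b;
        uint16x8_t c;

        size -= 8; // withhold one block of 8

        do {
            do {
                size -= 8;
                a = vld1_u8(pA);      // load 8 x u8 from each input
                b = vld1_u8(pB);
                c = vaddl_u8(a, b);   // widening add: 8 x u8 -> 8 x u16
                vst1q_u16(pDst, c);
                pA += 8;
                pB += 8;
                pDst += 8;
            } while (size >= 0);

            pA += size;      // and rewind (size is negative here)
            pB += size;
            pDst += size;
        } while (size > -8); // exits once there is no tail left to process
    }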
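To see which blocks the loop above actually touches, its control flow can be replayed without any NEON code. The sketch below is only an illustration: `trace_blocks`, `offsets` and `capacity` are hypothetical names that do not appear in the answer, and `pos` stands in for the three pointers, which always move together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: mirrors the loop control of addl() and records the
 * start offset of every 8-element block instead of touching any data.
 * Returns the number of offsets written to offsets[]. */
size_t trace_blocks(intptr_t size, intptr_t *offsets, size_t capacity)
{
    assert(size >= 8);            /* same precondition as addl() */
    intptr_t pos = 0;             /* stands in for pA, pB and pDst */
    size_t count = 0;

    size -= 8;                    /* withhold one block */
    do {
        do {
            size -= 8;
            assert(count < capacity);
            offsets[count++] = pos;   /* addl() would load, add, store here */
            pos += 8;
        } while (size >= 0);
        pos += size;              /* rewind: size is negative at this point */
    } while (size > -8);

    return count;
}
```

With size = 99 the recorded offsets are 0, 8, ..., 88 followed by 91, so the final block covers elements 91 to 98 and elements 91 to 95 are simply computed a second time. With size = 96 there are 12 blocks and no overlapping tail.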
    

    size can be any number greater than or equal to 8.
    There are three drawbacks though:

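    PS: size has to be of the type intptr_t. It goes negative and is added to the pointers during the rewind, so it must be a signed, pointer-sized value; the process will crash on 64-bit machines otherwise.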
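For comparison, a more conventional way to handle the 3 leftover elements is a scalar tail loop after the vector loop. This is not the withhold-and-rewind method, just a minimal sketch of the alternative. It also works for sizes below 8. The name `addl_scalar_tail` is hypothetical, and the `__ARM_NEON` guard is an assumption added so the sketch still compiles (as plain scalar code) on non-ARM hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical name: widening add of two u8 arrays into a u16 array,
 * using NEON for full blocks of 8 and scalar code for the remainder. */
void addl_scalar_tail(uint16_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;

#if defined(__ARM_NEON)
    /* Full 8-element blocks: 96 of the 99 elements in the question. */
    for (; i + 8 <= n; i += 8) {
        uint8x8_t va = vld1_u8(a + i);
        uint8x8_t vb = vld1_u8(b + i);
        vst1q_u16(dst + i, vaddl_u8(va, vb));
    }
#endif

    /* Leftover elements (all of them when NEON is unavailable). */
    for (; i < n; ++i) {
        dst[i] = (uint16_t)(a[i] + b[i]);
    }
}
```

The trade-off is an extra scalar loop of up to 7 iterations instead of one overlapping vector block; in exchange, no element is computed twice and there is no minimum size.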