I am new to NEON intrinsics. I have two arrays of 99 elements each, and I am trying to add them element-wise using NEON intrinsics. 99 is not a multiple of 8, 16, or 32, so 96 elements can be handled directly, but how do I handle the remaining 3 elements? Please help. Here is the code that I have written:
#include <arm_neon.h>
#define SIZE 99
void addition(unsigned char A[],unsigned char B[],unsigned short int *addres)
{
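    uint8x8_t v, v1;
    int i = 0;
    // note: with SIZE 99 the last iteration (i = 96) touches elements 96..103
    for (i = 0; i < SIZE; i = i + 8) {
        v = vld1_u8(&A[i]);   // load 8 bytes of A from memory into a vector
        v1 = vld1_u8(&B[i]);  // load 8 bytes of B from memory into a vector
        uint16x8_t t = vaddl_u8(v, v1);
        vst1q_u16(addres + i, t); // store the vector back to memory
    }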
}
By far the most efficient way of dealing with residuals in SIMD code that I have come up with so far is what I call the "withhold and rewind" method (invented by me, I suppose):
#include <arm_neon.h>
#include <stdint.h>

void addl(uint16_t *pDst, uint8_t *pA, uint8_t *pB, intptr_t size)
{
    // assert(size >= 8);
    uint8x8_t a, b;
    uint16x8_t c;

    size -= 8; // withhold one block of 8
    do {
        do {
            size -= 8;
            a = vld1_u8(pA);
            b = vld1_u8(pB);
            c = vaddl_u8(a, b); // widening add: 8 x u8 -> 8 x u16
            vst1q_u16(pDst, c);
            pA += 8;
            pB += 8;
            pDst += 8;
        } while (size >= 0);

        pA += size; // and rewind (size is negative here)
        pB += size;
        pDst += size;
    } while (size > -8);
}
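To see which elements each pass touches, it may help to trace the block offsets. This is only a sketch: block_offsets is a hypothetical helper, not part of the routine above. It uses the same loop structure but records the start offset of each 8-wide block instead of doing the NEON load/add/store, so it runs on any machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): records the start offset of
   every 8-wide block the withhold-and-rewind loop would process.
   offsets must have room for at least size / 8 + 1 entries.
   Returns the number of blocks. */
int block_offsets(intptr_t size, intptr_t *offsets)
{
    assert(size >= 8);

    intptr_t pos = 0; /* stands in for pA, pB and pDst */
    int count = 0;

    size -= 8; /* withhold one block */
    do {
        do {
            size -= 8;
            offsets[count++] = pos; /* stands in for load/add/store */
            pos += 8;
        } while (size >= 0);

        pos += size; /* rewind: size is negative here */
    } while (size > -8);

    return count;
}
```

For size 99 this records 0, 8, ..., 88 and then 91, so 13 blocks in total. The last block overlaps the previous one by 5 elements. That is harmless in this case because the output buffer is separate from the inputs, so the overlapped results are simply written a second time with the same values.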
size can be any number greater than or equal to 8.
There are three drawbacks, though:
- size HAS TO BE >= 8 (no problem in most cases)
- aarch32 assembly (no problem in intrinsics)
PS: size has to be of type intptr_t. The process will crash on 64-bit machines otherwise.
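If inputs shorter than 8 elements can occur, one option is a plain scalar loop for that case. The following is a minimal sketch under that assumption; addl_scalar is a hypothetical name, not part of the routine above, and it can also serve as a reference for checking the NEON results.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar fallback (illustration only): widening element-wise
   add of two uint8_t arrays into a uint16_t array, for any size. */
void addl_scalar(uint16_t *pDst, const uint8_t *pA, const uint8_t *pB,
                 size_t size)
{
    assert(pDst && pA && pB);

    for (size_t i = 0; i < size; ++i) {
        /* widen before adding so 255 + 255 gives 510 instead of wrapping */
        pDst[i] = (uint16_t)((uint16_t)pA[i] + (uint16_t)pB[i]);
    }
}
```

A caller could check size first and use this loop when size < 8, and the NEON version otherwise.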