I am trying to accelerate a stereo matching algorithm on the ODROID XU4 ARM platform using NEON SIMD. For this purpose I am using OpenMP's pragmas.
void StereoMatch::sadCol(uint8_t* leftRank, uint8_t* rightRank, const int SAD_WIDTH, const int SAD_WIDTH_STEP, const int imgWidth, int j, int d, uint16_t* cost)
{
    uint16_t sum = 0;
    int n = 0;
    int m = 0;
    for (n = 0; n < SAD_WIDTH + 1; n++)
    {
        #pragma omp simd
        for (m = 0; m < SAD_WIDTH_STEP; m = m + imgWidth)
        {
            sum += abs(leftRank[j+m+n] - rightRank[j+m+n-d]);
        }
        cost[n] = sum;
        sum = 0;
    }
}
I am fairly new to SIMD and OpenMP. I understood that the simd pragma directs the compiler to vectorize the subtraction, but when I executed the code I noticed no difference in runtime. What should I add to my code in order to vectorize it?
As said in the comments, ARM NEON has an instruction (VABAL.U8, exposed as the vabal_u8 intrinsic) which does exactly what you want: it computes the absolute differences of unsigned bytes and accumulates them into unsigned 16-bit integers.
Assuming SAD_WIDTH+1 == 8, here is a very simple implementation using intrinsics (based on the simplified version by @nemequ):
#include <arm_neon.h>
#include <stdint.h>

void sadCol(uint8_t* leftRank,
            uint8_t* rightRank,
            int j,
            int d,
            uint16_t* cost) {
    const int SAD_WIDTH = 7;
    const int imgWidth = 320;
    const int SAD_WIDTH_STEP = SAD_WIDTH * imgWidth;
    uint16x8_t cost_8 = vdupq_n_u16(0);   // eight 16-bit accumulators, all zero
    for (int m = 0; m < SAD_WIDTH_STEP; m = m + imgWidth) {
        // load 8 bytes from each row and accumulate |left - right| per lane
        cost_8 = vabal_u8(cost_8, vld1_u8(&leftRank[j+m]), vld1_u8(&rightRank[j+m-d]));
    }
    vst1q_u16(cost, cost_8);              // write the 8 accumulated sums to cost[0..7]
}
vld1_u8 loads 8 consecutive bytes, vabal_u8 computes the absolute differences and accumulates them into the first register, and finally vst1q_u16 stores the register to memory.
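In case the intrinsic is hard to picture, here is a scalar model of what one vabal_u8(acc, a, b) step does per lane (the helper name below is just for illustration, it is not part of arm_neon.h):

#include <stdint.h>

// Scalar model of vabal_u8(acc, a, b): for each of the 8 lanes, acc += |a - b|,
// with the byte difference widened to 16 bits before accumulation.
static void vabal_u8_scalar(uint16_t acc[8], const uint8_t a[8], const uint8_t b[8]) {
    for (int lane = 0; lane < 8; lane++) {
        int diff = (int)a[lane] - (int)b[lane];
        acc[lane] += (uint16_t)(diff < 0 ? -diff : diff);
    }
}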
You can easily make imgWidth and SAD_WIDTH_STEP function parameters. If SAD_WIDTH+1 is a different multiple of 8, you can write another loop for that, as sketched below.
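For example, a sketch of that generalization could look like the following; the function name sadColN and the assumption that SAD_WIDTH+1 is a multiple of 8 are mine, and I have not tested it:

#include <arm_neon.h>
#include <stdint.h>

// Processes 8 columns per outer iteration; requires (SAD_WIDTH + 1) % 8 == 0
// and cost to have room for SAD_WIDTH + 1 entries.
void sadColN(uint8_t* leftRank, uint8_t* rightRank,
             int SAD_WIDTH, int SAD_WIDTH_STEP, int imgWidth,
             int j, int d, uint16_t* cost) {
    for (int n = 0; n < SAD_WIDTH + 1; n += 8) {
        uint16x8_t cost_8 = vdupq_n_u16(0);
        for (int m = 0; m < SAD_WIDTH_STEP; m += imgWidth) {
            cost_8 = vabal_u8(cost_8,
                              vld1_u8(&leftRank[j+m+n]),
                              vld1_u8(&rightRank[j+m+n-d]));
        }
        vst1q_u16(&cost[n], cost_8);
    }
}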
I have no ARM platform at hand to test it, but "it compiles": https://godbolt.org/z/vPqiYI (and the assembly looks fine to my eyes). If you optimize with -O3, gcc will unroll the loop.
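For reference, something along the lines of the following should build it natively on the XU4 (Cortex-A15/A7, 32-bit ARM); the file name is a placeholder and the exact flags depend on your toolchain, so take this as an assumption rather than a recipe:

g++ -O3 -mcpu=cortex-a15 -mfpu=neon stereo.cpp -o stereo

Add -fopenmp (or -fopenmp-simd) only if you keep the #pragma omp simd version; the intrinsics do not need it.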