There have been several attempts to speed up HOG descriptor calculation with SIMD instructions: OpenCV, Dlib, and Simd. All of them use scalar code to add the resulting magnitude to the HOG histogram:
float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size];
for(size_t i = 0; i < size; ++i)
{
    histogram[y/8][x/8][idx[i]] += val[i]*ky[y]*kx[x];
    histogram[y/8][x/8 + 1][idx[i]] += val[i]*ky[y]*kx[x + 1];
    histogram[y/8 + 1][x/8][idx[i]] += val[i]*ky[y + 1]*kx[x];
    histogram[y/8 + 1][x/8 + 1][idx[i]] += val[i]*ky[y + 1]*kx[x + 1];
}
Here the value of size depends on the implementation, but the general idea is the same.
I know that histogram calculation in general has no simple and efficient SIMD solution. But in this case the histogram is small (only 18 bins). Can that help with SIMD optimization?
I have found a solution: a temporary buffer. First we accumulate the histogram into the temporary buffer (this operation can be vectorized). Then we add the sums from the buffer to the output histogram (this operation can also be vectorized):
float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size];
float buf[18][4] = {0}; // must start at zero before accumulation
for(size_t i = 0; i < size; ++i)
{
    buf[idx[i]][0] += val[i]*ky[y]*kx[x];
    buf[idx[i]][1] += val[i]*ky[y]*kx[x + 1];
    buf[idx[i]][2] += val[i]*ky[y + 1]*kx[x];
    buf[idx[i]][3] += val[i]*ky[y + 1]*kx[x + 1];
}
for(size_t i = 0; i < 18; ++i)
{
    histogram[y/8][x/8][i] += buf[i][0];
    histogram[y/8][x/8 + 1][i] += buf[i][1];
    histogram[y/8 + 1][x/8][i] += buf[i][2];
    histogram[y/8 + 1][x/8 + 1][i] += buf[i][3];
}
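The reason the first loop vectorizes is that the four contributions of one sample land in four adjacent floats, buf[idx[i]][0..3], so a single 4-wide multiply and add can update them together. Below is a minimal sketch of that idea with SSE intrinsics, not the code of any of the libraries mentioned above. The helper names accumulate and flush, the constant Q, and the flat array w (four precomputed weight products per sample) are assumptions made for illustration. The sketch also assumes that all samples in one call fall between the same four cells. A scalar fallback is kept for targets without SSE.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HOG_SKETCH_SSE 1
#else
#define HOG_SKETCH_SSE 0
#endif

const int Q = 18; // number of orientation bins, as in the question

// Hypothetical helper. w holds four precomputed weights per sample:
//   w[4*i + 0..3] = { ky[y]*kx[x], ky[y]*kx[x+1], ky[y+1]*kx[x], ky[y+1]*kx[x+1] }
// All samples are assumed to fall between the same four cells.
void accumulate(const int* idx, const float* val, const float* w,
                std::size_t size, float buf[][4])
{
    for (std::size_t i = 0; i < size; ++i)
    {
        float* dst = buf[idx[i]]; // the four lanes of one bin are contiguous
#if HOG_SKETCH_SSE
        // Broadcast the magnitude, scale the four weights, add to the bin.
        __m128 contribution = _mm_mul_ps(_mm_set1_ps(val[i]), _mm_loadu_ps(w + 4 * i));
        _mm_storeu_ps(dst, _mm_add_ps(_mm_loadu_ps(dst), contribution));
#else
        for (int k = 0; k < 4; ++k) // scalar fallback for non-SSE targets
            dst[k] += val[i] * w[4 * i + k];
#endif
    }
}

// Hypothetical helper: adds the buffer to the four neighbouring cell
// histograms (Q floats each) and clears it for the next group of samples.
void flush(float buf[][4], float* h00, float* h01, float* h10, float* h11)
{
    for (int i = 0; i < Q; ++i)
    {
        h00[i] += buf[i][0];
        h01[i] += buf[i][1];
        h10[i] += buf[i][2];
        h11[i] += buf[i][3];
        buf[i][0] = buf[i][1] = buf[i][2] = buf[i][3] = 0.0f;
    }
}
```

The flush step is left scalar here to keep the sketch short. With the buf[18][4] layout it reads one lane per destination cell, so a vectorized version would probably need a transpose or a different buffer layout.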