I am working on a kernel that does a vector reduction: it sums all the elements of the vector and stores the result in position 0.
I'm following the usual pairwise tree-reduction scheme, with blocks of 512 float elements.
The code:
// scratch[] is a 512-element float array in shared memory
const int NUM_ELEMENTS = 512;
for (int stride = NUM_ELEMENTS / 2; stride >= 1; stride /= 2) {
    if (threadIdx.x < stride) {
        scratch[threadIdx.x] += scratch[threadIdx.x + stride];
    }
    __syncthreads();
}
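For completeness, here is a minimal sketch of the kind of kernel this loop sits in; the kernel name reduce512 and the pointer name data are just placeholders for this sketch, not part of my actual code:

#define NUM_ELEMENTS 512

__global__ void reduce512(float *data)
{
    __shared__ float scratch[NUM_ELEMENTS];

    // Each of the 512 threads stages one element in shared memory.
    scratch[threadIdx.x] = data[threadIdx.x];
    __syncthreads();   // every element must be staged before the first iteration

    for (int stride = NUM_ELEMENTS / 2; stride >= 1; stride /= 2) {
        if (threadIdx.x < stride) {
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // The sum of all 512 elements ends up in scratch[0]; write it back to
    // position 0 of the vector.
    if (threadIdx.x == 0) {
        data[0] = scratch[0];
    }
}

It is launched with a single block of 512 threads, e.g. reduce512<<<1, NUM_ELEMENTS>>>(d_data);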
The odd thing is that I expected shared memory bank conflicts, but I'm not getting any. In the first iteration, thread 0 adds positions 0 and 256, which reside in the same bank; thread 1 adds positions 1 and 257, and so on.
All of these operations require each thread in the warp to read two distinct values from the same bank, yet I get no conflicts whatsoever.
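For reference, this is the bank arithmetic I am assuming: 32 banks of 4-byte words (compute capability 2.x and later; older devices have 16 banks, but 0 and 256 still land in the same bank either way), so the bank is just the word index mod 32. The helper below is only for illustration:

// Bank of a 4-byte shared-memory word, assuming 32 banks.
__device__ int bankOf(int wordIndex) { return wordIndex % 32; }

// bankOf(0) == 0 and bankOf(256) == 0  -> thread 0's two operands sit in bank 0
// bankOf(1) == 1 and bankOf(257) == 1  -> thread 1's two operands sit in bank 1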
What am I missing?
The calculation of bank conflicts is done per memory instruction, per request. The shared load (the right-hand side) and the shared store (the left-hand side) are executed as separate instructions, many clock cycles apart. Within any single one of those instructions, the active threads of a warp access consecutive words of shared memory, which fall into distinct banks, so none of them generates a conflict. A conflict only occurs when two or more threads of the same warp access different words in the same bank within the same instruction; two accesses made by the same thread in different instructions never conflict with each other.
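To make that concrete, here is a small sketch contrasting the conflict-free pattern in your code with an access pattern that does conflict within a single instruction. It assumes 32 banks of 4-byte words, and the kernel name bank_demo and the out pointer are purely illustrative:

__global__ void bank_demo(float *out)
{
    __shared__ float scratch[512];
    scratch[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();

    // Conflict-free: in each individual load instruction below, the 32 threads
    // of a warp read 32 consecutive words, which map to 32 distinct banks.
    float a = scratch[threadIdx.x];
    float b = scratch[(threadIdx.x + 256) % 512];  // same bank as 'a' for each thread, but a separate instruction

    // 2-way conflict: within this single load instruction, threads 0 and 16 of a
    // warp both hit bank 0, threads 1 and 17 both hit bank 2, and so on.
    float c = scratch[(threadIdx.x * 2) % 512];

    out[threadIdx.x] = a + b + c;
}

Launched with 512 threads per block, only the strided load should show up as a bank conflict; the two accesses 256 apart never occur in the same instruction, which is exactly the situation in your reduction loop.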