I am trying to implement a simple parallel reduction, using the code from the CUDA SDK. But there is a problem in my kernel: the shared array is not getting the values of the global array, and it is all zeroes.
extern __shared__ float4 sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
sdata[tid] = dev_src[i];
__syncthreads();
// do reduction in shared mem
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2 * s) == 0) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}
// write result for this block to global mem
if (tid == 0)
    out[blockIdx.x] = sdata[0];
Edit:
OK, I got it working by removing the extern keyword and making the shared array a constant size like 512. I am in good shape now. Maybe someone can explain why it was not working with the extern keyword.
I think I know why this is happening, as I have faced this before. How are you launching the kernel?
Remember that in the launch kernel<<<blocks,threads,sharedMemory>>>, sharedMemory should be the size of the shared memory in bytes. So, if you are declaring 512 elements, the third parameter should be 512 * sizeof(float4). I think you are calling it as below, which is wrong because it reserves only 512 bytes, enough for just 32 float4 values:
kernel<<<blocks,threads,512>>> // this is wrong
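For comparison, here is a minimal sketch of what a correct launch might look like. The kernel name reduceSum, its signature, the helper sharedBytes, and the sizes (4 blocks of 512 threads) are all assumptions made for illustration, since the question does not show them. The float4 values are added component by component so the sketch does not depend on the SDK's operator overloads.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel name and signature; the question only shows the body.
__global__ void reduceSum(const float4 *dev_src, float4 *out)
{
    // Unsized extern array: its size comes from the third launch parameter.
    extern __shared__ float4 sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = dev_src[i];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            // Component-wise add; plain CUDA has no built-in float4 operator+=.
            sdata[tid].x += sdata[tid + s].x;
            sdata[tid].y += sdata[tid + s].y;
            sdata[tid].z += sdata[tid + s].z;
            sdata[tid].w += sdata[tid + s].w;
        }
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Hypothetical helper: bytes of dynamic shared memory for one float4 per thread.
static size_t sharedBytes(unsigned int threads)
{
    return threads * sizeof(float4);
}

int main()
{
    const unsigned int threads = 512;  // assumed block size (power of two)
    const unsigned int blocks = 4;     // assumed grid size
    const unsigned int n = threads * blocks;

    std::vector<float4> h_src(n, make_float4(1.0f, 2.0f, 3.0f, 4.0f));
    std::vector<float4> h_out(blocks);

    float4 *dev_src = nullptr;
    float4 *dev_out = nullptr;
    cudaMalloc(&dev_src, n * sizeof(float4));
    cudaMalloc(&dev_out, blocks * sizeof(float4));
    cudaMemcpy(dev_src, h_src.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    // Third parameter is a byte count, not an element count.
    reduceSum<<<blocks, threads, sharedBytes(threads)>>>(dev_src, dev_out);

    // Check both the launch itself and the kernel execution for errors.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_out.data(), dev_out, blocks * sizeof(float4), cudaMemcpyDeviceToHost);
    for (unsigned int b = 0; b < blocks; ++b)
        printf("block %u: %f %f %f %f\n", b, h_out[b].x, h_out[b].y, h_out[b].z, h_out[b].w);

    cudaFree(dev_src);
    cudaFree(dev_out);
    return 0;
}
```

Checking the error code after the launch is worth keeping even in a small test: if the shared array is too small, the kernel may fail or read garbage, and an output buffer that stays at zero is often the only visible symptom.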