This simple reduction function is found in one of the CUDA presentations online.
__device__ void reducedSum(double* d_idata, double* d_odata, long LENGTH)
{
    extern __shared__ double sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < LENGTH) {
        sdata[tid] = d_idata[i];
        __syncthreads();
        printf("Kernel sdata : %d \n", sdata[tid]);
        for (unsigned int s = 1; s < blockDim.x; s *= 2)
        {
            if (tid % (2 * s) == 0)
            {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }
        if (tid == 0) {
            d_odata[blockIdx.x] = sdata[0];
        }
    }
}
But the printf here always prints the wrong output. The function is expected to copy the values from the d_idata array into shared memory, with each block taking its own part of the array. However, that doesn't seem to happen.
Call to the kernel is as follows:
long LENGTH = 10;
long N = 5;
int threadsPerBlock = N;
int numBlocks = (threadsPerBlock + LENGTH - 1) / threadsPerBlock;
cudaCalc<<<numBlocks, threadsPerBlock, N*sizeof(double)>>> (d_vec1, d_vec2, d_dotProduct, ....)
Now inside the kernel I call this reducedSum __device__ function as follows.
__global__ void cudaCalc(int* d_vec1, int* d_vec2, double* d_dotProduct, ... )
{
    int tid_0 = threadIdx.x;
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index < LENGTH) {
        d_dotProduct[index] = (double) d_vec1[index] * d_vec2[index];
        d_squared1[index] = (double)d_vec1[index] * d_vec1[index];
        d_squared2[index] = (double)d_vec2[index] * d_vec2[index];
        __syncthreads();
    }
    reducedSum(d_squared1, d_squaredSum1, LENGTH);
    reducedSum(d_squared2, d_squaredSum2, LENGTH);
    reducedSum(d_dotProduct, d_dotSum, LENGTH);
}
Can some good sir/madam please show me where my code is wrong? If you want to see the rest of the code, please ask. Thank you in advance.
The fault was with the printf function. I can't believe I spent hours on this.
printf("Kernel sdata : %d \n", sdata[tid]);
The %d placeholder is for an integer, while sdata is a double array. Changing the placeholder to %f solved the problem.
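For anyone who wants to see the corrected format string in isolation, here is a minimal host-side sketch. It assumes the device-side printf follows the same conversion-specifier rules as the host one. The helper name formatSdata is hypothetical and is not part of the original kernel.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats one value the way the corrected kernel
// printf would, using %f because the argument is a double.
std::string formatSdata(double value) {
    char buffer[64];
    // Using "%d" here would be undefined behavior: it tells printf to read
    // an int where a double was actually passed.
    std::snprintf(buffer, sizeof(buffer), "Kernel sdata : %f \n", value);
    return std::string(buffer);
}
```

Calling formatSdata(25.0) should give "Kernel sdata : 25.000000 \n", since %f prints six digits after the decimal point by default.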
It's such a bummer that the nvcc compiler doesn't show a warning or an error for this type of mistake. gcc, on the other hand, shows plenty of warnings. This should be a suggestion for nvcc.
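Without a compiler warning, one way to guard against this kind of mismatch is to let the argument type pick the specifier. The following is only a host-side sketch of that idea. The names specifierFor and formatValue are hypothetical. It assumes the label contains no '%' characters. Using it in a kernel would additionally require __device__ versions built on the device printf, which is not shown here.

```cpp
#include <cstdio>
#include <string>

// Hypothetical overload set: the argument type selects the conversion specifier.
inline const char* specifierFor(int)    { return "%d"; }
inline const char* specifierFor(long)   { return "%ld"; }
inline const char* specifierFor(double) { return "%f"; }

// Hypothetical helper: appends a specifier that matches T to the label,
// so a double cannot be paired with %d by accident.
// Assumption: label contains no '%' characters of its own.
template <typename T>
std::string formatValue(const std::string& label, T value) {
    const std::string format = label + specifierFor(value);
    char buffer[128];
    std::snprintf(buffer, sizeof(buffer), format.c_str(), value);
    return std::string(buffer);
}
```

With this, formatValue("Kernel sdata : ", 2.5) uses %f and formatValue("n = ", 10) uses %d, with no format string written by hand.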