Being a beginner with CUDA, my naive model of the following simple kernel
__global__ void sumColumns(float* d_a, float* d_sum, int n) {
    if (blockIdx.x < 2 && threadIdx.x < n) {
        d_sum[blockIdx.x] += d_a[threadIdx.x * 2 + blockIdx.x];
    }
}
was that, given a 1024 x 2 matrix d_a, a call like

sumColumns<<<2, 1024>>>(d_a, d_sum, 1024);
would go through all rows and columns (asynchronously) and then produce the sum of each column (d_sum is a 2 x 1 vector initialised to 0). Of course, nothing like this happens. The result varies from run to run; it never produces the sum of the whole column, and in most cases the sum is well below +/-1024 (I initialise the matrix to have a column of 1s and another column of -1s).
The model I have in my mind is that, after the kernel call, the GPU generates 1024 threads in each block, and then, for the current thread, the if statement does the correct mapping. Can someone provide a correct description of what is going on?
It's a race condition: all 1024 threads of a block try to update the same memory location, d_sum[blockIdx.x]. The += is a read-modify-write sequence, so many threads read the old value at the same time, add their element, and write back, overwriting each other's updates. That is why the result varies from run to run and usually ends up well below +/-1024.
To resolve it, you can make the update atomic:
__global__ void sumColumns(float* d_a, float* d_sum, int n) {
    if (blockIdx.x < 2 && threadIdx.x < n) {
        atomicAdd(&d_sum[blockIdx.x], d_a[threadIdx.x * 2 + blockIdx.x]);
    }
}
Or, to reduce contention on global memory, accumulate a per-block partial sum in shared memory first and issue a single global atomicAdd per block:

__global__ void sumColumns(float* d_a, float* d_sum, int n) {
    // Shared memory for the partial sums, same logic as your blockIdx.x < 2
    __shared__ float s_sum[2];
    if (threadIdx.x == 0) {
        s_sum[0] = 0.0f;
        s_sum[1] = 0.0f;
    }
    __syncthreads();
    if (blockIdx.x < 2 && threadIdx.x < n) {
        // Atomics on shared memory: contention stays within the block,
        // which reduces pressure on global memory atomics.
        atomicAdd(&s_sum[blockIdx.x], d_a[threadIdx.x * 2 + blockIdx.x]);
    }
    __syncthreads();
    // Single atomic update of global memory per block
    if (threadIdx.x == 0) {
        atomicAdd(&d_sum[blockIdx.x], s_sum[blockIdx.x]);
    }
}
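
For completeness, here is a minimal host-side sketch of how the atomicAdd version could be driven and verified; the host variable names (h_a, h_sum) and the use of cudaMemset to zero d_sum are my own choices, not part of the question:

#include <cstdio>
#include <cuda_runtime.h>

// atomicAdd version of the kernel from above
__global__ void sumColumns(float* d_a, float* d_sum, int n) {
    if (blockIdx.x < 2 && threadIdx.x < n) {
        atomicAdd(&d_sum[blockIdx.x], d_a[threadIdx.x * 2 + blockIdx.x]);
    }
}

int main() {
    const int n = 1024;
    float h_a[2 * n];                  // row-major 1024 x 2 matrix
    for (int row = 0; row < n; ++row) {
        h_a[row * 2 + 0] =  1.0f;      // column 0: all 1s
        h_a[row * 2 + 1] = -1.0f;      // column 1: all -1s
    }

    float *d_a, *d_sum;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_sum, 2 * sizeof(float));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemset(d_sum, 0, 2 * sizeof(float));   // d_sum must start at zero

    sumColumns<<<2, n>>>(d_a, d_sum, n);

    float h_sum[2];
    cudaMemcpy(h_sum, d_sum, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("column sums: %f %f\n", h_sum[0], h_sum[1]);   // expect 1024 and -1024

    cudaFree(d_a);
    cudaFree(d_sum);
    return 0;
}

With either kernel above, the printed sums come out as 1024 and -1024 on every run, unlike the original += version.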