I recently discovered the racecheck tool of cuda-memcheck, available in CUDA 5.0 (cuda-memcheck --tool racecheck
, see the NVIDIA doc). This tool can detect race conditions with shared memory in a CUDA kernel.
In debug mode, this tool does not detect anything, which is apparently normal. However, in release mode (-O3
), I get errors depending on the parameters of the problem.
Here is an error example (initialization of shared memory on line 22, assignment on line 119):
========= ERROR: Potential WAW hazard detected at __shared__ 0x0 in block (35, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000890 in ....h:119:void kernel_test3<float, unsigned int=4, unsigned int=32, unsigned int=64>(Data<float, unsigned int=4, unsigned int=32, unsigned int=64>*)
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:22:void kernel_test3<float, unsigned int=4, unsigned int=32, unsigned int=64>(Data<float, unsigned int=4, unsigned int=32, unsigned int=64>*)
========= Current Value : 13, Incoming Value : 0
threadIdx.x
, but this changed nothing.__shared__ float arr[SIZE_1][SIZE_2][NB_THREADS_PER_BLOCK]
. I do not really understand how there could be any race conditions, since each thread deals with its own part of shared memory.In order to understand what was happening, I tested with some simpler kernels.
Let me show you an example of a kernel that creates that kind of error. Basically, this kernel uses SIZE_X*SIZE_Y*NTHREADS*sizeof(float)
bytes of shared memory, and I can use 48KB of shared memory per SM.
test.cu
template <unsigned int NTHREADS>
__global__ void kernel_test()
{
const int SIZE_X = 4;
const int SIZE_Y = 4;
__shared__ float tmp[SIZE_X][SIZE_Y][NTHREADS];
for (unsigned int i = 0; i < SIZE_X; i++)
for (unsigned int j = 0; j < SIZE_Y; j++)
tmp[i][j][threadIdx.x] = threadIdx.x;
}
int main()
{
const unsigned int NTHREADS = 32;
//kernel_test<NTHREADS><<<32, NTHREADS>>>(); // ---> works fine
kernel_test<NTHREADS><<<64, NTHREADS>>>();
cudaDeviceSynchronize(); // ---> gives racecheck errors if NBLOCKS > 32
}
Compilation:
nvcc test.cu --ptxas-options=-v -o test
If we run the kernel:
cuda-memcheck --tool racecheck test
kernel_test<32><<<32, 32>>>();
: 32 blocks, 32 threads => does not lead to any apparent racecheck error.
kernel_test<32><<<64, 32>>>();
: 64 blocks, 32 threads => leads to WAW hazards (threadId.x
= 32?!) and errors.
========= ERROR: Potential WAW hazard detected at __shared__ 0x6 in block (57, 0, 0) :
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (1, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 128
========= INFO:(Identical data being written) Potential WAW hazard detected at __shared__ 0x0 in block (47, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 0
So what am I missing here? Am I doing something wrong with shared memory? (I am still a beginner with this)
The problem seems to be coming from cudaDeviceSynchronize()
when NBLOCKS > 32
. Why is this happening?
This was apparently a bug in NVIDIA drivers for Linux. The bug disappeared after the 313.18 release.