Tags: cuda, race-condition, gpu-shared-memory, memcheck

CUDA racecheck, shared memory array and cudaDeviceSynchronize()


I recently discovered the racecheck tool of cuda-memcheck, first available in CUDA 5.0 (cuda-memcheck --tool racecheck; see the NVIDIA documentation). This tool can detect race conditions on shared memory within a CUDA kernel.

In debug mode (nvcc -G), this tool does not detect anything, which is apparently normal. However, in release mode (-O3), I get errors depending on the parameters of the problem.
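For concreteness, the two builds are produced roughly like this (a sketch with placeholder file names; -G compiles device code with debug info and disabled optimizations, and the -O3 build is what I call release mode):

nvcc -G -g app.cu -o app_debug
nvcc -O3 app.cu -o app_release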

Here is an error example (initialization of shared memory on line 22, assignment on line 119):

========= ERROR: Potential WAW hazard detected at __shared__ 0x0 in block (35, 0, 0) :
=========     Write Thread (32, 0, 0) at 0x00000890 in ....h:119:void kernel_test3<float, unsigned int=4, unsigned int=32, unsigned int=64>(Data<float, unsigned int=4, unsigned int=32, unsigned int=64>*)
=========     Write Thread (0, 0, 0) at 0x00000048 in ....h:22:void kernel_test3<float, unsigned int=4, unsigned int=32, unsigned int=64>(Data<float, unsigned int=4, unsigned int=32, unsigned int=64>*)  
=========     Current Value : 13, Incoming Value : 0
  1. The first thing that surprised me is the thread IDs. When I first encountered the error, each block contained 32 threads (IDs 0 to 31). So why is there a problem with thread ID 32? I even added an extra guard on threadIdx.x (see the sketch just after this list), but this changed nothing.
  2. I use shared memory as a temporary buffer, and each thread writes only to its own slice of a multidimensional array, e.g. __shared__ float arr[SIZE_1][SIZE_2][NB_THREADS_PER_BLOCK], indexed by threadIdx.x in the last dimension. I do not see how there could be any race condition, since each thread only ever touches its own part of shared memory.
  3. Reducing the grid from 64 blocks to 32 blocks (still with 32 threads per block) seemed to make the errors disappear. I do not understand why.
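For reference, the extra check mentioned in point 1 looked roughly like this (a reduced sketch, not my real kernel; kernel_guarded and the fixed 4x4 sizes are placeholders):

template <unsigned int NTHREADS>
__global__ void kernel_guarded()
{
    __shared__ float tmp[4][4][NTHREADS];

    // With <<<..., NTHREADS>>> this guard should always be true,
    // yet racecheck still reported a write by thread (32, 0, 0).
    if (threadIdx.x < NTHREADS)
        for (unsigned int i = 0; i < 4; i++)
            for (unsigned int j = 0; j < 4; j++)
                tmp[i][j][threadIdx.x] = threadIdx.x;
}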

In order to understand what was happening, I tested with some simpler kernels. Here is an example of a kernel that triggers this kind of error. This kernel uses SIZE_X * SIZE_Y * NTHREADS * sizeof(float) bytes of shared memory per block, and my device offers 48 KB of shared memory per SM.
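To put numbers on that (using the 48 KB figure above): with SIZE_X = SIZE_Y = 4 and NTHREADS = 32, each block needs 4 * 4 * 32 * 4 = 2048 bytes of shared memory, so shared memory alone would allow up to 48 KB / 2 KB = 24 resident blocks per SM; this kernel is nowhere near the limit.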

test.cu

template <unsigned int NTHREADS>
__global__ void kernel_test()
{
    const int SIZE_X = 4;
    const int SIZE_Y = 4;

    __shared__ float tmp[SIZE_X][SIZE_Y][NTHREADS];
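    // Each thread writes only to tmp[i][j][threadIdx.x], i.e. its own
    // slot in the last dimension, so no two threads of a block should
    // ever touch the same shared-memory address.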

    for (unsigned int i = 0; i < SIZE_X; i++)
        for (unsigned int j = 0; j < SIZE_Y; j++)
            tmp[i][j][threadIdx.x] = threadIdx.x;
}

int main()
{
    const unsigned int NTHREADS = 32;
    const unsigned int NBLOCKS  = 64;

    //kernel_test<NTHREADS><<<32, NTHREADS>>>(); // ---> works fine
    kernel_test<NTHREADS><<<NBLOCKS, NTHREADS>>>();

    cudaDeviceSynchronize(); // ---> gives racecheck errors if NBLOCKS > 32
}

Compilation (--ptxas-options=-v makes ptxas print the register and shared memory usage of each kernel):

nvcc test.cu --ptxas-options=-v -o test

And run the resulting binary under racecheck:

cuda-memcheck --tool racecheck test

So, what am I missing here? Am I doing something wrong with shared memory? (I am still a beginner at this.)

Update:

The problem seems to come from cudaDeviceSynchronize(): the racecheck errors only appear when NBLOCKS > 32. Why is this happening?
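To rule out a genuine launch failure, the return code of the synchronization can also be checked (a minimal sketch, not part of the original test; it reuses kernel_test from above):

#include <cstdio>

int main()
{
    const unsigned int NTHREADS = 32;
    kernel_test<NTHREADS><<<64, NTHREADS>>>();

    // A real kernel error would surface here as a non-success code;
    // a spurious racecheck report would not.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("cudaDeviceSynchronize: %s\n", cudaGetErrorString(err));
    return 0;
}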


Solution

  • This was apparently a bug in NVIDIA's Linux driver. The spurious reports disappeared after the 313.18 driver release.
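For anyone checking whether they are affected, the installed driver version can be queried as follows (assuming nvidia-smi is available; the version also appears in the header of the plain nvidia-smi output):

nvidia-smi --query-gpu=driver_version --format=csv,noheader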