Tags: cuda, synchronization, locking, critical-section

Implementing a critical section in CUDA


I'm trying to implement a critical section in CUDA using atomic instructions, but I ran into some trouble. I created the following test program to show the problem:

#include <cuda_runtime.h>
#include <cutil_inline.h>
#include <stdio.h>

__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    while (atomicExch(&(locks[id]), 1u) != 0u) {} //lock
    //critical section would go here
    atomicExch(&(locks[id]),0u); //unlock
}

int main(int argc, char** argv) {
    //initialize the locks array on the GPU to (0...0)
    unsigned int* locks;
    unsigned int zeros[10]; for (int i = 0; i < 10; i++) {zeros[i] = 0u;}
    cutilSafeCall(cudaMalloc((void**)&locks, sizeof(unsigned int)*10));
    cutilSafeCall(cudaMemcpy(locks, zeros, sizeof(unsigned int)*10, cudaMemcpyHostToDevice));

    //Run the kernel:
    k_testLocking<<<dim3(1), dim3(256)>>>(locks, 10);

    //Check the error messages:
    cudaError_t error = cudaGetLastError();
    cutilSafeCall(cudaFree(locks));
    if (cudaSuccess != error) {
        printf("error 1: CUDA ERROR (%d) {%s}\n", error, cudaGetErrorString(error));
        exit(-1);
    }
    return 0;
}

This code, unfortunately, hard freezes my machine for several seconds and finally exits, printing out the message:

fcudaSafeCall() Runtime API error in file <XXX.cu>, line XXX : the launch timed out and was terminated.

which means that one of those while loops is not returning, but it seems like this should work.

As a reminder, atomicExch(unsigned int* address, unsigned int val) atomically sets the memory location pointed to by address to val and returns the old value. The idea behind my locking mechanism is that the lock starts at 0u, so exactly one thread should get past the while loop, while every other thread waits in the loop because it reads locks[id] as 1u. When the thread holding the lock is done with the critical section, it resets the lock to 0u so another thread can enter.
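For comparison, the same exchange-based spin lock behaves exactly the way I described when it is written for CPU threads with std::atomic. This is only a host-side sketch of the reasoning I had in mind (the counter is made up, just so there is something to protect):

#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

std::atomic<unsigned int> lockWord(0u);
int counter = 0; // shared data protected by the lock

void worker() {
    while (lockWord.exchange(1u) != 0u) {} // lock: spin until the previous value was 0
    ++counter;                             // critical section
    lockWord.store(0u);                    // unlock
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 256; ++i) threads.emplace_back(worker);
    for (std::thread& t : threads) t.join();
    std::printf("counter = %d\n", counter); // always 256 on the CPU
    return 0;
}

Every run of this host program finishes and prints 256, so the locking logic itself seemed sound to me.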

What am I missing?

By the way, I am compiling with:

nvcc -arch sm_11 -Ipath/to/cuda/C/common/inc XXX.cu

Solution

  • Okay, I figured it out, and this is yet-another-one-of-the-cuda-paradigm-pains.

    As any good CUDA programmer knows (notice that I did not remember this, which makes me a bad CUDA programmer, I think), all threads in a warp execute in lockstep: they must execute the same instruction, or be masked off while the rest of the warp runs a divergent branch. The code I wrote would work perfectly if not for this fact. As it is, however, two threads in the same warp are likely to contend for the same lock. The one that acquires it stops executing the loop body, but it cannot continue past the loop until every other thread in its warp has also left the loop. Unfortunately, the other threads never leave, because they are spinning on a lock that the first thread will only release after the loop, so the warp deadlocks.

    Here is a kernel that will do the trick without error:

    __global__ void k_testLocking(unsigned int* locks, int n) {
        int id = threadIdx.x % n;
        bool leaveLoop = false;
        while (!leaveLoop) {
            if (atomicExch(&(locks[id]), 1u) == 0u) { // try the lock exactly once per iteration
                //critical section would go here
                leaveLoop = true;
                atomicExch(&(locks[id]), 0u); //unlock
            }
        }
    }
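
    To see the lock actually protecting something, here is a sketch of the same pattern guarding a plain, non-atomic update (the counters array is made up for this illustration). The volatile pointer and the __threadfence() before the unlock are there so the write is not held in a register and becomes visible to the next thread that takes the lock:

    __global__ void k_countUnderLock(unsigned int* locks, unsigned int* counters, int n) {
        int id = threadIdx.x % n;
        bool leaveLoop = false;
        while (!leaveLoop) {
            if (atomicExch(&(locks[id]), 1u) == 0u) {
                volatile unsigned int* c = counters + id;
                *c = *c + 1u;       // critical section: plain read-modify-write
                __threadfence();    // flush the write before releasing the lock
                leaveLoop = true;
                atomicExch(&(locks[id]), 0u); //unlock
            }
        }
    }

    Launched as <<<dim3(1), dim3(256)>>> with n = 10, the ten counters should add up to 256 after copying them back to the host.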