Tags: cuda, atomic, critical-section

CUDA: release lock implemented by atomic operations


GPU: Quadro RTX 4000

CUDA: 11.7

I have implemented a global lock using atomic operations, like this:

__global__ void floatAddLockExch(float* addr, volatile int* lock) {
    bool goon = true;
    while (goon) {                                     // spin until the lock is acquired
        if (atomicCAS((int*)lock, 0, 1) == 0) {        // try to take the lock (0 -> 1)
            *addr += 1; //critical task

            int lockValue = atomicExch((int*)lock, 0); // release the lock atomically (1 -> 0)
            if (lockValue != 1) {
                printf("Error in <%d, %d> \n", blockIdx.x, threadIdx.x);
            }
//
//          *lock = 0;
//          __threadfence();

            goon = false;
        }
    }
}

Testing this kernel with <<<1, 1024>>>, the output (the value of *addr) was 1024; with <<<2, 1024>>>, the output was 1025. There was no "Error..." output in either case.

With the help of Cuda atomics change flag, I implemented the kernel as:

__global__ void floatAddLockFence(float* addr, volatile int* lock) {
    bool goon = true;
    while (goon) {
        if (atomicCAS((int*)lock, 0, 1) == 0) {
            *addr += 1; //critical task

//          int lockValue = atomicExch((int*)lock, 0);
//          if (lockValue != 1) {
//              printf("Error in <%d, %d> \n", blockIdx.x, threadIdx.x);
//          }

            *lock = 0;       // release the lock with a plain (non-atomic) store
            __threadfence(); // make the update to *addr visible to all blocks

            goon = false;
        }
    }
}

The output was 1024 in the <<<1, 1024>>> case and 2048 in the <<<2, 1024>>> case.

The test code is in a gist.
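
The gist itself is not reproduced here. As an illustration only, a minimal host-side harness consistent with the tests described above might look like the following sketch (the file layout, lack of error checking, and exact launch configuration are assumptions; it is meant to sit in the same .cu file as the kernels above):

// Hypothetical test harness (not the actual gist) for the kernels above.
#include <cstdio>

int main() {
    float* addr;
    int* lock;
    cudaMalloc(&addr, sizeof(float));
    cudaMalloc(&lock, sizeof(int));
    cudaMemset(addr, 0, sizeof(float)); // all-zero bits == 0.0f
    cudaMemset(lock, 0, sizeof(int));   // lock starts unlocked

    floatAddLockExch<<<2, 1024>>>(addr, lock); // also tested with <<<1, 1024>>>
    cudaDeviceSynchronize();

    float result;
    cudaMemcpy(&result, addr, sizeof(float), cudaMemcpyDeviceToHost);
    printf("*addr = %f\n", result); // the question reports 1025 here for floatAddLockExch with <<<2, 1024>>>
    return 0;
}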

Atomic operations on the global variable lock are supposed to be atomic across all blocks, so why did floatAddLockExch fail in the multi-block case? And how did __threadfence() solve the problem?


Solution

  • The reason the __threadfence() makes a difference is not the lock mechanism itself, but its effect on the handling of *addr.

    The lock mechanism is working in the multi-block case, in that it is still serializing thread updates to the *addr variable, but the handling of *addr itself is being affected by L1 cache activity. Threads within the same threadblock are guaranteed to have a consistent view of the L1 cache contents. Threads in separate blocks are not, because those blocks may be located on separate SMs, and separate SMs have separate L1 caches.

    The __threadfence() makes this update of *addr visible to all blocks. You might also be able to witness the same effect with your first (failing) test if you changed the kernel prototype decoration of float *addr to volatile float *addr (a sketch of this follows below). The volatile keyword generally causes the L1 cache to be bypassed.
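
    As an illustration of that suggestion only (a sketch; the kernel name floatAddLockExchVolatile is made up here), the sole change to the failing kernel is the volatile qualifier on addr:

    __global__ void floatAddLockExchVolatile(volatile float* addr, volatile int* lock) {
        bool goon = true;
        while (goon) {
            if (atomicCAS((int*)lock, 0, 1) == 0) {
                *addr += 1; // volatile access generally bypasses the L1 cache
                int lockValue = atomicExch((int*)lock, 0);
                if (lockValue != 1) {
                    printf("Error in <%d, %d> \n", blockIdx.x, threadIdx.x);
                }
                goon = false;
            }
        }
    }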

    CUDA has recently introduced a libcu++ library that emulates parts of the "standard" C++ libraries.

    One of the functionalities available so far is semaphores.

    The following is one way to use libcu++ to achieve a "critical section":

    $ cat t1941.cu
    #include <cuda/semaphore>
    #include <iostream>
    #include <cstdio>
     
    __device__ cuda::binary_semaphore<cuda::thread_scope_device> s(1);
     
    __global__ void k(volatile int *d){
     
      s.acquire();     // enter the critical section
      int test = *d;
      printf("block: %d, thread: %d, test: %d\n", blockIdx.x, threadIdx.x, test);
      test += 1;
      *d = test;
      __threadfence(); // make the update to *d visible to all blocks
      s.release();     // leave the critical section
    }
     
     
    int main(){
     
      int *d;
      int h;
      cudaMalloc(&d, sizeof(d[0]));
      cudaMemset(d, 0, sizeof(d[0]));
      k<<<2,2>>>(d);
      cudaMemcpy(&h, d, sizeof(d[0]), cudaMemcpyDeviceToHost);
      std::cout << "d = " << h << std::endl;
    }
    $ nvcc -o t1941 t1941.cu -arch=sm_70
    $ ./t1941
    block: 0, thread: 0, test: 0
    block: 0, thread: 1, test: 1
    block: 1, thread: 0, test: 2
    block: 1, thread: 1, test: 3
    d = 4
    $
    

    A few notes on the above example.