GPU: Quadro RTX 4000
CUDA: 11.7
I have implemented a global lock using atomic operations, like this:
__global__ void floatAddLockExch(float* addr, volatile int* lock) {
    bool goon = true;
    while (goon) {
        if (atomicCAS((int*)lock, 0, 1) == 0) {
            *addr += 1; //critical task
            int lockValue = atomicExch((int*)lock, 0);
            if (lockValue != 1) {
                printf("Error in <%d, %d> \n", blockIdx.x, threadIdx.x);
            }
            //
            // *lock = 0;
            // __threadfence();
            goon = false;
        }
    }
}
Tested this kernel with <<<1, 1024>>>, the output (the value of *addr) was 1024; tested with <<<2, 1024>>>, the output was 1025. There was no "Error..." output in either case.
With help from Cuda atomics change flag, I've implemented the kernel as:
__global__ void floatAddLockFence(float* addr, volatile int* lock) {
    bool goon = true;
    while (goon) {
        if (atomicCAS((int*)lock, 0, 1) == 0) {
            *addr += 1; //critical task
            // int lockValue = atomicExch((int*)lock, 0);
            // if (lockValue != 1) {
            //     printf("Error in <%d, %d> \n", blockIdx.x, threadIdx.x);
            // }
            *lock = 0;
            __threadfence();
            goon = false;
        }
    }
}
The output was 1024 in the <<<1, 1024>>> case and 2048 in the <<<2, 1024>>> case.
Atomic operations on the global variable lock are supposed to be atomic across all blocks, so why did floatAddLockExch fail in the multi-block case, and how did __threadfence() solve the problem?
The reason __threadfence() makes a difference is not the lock mechanism itself, but its effect on the handling of *addr.
The lock mechanism is working in the multi-block case, in that it is still serializing thread updates to the *addr variable, but the handling of *addr itself is being affected by L1 cache activity. Threads within the same threadblock are guaranteed to have a consistent view of the L1 cache contents. Threads across separate blocks are not, because those blocks may be located on separate SMs, and separate SMs have separate L1 caches.
The __threadfence() makes this update of *addr visible to all blocks. You might also be able to witness the same effect if you used your first (failing) test but changed the kernel prototype decoration of float *addr to volatile float *addr. The volatile keyword generally causes bypassing of the L1 cache.
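For illustration, here is a minimal sketch of that volatile variant (my own untested adaptation of the first kernel; the name floatAddLockExchVolatile is made up here, and everything else is left unchanged):
__global__ void floatAddLockExchVolatile(volatile float* addr, volatile int* lock) {
    bool goon = true;
    while (goon) {
        if (atomicCAS((int*)lock, 0, 1) == 0) {
            *addr += 1;                // critical task; the volatile qualifier generally bypasses L1 for this access
            atomicExch((int*)lock, 0); // release the lock
            goon = false;
        }
    }
}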
CUDA has recently introduced a libcu++ library that emulates parts of the "standard" C++ libraries.
One of the functionalities available so far is semaphores.
The following shows one way to use libcu++ to implement a "critical section":
$ cat t1941.cu
#include <cuda/semaphore>
#include <iostream>
#include <cstdio>
__device__ cuda::binary_semaphore<cuda::thread_scope_device> s(1);
__global__ void k(volatile int *d){
    s.acquire();
    int test = *d;
    printf("block: %d, thread: %d, test: %d\n", blockIdx.x, threadIdx.x, test);
    test += 1;
    *d = test;
    __threadfence();
    s.release();
}

int main(){
    int *d;
    int h;
    cudaMalloc(&d, sizeof(d[0]));
    cudaMemset(d, 0, sizeof(d[0]));
    k<<<2,2>>>(d);
    cudaMemcpy(&h, d, sizeof(d[0]), cudaMemcpyDeviceToHost);
    std::cout << "d = " << h << std::endl;
}
$ nvcc -o t1941 t1941.cu -arch=sm_70
$ ./t1941
block: 0, thread: 0, test: 0
block: 0, thread: 1, test: 1
block: 1, thread: 0, test: 2
block: 1, thread: 1, test: 3
d = 4
$
A few notes on the above example: initialization of the semaphore via the __device__ variable definition only happens once, which is sufficient for this example. If you intend to use the semaphore repeatedly, perhaps across several kernel launches, then it will be necessary to re-initialize the device variable, perhaps using cudaMemcpyToSymbol(), as sketched below.
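As a rough illustration of that re-initialization (this is my own untested sketch, reusing s, k and d from the example above, and assuming a host-constructed semaphore object can simply be byte-copied over the __device__ symbol):
cuda::binary_semaphore<cuda::thread_scope_device> h_s(1); // freshly constructed host-side semaphore in the released state
cudaMemcpyToSymbol(s, &h_s, sizeof(h_s));                 // overwrite the __device__ symbol before the next use
k<<<2,2>>>(d);                                            // this launch starts from a re-initialized semaphore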