
Is there proper CUDA atomicLoad function?


I've run into the issue that the CUDA atomics API does not have an atomicLoad function. After searching on Stack Overflow, I found the following implementation of CUDA atomicLoad.

But it looks like this function fails to work in the following example:

#include <cassert>
#include <iostream>
#include <cuda_runtime_api.h>

template <typename T>
__device__ T atomicLoad(const T* addr) {
    const volatile T* vaddr = addr;  // To bypass cache
    __threadfence();                 // for seq_cst loads. Remove for acquire semantics.
    const T value = *vaddr;
    // fence to ensure that dependent reads are correctly ordered
    __threadfence();
    return value;
}

__global__ void initAtomic(unsigned& count, const unsigned initValue) {
    count = initValue;
}

__global__ void addVerify(unsigned& count, const unsigned biasAtomicValue) {
    atomicAdd(&count, 1);
    // NOTE: when the following while loop is uncommented, addVerify gets stuck;
    //       it never observes the expected final value of count
//    while (atomicLoad(&count) != (1024 * 1024 + biasAtomicValue)) {
//        printf("count = %u\n", atomicLoad(&count));
//    }
}

int main() {
    std::cout << "Hello, CUDA atomics!" << std::endl;
    const auto atomicSize = sizeof(unsigned);

    unsigned* datomic = nullptr;
    cudaMalloc(&datomic, atomicSize);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    constexpr unsigned biasAtomicValue = 11;
    initAtomic<<<1, 1, 0, stream>>>(*datomic, biasAtomicValue);
    addVerify<<<1024, 1024, 0, stream>>>(*datomic, biasAtomicValue);
    cudaStreamSynchronize(stream);

    unsigned countHost = 0;
    cudaMemcpyAsync(&countHost, datomic, atomicSize, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // wait for the copy to finish before checking
    assert(countHost == 1024 * 1024 + biasAtomicValue);

    cudaStreamDestroy(stream);
    cudaFree(datomic);

    return 0;
}

If you uncomment the section with atomicLoad, the application gets stuck ...

Maybe I missed something? Is there a proper way to load a variable that is modified atomically?

P.S.: I know a cuda::atomic implementation exists, but that API is not supported by my hardware
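
For readers whose hardware does support it (libcu++ device-scope atomics need compute capability 6.0 or newer, as far as I know), the same pattern would look roughly like the sketch below; the target parameter is hypothetical:

#include <cuda/atomic>

__global__ void addVerifyLibcu(cuda::atomic<unsigned, cuda::thread_scope_device>& count,
                               const unsigned target) {
    count.fetch_add(1, cuda::memory_order_relaxed);
    // load() is the well-defined atomic read that atomicLoad above emulates
    while (count.load(cuda::memory_order_acquire) != target) {
        // spin until every thread has contributed
    }
}

The same pre-Volta warp-scheduling caveat discussed in the solution below applies to this spin-wait as well.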


Solution

  • Since warps execute in lockstep (at least on older architectures), if you put a conditional wait on one thread and the producer on another thread of the same warp, the warp can get stuck in the wait if it is scheduled first. Only architectures with independent thread scheduling, introduced with Volta, can interleave the two, so you should query the device's major/minor compute capability before running this. Volta (7.0) and onwards is OK; a minimal query is sketched after this list.

  • You are also launching a million threads and making all of them wait at once. The GPU may not have enough execution ports/pipelines to keep a million threads in flight; it might only work on a GPU with 64k CUDA pipelines (assuming 16 threads in flight per pipeline). Instead of making millions of threads wait, spawn a sub-kernel from the main kernel when the condition occurs; dynamic parallelism is the key feature here (see the second sketch after this list). Check the minimum major/minor compute capability for dynamic parallelism too, in case someone is using an ancient NVIDIA card.

  • atomicAdd returns the old value at the target address. If you mean to launch a third kernel exactly once, and only after the condition holds, you can simply check that returned value with an if before starting the dynamically parallel launch, as the same sketch shows.

  • You are printing a million times; that is bad for performance, and it may take a while before the text appears in the console output if you have a slow CPU/RAM.

  • Lastly, you can optimize the atomic operations by accumulating in shared memory first and issuing only one global atomic per block (last sketch below). The global counter then advances in per-block jumps rather than by 1, so a check against an exact condition value can be skipped over when there are more threads than the condition value (assuming increments of 1); this trick is therefore not applicable to all algorithms.
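
A minimal compute-capability query for the first point, assuming device 0 is the one in use; independent thread scheduling needs major version 7 (Volta) or higher:

#include <cstdio>
#include <cuda_runtime_api.h>

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (prop.major >= 7) {
        std::printf("cc %d.%d: independent thread scheduling, intra-warp spin-waits can make progress\n", prop.major, prop.minor);
    } else {
        std::printf("cc %d.%d: lockstep warps, avoid intra-warp spin-waits\n", prop.major, prop.minor);
    }
    return 0;
}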
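
A sketch of the dynamic-parallelism route combined with the atomicAdd return-value check; onComplete is a hypothetical follow-up kernel, and this needs compute capability 3.5+ plus compilation with -rdc=true:

#include <cstdio>

// Hypothetical follow-up kernel, launched exactly once from the device.
__global__ void onComplete(const unsigned* count) {
    printf("count reached %u\n", *count);
}

__global__ void addAndLaunch(unsigned* count, const unsigned target) {
    // atomicAdd returns the value *before* the addition, so exactly one
    // thread (the one performing the final increment) sees target - 1.
    if (atomicAdd(count, 1) == target - 1) {
        onComplete<<<1, 1>>>(count);  // launched from the device, no spin-wait needed
    }
}

This replaces a million spinning threads with a single conditional launch.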
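
And a sketch of the shared-memory optimization from the last point, assuming every thread contributes an increment of 1:

__global__ void addViaShared(unsigned* count) {
    __shared__ unsigned blockSum;
    if (threadIdx.x == 0) blockSum = 0;
    __syncthreads();

    atomicAdd(&blockSum, 1);         // contention stays within the block
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(count, blockSum);  // one global atomic per block
}

The global counter now jumps by blockDim.x per block, which is exactly the caveat mentioned in the last point.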