I have the following problem in CUDA. Suppose I have two locations in memory, `a` and `b`. Let's say they are 128-bit unsigned integers, used as bitmasks.

Thread A is going to modify `a` and then read from `b`. Thread B is going to modify `b` and then read from `a`.

I need to ensure that at least one thread (it does not matter which one) will know about both modifications. Since each thread will know about its own modification, I have to make sure that at least one of the threads sees the value modified by the other thread.
If I understand the CUDA documentation correctly, I could maybe achieve this as follows:
```cuda
// Thread A
(void)atomicOr(a, 0x2); // 0x2 is just an example, let's say I want to set bit 2 of *a
__threadfence();
b_read_by_A = atomicOr(b, 0);

// Thread B
(void)atomicOr(b, 0x2);
__threadfence();
a_read_by_B = atomicOr(a, 0);
```
My reasoning why this should be correct is as follows: `__threadfence()` guarantees that a write occurring after the `__threadfence()` does not become visible to other threads while a write occurring before the `__threadfence()` is still invisible. Now we have two possibilities:

Either the result of the `atomicOr(b, 0)` executed by A is visible to B when `atomicOr(b, 0x2)` is executed. Then the result of `atomicOr(a, 0x2)` is also visible to B, because of the `__threadfence()` in A. In this case, B will know about the modification of `a`.

Or the result of `atomicOr(b, 0)` is not visible to B when `atomicOr(b, 0x2)` is executed. Then, because the operations are atomic, the result of `atomicOr(b, 0x2)` will be visible to A when `atomicOr(b, 0)` is executed. In this case, A will know about the modification of `b`.
Is my reasoning correct?

Am I right in assuming that I cannot replace the second `atomicOr`, i.e. `atomicOr(., 0)`, with a simple read? And that I do need the `__threadfence()`s?
If these are really 128-bit values, then you cannot use this code, because CUDA does not have 128-bit atomic operations. (Maybe Blackwell does, but I'm guessing you do not have a Blackwell chip, and libcu++ has not caught up with Blackwell yet.)

One thread might change the upper and lower 64 bits while another thread does a torn read, seeing only half the update. Your code does not guard against that. The `__threadfence()` does fix the relaxed nature of the atomics, but it is not enough.
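To see the tearing hazard concretely, here is a minimal sketch (my illustration, assuming the 128-bit mask is stored as two 64-bit words, `mask[0]` low and `mask[1]` high):

```cuda
// Each 64-bit atomicOr is atomic on its own, but the pair is not:
// a reader can observe the state between the two halves of the update.
__device__ void writer(unsigned long long* mask) {
    atomicOr(&mask[0], 1ull); // first half of the "128-bit" update
    atomicOr(&mask[1], 1ull); // second half; a reader may run in between
}

__device__ void reader(unsigned long long* mask,
                       unsigned long long& lo, unsigned long long& hi) {
    lo = atomicOr(&mask[0], 0ull); // may already contain the new bit
    hi = atomicOr(&mask[1], 0ull); // may still hold the old value: a torn read
}
```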
You can have a 32-bit or a 64-bit `atomicOr`, but not a 128-bit one. Because of this, if you want to work on structures larger than 64 bits, you need to use a mutex.
Also, it is good to keep in mind that plain CUDA atomics use relaxed memory ordering. If you use the `<cuda/atomic>` header from libcu++ you get a more comprehensive set of memory orderings, with sequential consistency as the default.
See: https://nvidia.github.io/cccl/libcudacxx/extended_api/synchronization_primitives.html
But note that CUDA does not support lock-free atomic operations larger than 64 bits.
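If the masks fit in 64 bits, your original scheme could be written with explicit orderings instead of relaxed atomics plus fences. A sketch under that assumption (the function names are mine; `cuda::atomic_ref` requires compute capability 7.0 or higher):

```cuda
#include <cuda/atomic>

__device__ void thread_A(unsigned long long* a, unsigned long long* b,
                         unsigned long long& b_read_by_A) {
    cuda::atomic_ref<unsigned long long, cuda::thread_scope_device> ra(*a), rb(*b);
    ra.fetch_or(0x2, cuda::std::memory_order_seq_cst);       // modify a
    b_read_by_A = rb.load(cuda::std::memory_order_seq_cst);  // then read b
}

__device__ void thread_B(unsigned long long* a, unsigned long long* b,
                         unsigned long long& a_read_by_B) {
    cuda::atomic_ref<unsigned long long, cuda::thread_scope_device> ra(*a), rb(*b);
    rb.fetch_or(0x2, cuda::std::memory_order_seq_cst);       // modify b
    a_read_by_B = ra.load(cuda::std::memory_order_seq_cst);  // then read a
}
```

With seq_cst on all four operations there is a single total order over them, so at least one of the two loads must observe the other thread's fetch_or, and no separate fence is needed.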
IMO the best and fastest solution is to use a mutex to guard the data. Do not roll your own atomic solutions; it is best to stick with known idioms.
```cuda
#include <cuda/semaphore>  // cuda::binary_semaphore
#include <cuda/std/bitset> // cuda::std::bitset (needs a recent CCCL)

class mybitset {
    cuda::std::bitset<128> _bits;
    // mutable so the const reader below can still acquire the lock;
    // a binary_semaphore must be initialized with a count (1 = unlocked)
    mutable cuda::binary_semaphore<cuda::thread_scope_system> lock{1};

    struct lock_guard { // RAII lock guard, unlocks at end of scope
        cuda::binary_semaphore<cuda::thread_scope_system>& _lock;
        __device__ lock_guard(cuda::binary_semaphore<cuda::thread_scope_system>& lock)
            : _lock(lock) { _lock.acquire(); }
        __device__ ~lock_guard() { _lock.release(); }
    };

public:
    // cuda::atomic_ref<decltype(_bits), cuda::thread_scope_device> bits(_bits); // not allowed for data > 8 bytes
    __device__ void set(int pos, bool value) {
        auto guard = lock_guard(lock);
        _bits.set(pos, value);
    }
    __device__ cuda::std::bitset<128> old() const {
        auto guard = lock_guard(lock);
        return _bits; // return a copy
    }
    // for the rest, see std::bitset
};
```
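A hypothetical usage sketch (the kernel names and the construction step are mine; to keep initialization simple, construct the object once with placement new in an init kernel):

```cuda
#include <new> // placement new

__global__ void init(mybitset* flags) {
    if (blockIdx.x == 0 && threadIdx.x == 0)
        new (flags) mybitset(); // construct once, in device memory
}

__global__ void use(mybitset* flags) {
    int bit = (blockIdx.x * blockDim.x + threadIdx.x) % 128;
    flags->set(bit, true);        // each thread safely sets one bit
    auto snapshot = flags->old(); // consistent, untorn 128-bit copy
}
```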
The lock makes sure every thread sees a consistent view of the data, never one that has half-baked updates from another thread.

There is no way to do 128-bit atomics without some form of locking. You might integrate the mutex into the bitset by using one of those 128 bits as a lock bit, but then you'd have to write the atomicCAS loop that implements the lock yourself; using the mutex as shown above is much easier.
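For completeness, a sketch of such a hand-rolled lock (my illustration, using a separate 32-bit word as the mutex rather than one of the mask bits, which would complicate every mask operation):

```cuda
// Minimal spinlock built on atomicCAS; 0 = unlocked, 1 = locked.
__device__ void spin_lock(unsigned int* mtx) {
    while (atomicCAS(mtx, 0u, 1u) != 0u) { /* spin until we win the CAS */ }
    __threadfence(); // see the previous holder's writes to the guarded data
}

__device__ void spin_unlock(unsigned int* mtx) {
    __threadfence();     // publish our writes before releasing
    atomicExch(mtx, 0u); // release the lock
}
```

Beware that naive spinlocks can deadlock within a warp on pre-Volta GPUs, which lack independent thread scheduling; this is another reason to prefer the libcu++ semaphore.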
Make sure you never access the data without locking it first. Always use the lock_guard; that way you can never forget to unlock.
See: https://nvidia.github.io/cccl/libcudacxx/index.html Most stuff in libcu++ is backported to at least C++17, so you do not need to compile with C++20.