If I have the following code:
#include <cuda.h>
#include <cuda/atomic>
#include <cassert>
#include <stdio.h>
__global__ void init(int* atomicCounters) {
    atomicCounters[threadIdx.x] = 0;
}

__global__ void test(int* atomicCounters) {
    for (auto i = 0; i < 100000; i++) {
        __syncwarp();
        int val = 1;
        const auto old = atomicAdd(&atomicCounters[threadIdx.x], val);
        const auto match_mask = __match_any_sync(-1u, old);
        if (match_mask != -1u) {
            printf("error, mask should be -1, but is $%x\n", match_mask);
            assert(false);
        }
    }
    __syncwarp();
    if (threadIdx.x == 0) { printf("b: %i is done\n", blockIdx.x); }
}

int main() {
    // error checking code omitted
    int* atomicCounters;
    cudaMalloc(&atomicCounters, 32 * sizeof(int));
    init<<<1, 32>>>(atomicCounters);
    test<<<32, 32>>>(atomicCounters);
    cudaDeviceSynchronize();
}
Is the assert guaranteed to never fire?
I.e. if all threads in the warp increment their own counter, and these warps are always synchronized before doing the atomicAdd, will these adds never tear?
Link to godbolt: https://cuda.godbolt.org/z/aca9PrPcb
For reference, I'm using this pattern to update 3 counters in parallel atomically, and I want to know if that is safe.
Something like (pseudo code):
#include <cuda/atomic>

int sum[3];
sum[0] = get_sum1();
sum[1] = get_sum2();
sum[2] = get_sum3();
sum[2] *= sum[2];

__syncwarp();
if (threadIdx.x < 3) {
    atomicAdd(&counter[threadIdx.x], sum[threadIdx.x]);
}

/// further down in the code
// query 3 counters
__syncwarp();
if (threadIdx.x < 3) {
    auto old = counter[threadIdx.x].load(acquire);
    assert(__match_any_sync(1 + 2 + 4, old) == (1 + 2 + 4));
}
I don't want to add a lock here, but just update the 3 counters in a synchronized manner. And I was thinking this should be safe provided I ensure all 3 values fall in the same cache-line.
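To make the pseudo code a bit more concrete, this is roughly what I mean, written against libcu++. The counter_t alias, the device thread scope, and the relaxed/acquire orderings are just my current guesses for how I'd spell it, not something I'm claiming is required:

#include <cuda/atomic>
#include <cassert>

// The 3 counters live in global memory; device scope is an assumption on my part.
using counter_t = cuda::atomic<int, cuda::thread_scope_device>;

__device__ void update_counters(counter_t* counter, const int (&sum)[3]) {
    __syncwarp();
    if (threadIdx.x < 3) {
        // lanes 0..2 each add "their" partial sum to "their" counter
        counter[threadIdx.x].fetch_add(sum[threadIdx.x], cuda::std::memory_order_relaxed);
    }
}

__device__ void query_counters(counter_t* counter) {
    __syncwarp();
    if (threadIdx.x < 3) {
        const int old = counter[threadIdx.x].load(cuda::std::memory_order_acquire);
        // as in the pseudo code above: check that lanes 0..2 all read the same value
        assert(__match_any_sync(1 + 2 + 4, old) == (1 + 2 + 4));
    }
}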
You should never assume anything about the underlying hardware model in CUDA that Nvidia hasn't specifically told you to assume. Don't speculate, and don't sniff around bespoke hardware documents: if Nvidia hasn't told you that you can rely on something, don't rely on it, because it can come back to bite you.
For example, on some older Nvidia hardware it used to be possible to reuse shared memory contents between kernel invocations if you didn't specifically clear them yourself (there were other things to it, but that's the gist). Nvidia's documentation explicitly stated not to assume this was the case, yet many CUDA code bases relied on it for "performance" reasons, and they no longer work on most Nvidia GPUs today, since this behavior has not been preserved.
More relevant to your example: Nvidia hardware has historically not executed a warp all at once. There used to be the concept of a "half warp", where only 16 threads actually executed at the same time and then the other 16, yet you were still expected to treat the warp as a single 32-thread group. This is no longer true on the latest hardware, but it's an example of what you can't rely on in the hardware architecture. That kind of architecture alone could cause this:
const auto old = atomicAdd(&atomicCounters[threadIdx.x], val);
to not behave the way you expect with respect to other thread blocks. More generally, any non-uniformity in another warp's atomicAdd has the hypothetical potential to push a single thread's counter out of sync on a CUDA-compliant GPU architecture, because all you are guaranteed is the ordering constraints of the atomics themselves.
To your actual example:
I don't want to add a lock here, but just update the 3 counters in a synchronized manner. And I was thinking this should be safe provided I ensure all 3 values fall in the same cache-line.
If you're doing an atomic add, that means you need atomic ordering of the additions, which means you expect other threads to concurrently add to the same values you do. If that's the case, you can't guarantee anything about when those additions happen or when they become visible to a completely different SM and warp. The CUDA programming model does not define what happens here.
If you want to keep an arbitrary amount of data synchronized, you need a lock to deal with it. This is not really any different from how it works in host code (large atomics may turn into implicit locks underneath in C++). See Cuda atomics change flag for a custom lock, or, for modern solutions using libcu++, this CUDA thread: CUDA: release lock implemented by atomic operations, and make your critical data volatile.
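As a rough sketch of the lock-based direction with libcu++ (the Counters layout, the names, and the one-lane-per-warp convention are mine, not taken from your code or verbatim from those threads):

#include <cuda/atomic>

// All 3 counters sit behind one device-wide lock word.
// Zero-initialize the whole struct (e.g. with cudaMemset) before launching anything.
struct Counters {
    int lock;      // 0 = free, 1 = held
    int value[3];  // the 3 counters, only touched while holding the lock
};

__device__ void add_all(Counters* c, const int (&sum)[3]) {
    if (threadIdx.x % 32 == 0) {  // one lane per warp takes the lock
        cuda::atomic_ref<int, cuda::thread_scope_device> lock(c->lock);
        // spin until we swap the lock word from 0 to 1
        while (lock.exchange(1, cuda::std::memory_order_acquire) != 0) { }
        for (int i = 0; i < 3; i++) {
            c->value[i] += sum[i];  // all 3 updates land as one unit
        }
        lock.store(0, cuda::std::memory_order_release);
    }
    __syncwarp();
}

Readers have to take the same lock to copy out the 3 values; that consistent snapshot is exactly what the lock buys you over 3 independent atomicAdds.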