cudaatomicuint64webgpuwgsl

How to correctly simulate `atomicAdd` on `u64` by using two `u32` buffers?


I'm trying to do atomic operations on u64. But since it's not supported, the number is stored in TWO u32 buffers

The issue is that I'm not sure how to do atomicAdd correctly to simulate the effect it would have had on u64. All while avoiding memory modification by other threads between loading and storing the values.

my current idea is this:


fn tou64(value: u32) -> vec2u {
        return vec2u(u32(value / BASE), value % BASE);
}

fn add(a: vec2u, b: vec2u) -> vec2u {
    let x = a.x + b.x + u32((a.y + b.y) / BASE);
    let y = (a.y + b.y) % BASE;
    return vec2u(x, y);
}

fn main() {
// .....

// convert the value from u32 to 2-buffer representation of u64
let b: vec2u = tou64(value);
// fetch the old value from the 2 buffers
var a = vec2u(0); 
a.x = atomicLoad(&buffer[index]);
a.y = atomicLoad(&buffer[index+1]);
// add the value to the buffer value
let result = add(a, b);
// store back the buffer results 
atomicStore(&buffer[index], result.x);
atomicStore(&buffer[index+1], result.y);
}

This works only when no other thread modifies the buffer at the same index. But it's a very weak implementation otherwise. Thread 1 could change the value of buffer[index+1] while thread 2 just read the old buffer[index] value and the new buffer[index+1] value

Edit: in CUDA's guide, it's noted that:

Note that any atomic operation can be implemented based on atomicCAS() (Compare And Swap)

and this example of AtomicAdd on double is provided

#if __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                               __longlong_as_double(assumed)));

    // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);

    return __longlong_as_double(old);
}
#endif

can this be applied to webgpu as well using atomicExchange? this answer shows how to do atomic operations on user-defined types. How can I do something similar but for webgpu?


Solution

  • Here is a solution for atomicAddU64, thanks to @PeterCordes!
    Manually do the carry between a u32 low half and a u32 high half, with each u32 addition being an atomic RMW. This requires the fetch_add / atomicAdd return value from the low half so we can check for carry-out.
    Since we only have 2 halves, not a wider integer, we don't need to handle carry-in and carry-out from the same add. Just the sum = a+b; carry_out = sum<b; trick.

    The total count at the end will be correct, but there's no way to read a correct snapshot of the current count while other threads are adding. Multiple threads might have added to the low half and still be waiting to add to the high half, for example.

    (It might be better to have each thread add into per-thread local accumulators to sum once at the end, instead of having all threads access a single shared variable all the time.)

    fn main() {
        // .....
    
        // convert the value from u32 or f32 to 2-buffer representation of u64
        let b: vec2u = tou64(value);
    
        // low: no need for modulu since it will wrap by itself
        // take old value on the buffer to check for carry
        let oldValue = atomicAdd(&buffer[low_index], b.y);
        // high: add high part + carry
        // check if the sum cause value to wrap
        atomicAdd(&buffer[high_index], b.x + u32((oldValue + b.y) < b.y));
    }
    

    To convert u64 to vec2u and back:

    // this is just a pseudocode! wgsl doesn't support u64 yet
    // do this operation in c++/js or while data processing
    fn u64ToVec2u(value: u64) -> vec2u {
        let low = u32(value);
        let high = u32(value >> 32);
        return vec2u(high, low);
    }
    
    fn vec2uToU64(value: vec2u) -> u64 {
        return (value.x << 32) + value.y; // x = high, y = low
    }
    
    

    and here's a solution for float computations as well, can be done in wgsl

    // to save 2 ^ 10 = 1024 (ie 3 digits after the decimal point)
    const DEGREE_TO_SAVE = 10; 
    fn tou64(value: f32) -> vec2u {
        // modulu is important here!! 
        // because converting from float to u32 will not automatically wrap
        let low = u32((value * pow(2., DEGREE_TO_SAVE)) % pow(2., 32));
        let high = u32(value /  pow(2., 32 - DEGREE_TO_SAVE));
        return vec2u(high, low);
    }
    
    fn tof32(value: vec2u) -> f32 {
        return f32(value.x) * pow(2., 32 - DEGREE_TO_SAVE) + f32(value.y) / pow(2., DEGREE_TO_SAVE);
    
    }