opencl non-deterministic gpu-atomics

Why does the OpenCL atomic_add implementation for float produce non-deterministic results?


I need to add a float to the same global memory address from within multiple threads in OpenCL. For any two simulation runs, the outcome is never identical, and the calls to the atomic_add_f function are the source of this error. I'm using an Nvidia Titan Xp GPU with driver 436.02.

OpenCL does not support atomic_add for float, but there are workarounds based on atomic_cmpxchg:

void atomic_add_f(volatile global float* addr, const float val) {
    union {
        uint  u32;
        float f32;
    } next, expected, current;
    current.f32 = *addr;
    do {
        next.f32 = (expected.f32=current.f32)+val; // ...*val for atomic_mul_f()
        current.u32 = atomic_cmpxchg((volatile global uint*)addr, expected.u32, next.u32);
    } while(current.u32!=expected.u32);
}

However, this code produces a non-deterministic result: the values vary slightly from run to run, as if a race condition were present.

I also tried this version

void atomic_add_f(volatile global float* addr, const float val) {
    private float old, sum;
    do {
        old = *addr;
        sum = old+val;
    } while(atomic_cmpxchg((volatile global int*)addr, as_int(old), as_int(sum))!=as_int(old));
}

which does not work properly either, and neither does the version presented here.

How can this be, and how can I solve it?


Solution

  • Due to the way floating-point arithmetic works, (a + b) + c and a + (b + c) do not necessarily produce exactly the same result, because each intermediate result is rounded to a representable value. The work-items of your kernel do not run in a deterministic order, so the order in which the values are added changes from run to run, and your sum therefore won't be deterministic.

    Wikipedia provides some examples of floating-point calculations whose results depend on the order in which the operations are grouped.
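To make the grouping effect concrete, here is a minimal host-side sketch in plain C (not OpenCL kernel code). The function names and the input values are chosen purely for illustration; they do not come from the question.

```c
#include <stdio.h>

/* Computes (a + b) + c. The volatile temporaries are there to keep the
   compiler from folding or regrouping the two additions. */
float sum_left_first(float a, float b, float c) {
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

/* Computes a + (b + c) with the same precautions. */
float sum_right_first(float a, float b, float c) {
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}

/* Hypothetical helper that prints both groupings for one set of inputs. */
void print_grouping_demo(void) {
    const float a = 1.0e20f, b = -1.0e20f, c = 1.0f;
    printf("(a + b) + c = %g\n", sum_left_first(a, b, c));
    printf("a + (b + c) = %g\n", sum_right_first(a, b, c));
}
```

With these inputs, a + b is exactly 0, so the first grouping yields 1. In the second grouping, c is far below the precision available at the magnitude of b, so b + c rounds back to b and the total is 0. An atomic accumulation whose order depends on scheduling is exposed to the same effect, usually on a much smaller scale.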

    Possible solutions:

    Note that OpenCL does not mandate any specific rounding behaviour, so even if you change your accumulation to be deterministic, the rest of your algorithm will most likely not produce consistent results across different OpenCL implementations. If you absolutely must obtain identical results for identical inputs under all circumstances, don't use floating-point arithmetic; use appropriately sized integers instead.
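The integer suggestion can be sketched as a fixed-point accumulation. The following is a host-side illustration in plain C under stated assumptions: the scale factor FIXED_SCALE (2^20) and all function names are hypothetical, and the scale would have to be chosen to match the value range and precision of the actual simulation. In a kernel, the accumulation step would presumably map to the built-in integer atomic_add, or to atom_add on long where the cl_khr_int64_base_atomics extension is available.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed scale: 2^20 fractional steps per unit. */
#define FIXED_SCALE 1048576.0

/* Converts a float to fixed point, rounding half away from zero. */
int64_t to_fixed(float x) {
    double scaled = (double)x * FIXED_SCALE;
    return (int64_t)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}

/* Converts a fixed-point sum back to float for further use. */
float from_fixed(int64_t s) {
    return (float)((double)s / FIXED_SCALE);
}

/* Accumulates in index order 0 .. n-1. */
int64_t sum_fixed_forward(const float *v, size_t n) {
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += to_fixed(v[i]);
    }
    return acc;
}

/* Accumulates in index order n-1 .. 0, standing in for a different
   scheduling of the work-items. */
int64_t sum_fixed_backward(const float *v, size_t n) {
    int64_t acc = 0;
    for (size_t i = n; i > 0; --i) {
        acc += to_fixed(v[i - 1]);
    }
    return acc;
}
```

Because integer addition is associative and commutative as long as the accumulator does not overflow, both orders give the same bit pattern. The trade-off is that each input is rounded once to the fixed-point grid, and the representable range is limited by the chosen scale.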