openclreduction

OpenCL sum `cl_khr_fp64` double values into a single number


From this question and this question I managed to compile a minimal example of summing a vector into a single double inside OpenCL 1.2.

    /* https://suhorukov.blogspot.com/2011/12/opencl-11-atomic-operations-on-floating.html */
    inline void AtomicAdd(volatile __global double *source, const double operand) {
      union { unsigned int intVal; double floatVal; } prevVal, newVal;
      do {
        prevVal.floatVal = *source;
        newVal.floatVal = prevVal.floatVal + operand;
      } while( atomic_cmpxchg((volatile __global unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal );
    }

    void kernel cost_function(__constant double* inputs, __global double* outputs){
      int index = get_global_id(0);

      if(0 == error_index){ outputs[0] = 0.0; }
      barrier(CLK_GLOBAL_MEM_FENCE);

      AtomicAdd(&outputs[0], inputs[index]); /* (1) */
      //AtomicAdd(&outputs[0], 5.0); /* (2) */

    }

As in fact this solution is incorrect because the result is always 0 when the buffer is accessed. What might the problem with this?

the code at /* (1) */ doesn't work, and neither does the code at /* (2) */, which is only there to test the logic independent of any inputs.

Is barrier(CLK_GLOBAL_MEM_FENCE); used correctly here to reset the output before any calculations are done to it?

According to the specs in OpenCL 1.2 single precision floating point numbers are supported by atomic operations, is this(AtomicAdd) a feasible method of extending the support to double precision numbers or am I missing something?

Of course the device I am testing with supports cl_khr_fp64˙of course.


Solution

  • Your AtomicAdd is incorrect. Namely, the 2 errors are:

    1. In the union, intVal must be a 64-bit integer and not 32-bit integer.
    2. Use the 64-bit atom_cmpxchg function and not the 32-bit atomic_cmpxchg function.

    The correct implementation is:

    #pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
    inline void AtomicAdd(volatile __global double *source, const double operand) {
        union { unsigned ulong u64; double f64; } prevVal, newVal;
        do {
            prevVal.f64 = *source;
            newVal.f64 = prevVal.f64 + operand;
        } while(atom_cmpxchg((volatile __global ulong*)source, prevVal.u64, newVal.u64) != prevVal.u64);
    }
    

    barrier(CLK_GLOBAL_MEM_FENCE); is used correctly here. Note that a barrier must not be in an if- or else-branch.

    UPDATE: According to STREAMHPC, the original implementation you use is not guaranteed to produce correct results. There is an improved implementation:

    void __attribute__((always_inline)) atomic_add_f(volatile global float* addr, const float val) {
        union {
            uint  u32;
            float f32;
        } next, expected, current;
        current.f32 = *addr;
        do {
            next.f32 = (expected.f32=current.f32)+val; // ...*val for atomic_mul_f()
            current.u32 = atomic_cmpxchg((volatile global uint*)addr, expected.u32, next.u32);
        } while(current.u32!=expected.u32);
    }
    
    #ifdef cl_khr_int64_base_atomics
    #pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
    void __attribute__((always_inline)) atomic_add_d(volatile global double* addr, const double val) {
        union {
            ulong  u64;
            double f64;
        } next, expected, current;
        current.f64 = *addr;
        do {
            next.f64 = (expected.f64=current.f64)+val; // ...*val for atomic_mul_d()
            current.u64 = atom_cmpxchg((volatile global ulong*)addr, expected.u64, next.u64);
        } while(current.u64!=expected.u64);
    }
    #endif