Tags: opencl, opencl-c

OpenCL shared memory reduction correctness


While playing with OpenCL, I encountered a bug that I can't explain. Below is a reduction algorithm adapted to GPU-like accelerators. There are two versions of it: V0 uses shared (local) memory, and V1 uses the work_group_reduce_add built-in from OpenCL 2.0. V0 fails when I use a work group larger than 64 work items. Note that the shared memory array holds 128 unsigned int values and that I change this capacity to fit the work group size.

The V1 code works as expected for various work group sizes, at least for power-of-two sizes, including sizes > 64.

V0 fails for work group sizes > 64. For example, the code below is meant to run with work groups of 128 work items. However, if I add the if(x_local_coordinate < i) check before the shared memory update inside the loop, the code works as expected.

I am interested to know why V0 does not work as expected when what seems to me to be a redundant check (if(x_local_coordinate < i)) is omitted and the kernel is run with a work group size > 64.

I have looked at similar implementations without understanding why the code below does not work. Maybe the problem comes from the host side, but then why does V1 always work?

Thank you.

typedef unsigned int Type;

kernel void ReduceBulk(global const Type* restrict buffer,
                       unsigned long buffer_size,
                       global Type* restrict reduction,
                       unsigned long sum_per_thread) {
    const unsigned long x_coordinate           = get_global_id(0);
    const unsigned long x_dimension_size       = get_global_size(0);
    const unsigned long x_local_dimension_size = get_local_size(0);
    const unsigned long x_local_coordinate     = get_local_id(0);

    Type sum = 0;

    for(unsigned long i = 0; i < sum_per_thread; ++i) {
        const unsigned long index = x_local_dimension_size * i + x_coordinate; // Coalesced

        // Don't check if out of bounds
        // if(index >= buffer_size) {
        //     break;
        // }

        sum += buffer[index];
    }

    // V0: fails when the work group size is > 64.

    local Type a_scratch_buffer[128];
    a_scratch_buffer[x_local_coordinate] = sum;

    if(x_local_coordinate < (x_local_dimension_size / 2)) {
        for(unsigned long i = x_local_dimension_size / 2; i != 0; i /= 2) {
            barrier(CLK_LOCAL_MEM_FENCE);

            // if(x_local_coordinate < i) {
                // Without this additional check (redundant due to the one before the for loop) we get wrong result when
                // using WGs > 64
                a_scratch_buffer[x_local_coordinate] += a_scratch_buffer[x_local_coordinate + i];
            // }
        }

        if(x_local_coordinate == 0) {
            atomic_add(reduction, a_scratch_buffer[0]);
        }
    }

    // // V1

    // barrier(0);

    // sum = work_group_reduce_add(sum);

    // if(x_local_coordinate == 0) {
    //     atomic_add(reduction, sum);
    // }
}
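
For reference, the kernel is assumed to be launched roughly as follows (a hypothetical host-side sketch; the actual host code is not shown in the question, so the variable names, the buffer size, and the single-work-group launch are assumptions):

// Hypothetical host-side launch; platform, context, queue, program and
// buffer creation are omitted. All names below are placeholders.
size_t local_size  = 128;                 // work group size under test
size_t global_size = local_size;          // assumption: a single work group
cl_ulong buffer_size    = (cl_ulong)global_size * 8; // assumption
cl_ulong sum_per_thread = buffer_size / global_size;

clSetKernelArg(kernel, 0, sizeof(cl_mem),   &buffer);    // global const Type* buffer
clSetKernelArg(kernel, 1, sizeof(cl_ulong), &buffer_size);
clSetKernelArg(kernel, 2, sizeof(cl_mem),   &reduction); // global Type* reduction
clSetKernelArg(kernel, 3, sizeof(cl_ulong), &sum_per_thread);

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                       0, NULL, NULL);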

Solution

  • Actually, the check if(x_local_coordinate < i) really is necessary inside the loop to avoid a data race.

    Let's assume that we have one work group of size 8 and review the execution of the following code for this work group. Please note that this code is executed in parallel by every work item:

    if(x_local_coordinate < (x_local_dimension_size / 2)) {
        for(unsigned long i = x_local_dimension_size / 2; i != 0; i /= 2) {
            barrier(CLK_LOCAL_MEM_FENCE);
            a_scratch_buffer[x_local_coordinate] += a_scratch_buffer[x_local_coordinate + i];
        }
        ...
    }
    

    First of all, the code contains a condition that is true only for the first half of the work items. That means that only the first four work items will execute the loop body at the first iteration, when i equals 8 / 2 = 4:

    [diagram: first iteration (i = 4): the first four work items each add the element 4 positions to the right into their own element]

    At the second iteration, i will equal (8 / 2) / 2 = 2, and the same first four work items will execute the loop body:

    [diagram: second iteration (i = 2): the same four work items each add the element 2 positions to the right into their own element]

    As you can see, the 3rd and 4th elements (a_scratch_buffer[2] and a_scratch_buffer[3]) are updated by the 3rd and 4th work items and, at the same time, read by the 1st and 2nd work items of the work group. This leads to a data race.
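
    To make the conflict concrete, these are the statements that execute concurrently at the second iteration (i == 2; indices are 0-based, as in the code):

    // i == 2: work items 0..3 all pass the outer check and run in parallel
    // work item 0: a_scratch_buffer[0] += a_scratch_buffer[2];  // reads  [2]
    // work item 1: a_scratch_buffer[1] += a_scratch_buffer[3];  // reads  [3]
    // work item 2: a_scratch_buffer[2] += a_scratch_buffer[4];  // writes [2], racing with the read by work item 0
    // work item 3: a_scratch_buffer[3] += a_scratch_buffer[5];  // writes [3], racing with the read by work item 1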

    Now let's review the execution of the code with the additional check:

    if(x_local_coordinate < (x_local_dimension_size / 2)) {
        for(unsigned long i = x_local_dimension_size / 2; i != 0; i /= 2) {
            barrier(CLK_LOCAL_MEM_FENCE);
            if(x_local_coordinate < i) {
                a_scratch_buffer[x_local_coordinate] += a_scratch_buffer[x_local_coordinate + i];
            }
        }
        ...
    }
    

    The first iteration will be the same, because the outer and the inner conditions coincide. But at the second iteration, when i equals 2, the behavior of the work items will be different. In accordance with the additional condition, only two work items (the 1st and the 2nd) will execute the body of the inner if statement:

    [diagram: second iteration with the check (i = 2): only the first two work items execute the update]

    This is how the additional check avoids the data race.
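
    Putting it together, the reduction part of V0 can keep only the inner check; a minimal sketch, assuming a power-of-two work group size. As a side note, the barrier is placed outside of any divergent branch here, because barrier(CLK_LOCAL_MEM_FENCE) must be reached by every work item of the work group:

    local Type a_scratch_buffer[128];
    a_scratch_buffer[x_local_coordinate] = sum;

    for(unsigned long i = x_local_dimension_size / 2; i != 0; i /= 2) {
        // Every work item reaches the barrier; none is filtered out before it.
        barrier(CLK_LOCAL_MEM_FENCE);

        // Only the first i work items fold the upper half onto the lower half,
        // so no element is read and written at the same time.
        if(x_local_coordinate < i) {
            a_scratch_buffer[x_local_coordinate] += a_scratch_buffer[x_local_coordinate + i];
        }
    }

    if(x_local_coordinate == 0) {
        atomic_add(reduction, a_scratch_buffer[0]);
    }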