
Reduction in CUDA


I'm just starting to learn CUDA programming, and I have some confusion about reduction.

I know that global memory has much higher access latency than shared memory, but can I use global memory to (at least) simulate behavior similar to shared memory?

For example, I want to sum the elements of a large array whose length is exactly BLOCK_SIZE * THREAD_SIZE (both the grid and block dimensions are powers of 2), and I've tried to use the code below:

    __global__ void parallelSum(unsigned int* array) {
        unsigned int totalThreadsNum = gridDim.x * blockDim.x;
        unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;

        int i = totalThreadsNum / 2;
        while (i != 0) {
            if (idx < i) {
                array[idx] += array[idx + i];
            }
            __syncthreads();
            i /= 2;
        }
    }

I compared the result of this code against a serial sum computed on the host, and the weird thing is that sometimes the results match, but sometimes they are clearly different. Is there something about my use of global memory that causes this?


Solution

  • Your best bet is to start with the reduction example in the CUDA Samples. The scan example is also good for learning the principles of parallel computing on a throughput architecture.

    That said, if you actually just want to use a reduction operator in your code then you should look at Thrust (calling from host, cross-platform) and CUB (CUDA GPU specific).
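
    For illustration, here is a minimal sketch of both library approaches, summing n unsigned integers on the device. The names d_vec, d_in, d_out, d_temp and n are illustrative, not from your post:

        // Minimal sketch: summing device data with Thrust and with CUB.
        #include <cstdio>
        #include <thrust/device_vector.h>
        #include <thrust/functional.h>
        #include <thrust/reduce.h>
        #include <cub/cub.cuh>

        int main() {
            const unsigned int n = 1 << 20;
            thrust::device_vector<unsigned int> d_vec(n, 1u);  // n ones

            // Thrust: a single host-side call, no kernel code needed.
            unsigned int sum = thrust::reduce(d_vec.begin(), d_vec.end(),
                                              0u, thrust::plus<unsigned int>());
            printf("thrust sum = %u\n", sum);

            // CUB: query the required scratch-space size first, then reduce.
            unsigned int* d_in = thrust::raw_pointer_cast(d_vec.data());
            unsigned int* d_out;
            cudaMalloc(&d_out, sizeof(unsigned int));

            void*  d_temp     = nullptr;
            size_t temp_bytes = 0;
            cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);  // size query
            cudaMalloc(&d_temp, temp_bytes);
            cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);  // reduction

            unsigned int cubSum;
            cudaMemcpy(&cubSum, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);
            printf("cub sum = %u\n", cubSum);

            cudaFree(d_temp);
            cudaFree(d_out);
            return 0;
        }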

    To look at your specific questions:

    • There's no reason you can't use global memory for a reduction: the reduction example in the toolkit walks through several levels of optimisation, and in every version the data starts in global memory.

    • Your code is inefficient compared to that example, since after the first few iterations most threads have no work left to do.

    • Your code tries to communicate between threads in different blocks, but __syncthreads() only synchronises the threads within a single block, never across blocks (which would be impossible in general, since the GPU is usually oversubscribed and not all blocks are resident at the same time).

    The last point is the most important. If a thread in block X wants to read data that is written by block Y, then you need to break the work across two kernel launches. That's why a typical parallel reduction takes a multi-phase approach: reduce batches within blocks, then reduce between the batches, as sketched below.
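
    Here is a minimal sketch of the per-block phase, assuming a power-of-2 block size; blockSum, in, out and n are illustrative names, not taken from your question:

        #define BLOCK_SIZE 256

        __global__ void blockSum(const unsigned int* in,
                                 unsigned int* out,
                                 unsigned int n) {
            __shared__ unsigned int sdata[BLOCK_SIZE];

            unsigned int tid = threadIdx.x;
            unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

            // Each thread loads one element into shared memory
            // (padding with 0 past the end of the input).
            sdata[tid] = (idx < n) ? in[idx] : 0u;
            __syncthreads();

            // Tree reduction within the block; __syncthreads() is valid
            // here because all cooperating threads are in the same block.
            for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
                if (tid < s) {
                    sdata[tid] += sdata[tid + s];
                }
                __syncthreads();
            }

            // Thread 0 publishes this block's partial sum.
            if (tid == 0) {
                out[blockIdx.x] = sdata[0];
            }
        }

    The host launches blockSum once to produce one partial sum per block, then runs it again over those partial sums (or copies them back and finishes on the host). The kernel launch boundary provides the grid-wide synchronisation that __syncthreads() cannot.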