ccudareducegpu-shared-memory

CUDA: Using grid-strided loop with reduction in shared memory


I have the following question concerning usage of grid-strided loops and optimized reduction algorithms in shared memory together in CUDA kernels. Imagine that you have 1D array with number of element more than threads in the grid (BLOCK_SIZE * GRID_SIZE). In this case you will write the kernel of this kind:

#define BLOCK_SIZE (8)
#define GRID_SIZE (8)
#define N (2000)

// ...

__global__ void gridStridedLoop_kernel(double *global_1D_array)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int i;
    
    // N is a total number of elements in the global_1D_array array
    for (i = idx; i < N; i += blockDim.x * gridDim.x)
    {
        // Do smth...
    }
}

Now you want to look for maximum element in the global_1D_array using reduction in shared memory and the above kernel will be look like this one:

#define BLOCK_SIZE (8)
#define GRID_SIZE (8)
#define N (2000)

// ...

__global__ void gridStridedLoop_kernel(double *global_1D_array)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int i;
    
    // Initialize shared memory array for the each block
    __shared__ double data[BLOCK_SIZE];
    
    // N is a total number of elements in the global_1D_array array
    for (i = idx; i < N; i += blockDim.x * gridDim.x)
    {
        // Load data from global to shared memory
        data[threadIdx.x] = global_1D_array[i];
        __syncthreads();
        
        // Do reduction in shared memory ...
    }
    
    // Copy MAX value for each block into global memory
}

It is clear that some values in the data will be overwritten, i.e. you need longer shared memory array or have to organize the kernel in another way. What is the best (most efficient) way to use reduction in shared memory and the strided loop together?

Thanks in advance.


Solution

  • A reduction using a grid strided loop is documented here. Referring to slide 32, the grid-strided loop looks like this:

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n){
      sdata[tid] += g_idata[i] + g_idata[i+blockSize];
      i += gridSize;
      }
    __syncthreads();
    

    Note that each iteration of the while-loop increases the index by gridSize, and this while-loop will continue until the index (i) exceeds the (global) data size (n). We call this a grid-strided loop. In this example, the remainder of the threadblock-local reduction operation is not impacted by grid-size looping, thus only the "front-end" is shown. This particular reduction is doing a sum-reduction, but a max-reduction would simply replace the operation with something like:

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize + threadIdx.x;
    unsigned int gridSize = blockSize*gridDim.x;
    sdata[tid] = 0;
    while (i < n){
      sdata[tid] = (sdata[tid] < g_idata[i]) ? + g_idata[i]:sdata[tid];
      i += gridSize;
      }
    __syncthreads();
    

    And the remainder of the threadblock level reduction would have to be modified in a similar fashion, replacing the summing operation with a max-finding operation.

    The full parallel reduction CUDA sample code is available as part of any full cuda samples install, or here.