optimization, cuda, synchronization, gpu, nvidia

What is the relationship between GPU thread occupancy and synchronization stalls?


I am writing a CUDA kernel with an inner loop that looks roughly like this:

for (int i = 0; i < NUM_ITERATIONS; i++)
{
  // read global memory, write shared memory
  __syncthreads();
  // read shared memory, do math
  __syncthreads();
}

For performance, I want to minimize the total amount of time threads spend waiting for other threads to arrive at the barrier synchronization. Will the number of threads per block affect the average amount of time a thread spends waiting at the barrier? Total amount of time all threads spend waiting? What if I have a low occupancy kernel (i.e. lots of registers per thread, low # of threads per block), are there any strategies that can help reduce synchronization waiting in this case?


Solution

  • Will the number of threads per block affect the average amount of time a thread spends waiting at the barrier?

    Likely. More threads per block certainly increases the worst case. Say you run 512 threads per block on an RTX 3080 (compute capability 8.6, up to 1536 resident threads per SM). That gives you 3 blocks, i.e. 48 warps, per SM, so three late warps (one per block) can keep the other 45 warps on that SM waiting at their barriers.

    You should do your own benchmarks, but last time I tested this I ended up using smaller blocks, even though it reduced memory efficiency slightly.
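
    If you want to see how the block size interacts with these limits on your own GPU, the runtime occupancy API reports how many blocks (and therefore warps) of a given size can be resident on one SM. A minimal sketch, assuming a trivial my_kernel placeholder standing in for your real kernel:

    #include <cstdio>

    __global__ void my_kernel() { }   // placeholder for your real kernel

    int main()
    {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);

      // How many blocks of each size fit on one SM for this kernel?
      for (int blockSize = 128; blockSize <= 1024; blockSize *= 2)
      {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel,
                                                      blockSize, 0 /* dynamic smem */);
        printf("block size %4d: %d blocks/SM, %2d of %d warps/SM\n",
               blockSize, blocksPerSM,
               blocksPerSM * blockSize / 32,
               prop.maxThreadsPerMultiProcessor / 32);
      }
      return 0;
    }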

  • Total amount of time all threads spend waiting? What if I have a low occupancy kernel (i.e. lots of registers per thread, low # of threads per block)

    Well, low occupancy will probably reduce the wait time per thread, simply because each thread spends less time waiting for an execution unit to become available. But you need to be careful when you do this: if the remaining warps cannot use the full compute throughput of the GPU or fully hide memory latency, you will lose overall performance.

    Implementing kernels that perform well with low occupancy is tricky. You need a lot of independent computations per thread. Last I checked, the basic matrix multiplication kernel in cuBLAS does this, using more shared memory per thread block than would allow maximum occupancy.
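
    To give a feel for what "a lot of independent computations per thread" means, here is a minimal sketch of the general pattern (not the cuBLAS kernel; the kernel name, buffers and the sum-of-squares computation are made up): each thread keeps several independent accumulators, so the scheduler still has instructions it can issue even when few warps are resident.

    // Sketch of instruction-level parallelism; names and computation are illustrative.
    __global__ void ilp_kernel(const float* in, float* out, int n)
    {
      const int tid    = blockIdx.x * blockDim.x + threadIdx.x;
      const int stride = gridDim.x * blockDim.x;

      // Four independent accumulators: the four multiply-adds below do not
      // depend on each other, so they can be issued back to back even with
      // few resident warps (remainder handling omitted for brevity).
      float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;
      for (int i = tid; i + 3 * stride < n; i += 4 * stride)
      {
        acc0 += in[i]              * in[i];
        acc1 += in[i +     stride] * in[i +     stride];
        acc2 += in[i + 2 * stride] * in[i + 2 * stride];
        acc3 += in[i + 3 * stride] * in[i + 3 * stride];
      }
      out[tid] = acc0 + acc1 + acc2 + acc3;
    }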

  • are there any strategies that can help reduce synchronization waiting in this case?

    You can spend extra shared memory to get rid of one of those barriers by double buffering: when consecutive iterations write to alternating buffers, the barrier that kept one iteration's reads from being overwritten by the next iteration's writes is no longer needed. Instead of doing this:

    // N, NUM_ITERATIONS, load_global() and the index y stand in for the
    // question's actual sizes, loads and shared-memory addressing.
    __global__ void kernel()
    {
      __shared__ float data[N];
      for (int i = 0; i < NUM_ITERATIONS; i++)
      {
        data[threadIdx.x] = load_global();
        __syncthreads();    // wait until every thread has written its element
        float loc = data[y];
        __syncthreads();    // wait until every thread has read before the next write
      }
    }
    

    do this:

    __global__ void kernel()
    {
      __shared__ float data[2][N];                  // two buffers instead of one
      for (int i = 0; i < NUM_ITERATIONS; i++)
      {
        data[i & 1][threadIdx.x] = load_global();   // even iterations write buffer 0, odd ones buffer 1
        __syncthreads();                            // wait until every thread has written this buffer
        float loc = data[i & 1][y];
        // No second barrier needed: the next iteration writes the *other*
        // buffer, so it cannot overwrite data another thread has yet to read.
      }
    }
    

    You can extend this using Asynchronous Data Copies. The CUDA samples show how to do this in the globalToShmemAsyncCopy sample. I think the MatrixMulAsyncCopyMultiStage kernel is what you want.
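
    For reference, here is a minimal sketch of what the asynchronous copy can look like with the cooperative groups API (CUDA 11 or newer). The buffer size, iteration count and addressing are placeholders, and the sample mentioned above uses a full multi-stage cuda::pipeline rather than this simple form:

    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>
    namespace cg = cooperative_groups;

    // N, NUM_ITERATIONS and the addressing are placeholders for this sketch.
    __global__ void kernel(const float* global_in)
    {
      constexpr int N = 256;
      constexpr int NUM_ITERATIONS = 8;
      __shared__ float data[2][N];
      auto block = cg::this_thread_block();

      for (int i = 0; i < NUM_ITERATIONS; i++)
      {
        // Whole-block asynchronous copy from global to shared memory;
        // threads are not blocked while the copy is in flight.
        cg::memcpy_async(block, data[i & 1],
                         global_in + i * N, sizeof(float) * N);

        // ...independent work that does not need the new data can go here...

        cg::wait(block);   // finish the copy; also synchronizes the block
        float loc = data[i & 1][threadIdx.x];
        // ...math on loc...
        (void)loc;
      }
    }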