Tags: cuda, synchronization, inline-assembly, barrier, ptx

CUDA: how to use barrier.sync


I have read the Parallel Thread Execution ISA section "Parallel Synchronization and Communication Instructions: bar, barrier", which details the PTX synchronization instructions.

  1. It says there are 16 "barrier logical resources", and you can specify which barrier to use with the parameter a. What is a barrier logical resource?

  2. I have a piece of code from an outside source, which I know works. However, I cannot understand the syntax used inside asm and what "memory" does. I assume name replaces %0 and numThreads replaces %1, but what is "memory" and what are the colons doing?

     __device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
         asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
     }
    
  3. In a block of 256 threads, I only want threads 64 ~ 127 to synchronize. Is this possible with the barrier.sync function?

    For example, say I have a grid of 1 block, and a block of 256 threads. We split the block into 3 conditional branches such that threads 0 ~ 63 go into kernel 1, threads 64 ~ 127 go into kernel 2, and threads 128 ~ 255 go into kernel 3. I want the threads in kernel 2 to synchronize only among themselves. So if I use the namedBarrierSync function defined above, namedBarrierSync(1, 64): does it synchronize only threads 64 ~ 127, or threads 0 ~ 63?

  4. I have tested with the code below (assume that gpuAssert() is an error-checking function defined somewhere in the file).

Here is the code:

__global__ void test(int num_threads) 
{
    if (threadIdx.x >= 64 && threadIdx.x < 128) 
    {
        namedBarrierSync(0, num_threads) ;
    }
    __syncthreads();
}

int main(void) 
{
    test<<<1, 1, 256>>>(128);
    gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE__);
    printf("complete\n");
    return 1;
}

Solution

    1. "barrier logical resource" are the hardware necessary to synchronize threads/warps in a thread block (probably atomic counters etc.). You don't need to know the actual hardware implementation to program them, it is sufficient to know there are 16 instances of them available.
    2. As Robert Crovella has pointed out in your cross-post on the Nvidia forum, the documentation for inline PTX is at https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html. In short, the statement uses the GCC extended-asm form asm("template" : outputs : inputs : clobbers): the empty first colon section means there are no output operands, "r"(name) and "r"(numThreads) are input operands placed in registers and substituted for %0 and %1, and the "memory" clobber tells the compiler that the instruction may read or write memory, so it must not cache values in registers or reorder memory accesses across the barrier (see the annotated version after this list).
    3. barrier.sync with a named barrier and a thread count of 64 synchronizes the first two warps arriving at the named barrier (for compute capability up to 6.x) or the first 64 threads arriving at the named barrier (for compute capability 7.0 onwards).
    4. Your test only launches a single thread (with 256 bytes of dynamic shared memory allocated to it), which makes any test of synchronisation instructions moot. You want to launch the test kernel as test<<<1, 256>>>(128); instead; a corrected sketch follows this list.
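
For reference, here is the same namedBarrierSync() from the question with each colon-separated section of the asm statement labelled. The comments are the only addition; the behaviour is unchanged:

    __device__ __forceinline__ void namedBarrierSync(int name, int numThreads)
    {
        asm volatile(
            "bar.sync %0, %1;"            // PTX template; %0 and %1 are replaced by the operands below
            :                             // output operands: none
            : "r"(name),                  // input %0: barrier name, passed in a register ("r")
              "r"(numThreads)             // input %1: number of threads expected at the barrier
            : "memory");                  // clobber: the barrier affects memory ordering, so the
                                          // compiler must not reorder or cache memory accesses across it
    }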
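
Putting 3. and 4. together, below is a minimal sketch of a corrected test. It deviates from the question's code in two ways: it launches 256 threads as suggested in 4., and it uses named barrier 1 with a count of 64 (the scenario from question 3) rather than barrier 0, since __syncthreads() itself maps to barrier 0. The gpuAssert() here is only a stand-in for the one assumed in the question.

    #include <cstdio>

    // Stand-in for the error-checking helper assumed in the question.
    void gpuAssert(cudaError_t code, const char *file, int line)
    {
        if (code != cudaSuccess)
            fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    }

    __device__ __forceinline__ void namedBarrierSync(int name, int numThreads)
    {
        asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
    }

    __global__ void test(int num_threads)
    {
        if (threadIdx.x >= 64 && threadIdx.x < 128)
        {
            // Only these 64 threads (two warps) arrive at named barrier 1;
            // it releases once num_threads of them have arrived.
            namedBarrierSync(1, num_threads);
        }
        __syncthreads();   // block-wide barrier for all 256 threads
    }

    int main(void)
    {
        test<<<1, 256>>>(64);   // one block of 256 threads, no dynamic shared memory needed
        gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE__);
        printf("complete\n");
        return 0;
    }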