I have read Parallel Thread Execution ISA: Parallel Synchronization and Communication Instructions: bar, barrier, which details the PTX synchronization instructions. It says there are 16 "barrier logical resources", and that you can specify which barrier to use with the parameter a. What is a barrier logical resource?
I have a piece of code from an outside source, which I know works. However, I cannot understand the syntax used inside asm and what "memory" does. I assume name replaces %0 and numThreads replaces %1, but what is "memory" and what are the colons doing?
__device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
    asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
}
In a block of 256 threads, I only want threads 64 ~ 127 to synchronize. Is this possible with the barrier.sync function?
For example, say I have a grid of 1 block with 256 threads per block. We split the block into 3 conditional branches such that threads 0 ~ 63 go into kernel1, threads 64 ~ 127 go into kernel2, and threads 128 ~ 255 go into kernel3. I want the threads in kernel2 to synchronize only among themselves. So if I use the namedBarrierSync function defined above, namedBarrierSync(1, 64), does it synchronize only threads 64 ~ 127, or threads 0 ~ 63?
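In code, the split I have in mind looks roughly like this (kernel1/kernel2/kernel3 just stand in for whatever work each branch does, and the barrier id 1 is an arbitrary choice):

__global__ void splitWork(void)
{
    if (threadIdx.x < 64)
    {
        // "kernel1" work for threads 0 ~ 63
    }
    else if (threadIdx.x < 128)
    {
        // "kernel2" work for threads 64 ~ 127
        namedBarrierSync(1, 64);   // should this wait on threads 64 ~ 127 only?
    }
    else
    {
        // "kernel3" work for threads 128 ~ 255
    }
}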
I have tested with the code below (assume that gpuAssert() is an error-checking function defined somewhere in the file).
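For reference, gpuAssert() is roughly the standard CUDA error-check helper (the exact definition in my file may differ slightly):

#include <cstdio>
#include <cstdlib>

inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit((int)code);
    }
}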
Here is the code:
__global__ void test(int num_threads)
{
    if (threadIdx.x >= 64 && threadIdx.x < 128)
    {
        namedBarrierSync(0, num_threads);
    }
    __syncthreads();
}
int main(void)
{
    test<<<1, 1, 256>>>(128);
    gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE__);
    printf("complete\n");
    return 1;
}
barrier.sync with a named barrier and a thread count of 64 synchronizes the first two warps arriving at the named barrier (for compute capability up to 6.x), or the first 64 threads arriving at the named barrier (for compute capability 7.0 onwards).
Note, however, that your launch test<<<1, 1, 256>>>(128) starts a block of only one thread (the 256 is the dynamic shared memory size, not the block size), so no thread ever takes the branch containing the barrier. You want test<<<1, 256>>>(128); instead.
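Putting both points together, here is an untested sketch of how I would write the test, using the namedBarrierSync() from your question. I pass 64 rather than 128 as the barrier's thread count, since only 64 threads ever reach that barrier, and I use barrier 1 so it stays separate from barrier 0, which __syncthreads() maps to; those two adjustments are mine, not something spelled out above.

#include <cstdio>

// namedBarrierSync() and gpuAssert() as defined earlier in the question

__global__ void test(void)
{
    if (threadIdx.x >= 64 && threadIdx.x < 128)
    {
        // 64 threads (two full warps) arrive at barrier 1 with a matching
        // thread count of 64; since no other thread ever uses barrier 1,
        // the synchronized threads are exactly threads 64 ~ 127.
        namedBarrierSync(1, 64);
    }
    __syncthreads();   // barrier 0: all 256 threads of the block
}

int main(void)
{
    test<<<1, 256>>>();   // 1 block of 256 threads, no dynamic shared memory
    gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE__);
    printf("complete\n");
    return 0;
}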