c++cudathread-synchronizationwarp

Can threads in a warp synchronize with different calls to __shfl_sync?


I'm trying to learn Cuda and I'm having trouble to wrap my head around warps and lanes in them. A lane is the word I use for threads inside of the same warp. I can exchange data between lanes as follows:

int lane_id = threadIdx.x & 31;
if(lane_id != 0 && lane_id != 1){return;}

int val;
if(lane_id == 0) val = 42;
if(lane_id == 1) val = 123;

val = __shfl_sync(0b11, val, lane_id ^ 1);

if(lane_id == 0) assert(val == 123);
if(lane_id == 1) assert(val == 42);

I'm wondering whether the call to __shfl_sync needs to be a single call or whether it can also be split over two calls like this:

int lane_id = threadIdx.x & 31;
if(lane_id != 0 && lane_id != 1){return;}

int val;
if(lane_id == 0) val = __shfl_sync(0b11, 42, 1);
if(lane_id == 1) val = __shfl_sync(0b11, 123, 0);

if(lane_id == 0) assert(val == 123);
if(lane_id == 1) assert(val == 42);

Solution

  • The answer is, it depends on the age of the architecture. The PTX documentation describes it in detail:

    shfl.sync will cause executing thread to wait until all non-exited threads corresponding to membermask have executed shfl.sync with the same qualifiers and same membermask value before resuming execution.
    […]
    For target sm_6x or below, all threads in membermask must execute the same shfl.sync instruction in convergence, and only threads belonging to some membermask can be active when the shfl.sync instruction is executed. Otherwise, the behavior is undefined.

    Therefore, for 7.x and above – essentially every architecture with independent thread scheduling – it is fine if instructions in different branches are used. For older architectures, all threads need to run the same instruction or have exited.