I'm trying to learn CUDA and I'm having trouble wrapping my head around warps and the lanes within them. A lane is the word I use for a thread inside a warp. I can exchange data between lanes as follows:
int lane_id = threadIdx.x & 31;           // lane index within the warp
if(lane_id != 0 && lane_id != 1){return;} // only lanes 0 and 1 participate
int val;
if(lane_id == 0) val = 42;
if(lane_id == 1) val = 123;
// mask 0b11 = lanes 0 and 1; lane_id ^ 1 selects the partner lane
val = __shfl_sync(0b11, val, lane_id ^ 1);
if(lane_id == 0) assert(val == 123);
if(lane_id == 1) assert(val == 42);
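For context, the snippet can be exercised from a minimal kernel like this (the kernel name and launch configuration are mine, just for illustration):

```cuda
#include <cassert>

// Minimal repro: only lanes 0 and 1 of the first warp do any work.
__global__ void exchange_kernel() {
    int lane_id = threadIdx.x & 31;
    if (lane_id != 0 && lane_id != 1) return;
    int val = (lane_id == 0) ? 42 : 123;
    // Mask 0b11 covers lanes 0 and 1; lane_id ^ 1 reads the partner lane.
    val = __shfl_sync(0b11, val, lane_id ^ 1);
    assert(val == (lane_id == 0 ? 123 : 42));
}

int main() {
    exchange_kernel<<<1, 32>>>();  // one full warp
    cudaDeviceSynchronize();
}
```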
I'm wondering whether the call to __shfl_sync needs to be a single call, or whether it can also be split over two calls like this:
int lane_id = threadIdx.x & 31;
if(lane_id != 0 && lane_id != 1){return;}
int val;
if(lane_id == 0) val = __shfl_sync(0b11, 42, 1);
if(lane_id == 1) val = __shfl_sync(0b11, 123, 0);
if(lane_id == 0) assert(val == 123);
if(lane_id == 1) assert(val == 42);
The answer is: it depends on the age of the architecture. The PTX documentation describes this in detail:
shfl.sync will cause executing thread to wait until all non-exited threads corresponding to membermask have executed shfl.sync with the same qualifiers and same membermask value before resuming execution.

[…]

For target sm_6x or below, all threads in membermask must execute the same shfl.sync instruction in convergence, and only threads belonging to some membermask can be active when the shfl.sync instruction is executed. Otherwise, the behavior is undefined.
Therefore, for compute capability 7.x and above – essentially every architecture with independent thread scheduling – it is fine if the __shfl_sync calls are issued from different branches. On older architectures, all threads in the mask must execute the same shfl.sync instruction in convergence, or have exited.
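For code that must also run on those older architectures, the branched version can be rewritten so that every participating lane issues one and the same __shfl_sync – a sketch, condensing the first snippet:

```cuda
// Portable variant: the value differs per lane, but the shuffle itself
// is a single instruction that lanes 0 and 1 execute in convergence.
int val = (lane_id == 0) ? 42 : 123;
val = __shfl_sync(0b11, val, lane_id ^ 1);  // lane 0 receives 123, lane 1 receives 42
```

The per-lane branching is moved into the operand (the ternary), so the shfl.sync instruction itself satisfies the pre-sm_70 convergence requirement.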