I need to implement a warp shuffle that looks like this:
(In the picture, the number of threads is limited to 8 to keep it readable.)
From my reading of the Nvidia SDK and the PTX manual, the shuffle instruction should do the job, specifically the shfl.idx.b32 d[|p], a, b, c;
PTX instruction.
From the manual I read:
Each thread in the currently executing warp will compute a source lane
index j based on input operands b and c and the mode. If the computed
source lane index j is in range, the thread will copy the input operand
a from lane j into its own destination register d;
So, by providing proper values of b
and c
, I should be able to do it by writing a function like this (inspired by the CUDA SDK __shfl
primitive implementation):
__forceinline__ __device__ float shuffle(float var){
    float ret;
    int srcLane = ???;
    int c = ???;
    asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=f"(ret) : "f"(var), "r"(srcLane), "r"(c));
    return ret;
}
If it is possible, what are the constants for srcLane
and c
? I am not able to determine them (I am using CUDA 8.0).
Best,
Timocafe
I would recommend doing this with the CUDA intrinsic rather than with PTX (or inline ASM). However, the following code demonstrates both methods:
// cat t54.cu
#include <stdio.h>

__global__ void k(){
    int i = threadIdx.x;
    int j = i;
    if (i<4) j*=2;
    if ((i>3) && (i<8)) j-=(7-i);
    int k = __shfl_sync(0x0FFU, i+100, j);
    printf("lane: %d, result: %d\n", i, k);
}

__forceinline__ __device__ float shuffle(float var, int lane){
    float ret;
    int srcLane = lane;
    int c = 0x1F;
    asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=f"(ret) : "f"(var), "r"(srcLane), "r"(c));
    return ret;
}

__global__ void k1(){
    int i = threadIdx.x;
    int j = i;
    if (i<4) j*=2;
    if ((i>3) && (i<8)) j-=(7-i);
    float k = shuffle((float)(i+100), j);
    printf("lane: %d, result: %f\n", i, k);
}

int main(){
    k<<<1,8>>>();
    cudaDeviceSynchronize();
    k1<<<1,8>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_35 -o t54 t54.cu
$ cuda-memcheck ./t54
========= CUDA-MEMCHECK
lane: 0, result: 100
lane: 1, result: 102
lane: 2, result: 104
lane: 3, result: 106
lane: 4, result: 101
lane: 5, result: 103
lane: 6, result: 105
lane: 7, result: 107
lane: 0, result: 100.000000
lane: 1, result: 102.000000
lane: 2, result: 104.000000
lane: 3, result: 106.000000
lane: 4, result: 101.000000
lane: 5, result: 103.000000
lane: 6, result: 105.000000
lane: 7, result: 107.000000
========= ERROR SUMMARY: 0 errors
$
Using the CUDA intrinsic (the first method), the only real task is to compute the source lane index. Based on your pattern, I wrote some code to do that and put the result in the variable j
.
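One caveat since you mention CUDA 8.0: __shfl_sync was introduced in CUDA 9; on CUDA 8.0 the intrinsic is the older __shfl, which takes no membership mask (it was deprecated in CUDA 9 and removed from later toolkits). A sketch of the first kernel in the CUDA 8.0 spelling, untested on my side against that toolchain:

```cuda
// CUDA 8.0 variant of kernel k: __shfl instead of __shfl_sync.
// __shfl has no mask argument; use it only on pre-CUDA-9 toolchains.
__global__ void k_cuda8(){
    int i = threadIdx.x;
    int j = i;
    if (i<4) j*=2;
    if ((i>3) && (i<8)) j-=(7-i);
    int k = __shfl(i+100, j);
    printf("lane: %d, result: %d\n", i, k);
}
```

The lane-index computation in j is identical; only the intrinsic name and the absence of the mask argument differ.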