cuda

How to correctly use __shfl_up_sync for data transfer among only a few lanes


Suppose I need to shuffle some value from lane 29 to 30 and also from 30 to 31; what mask should I use?

Of course the mask should cover lane 30, 31 AND 29 (cuda guide says the source lane also needs to participate); but then the command must be called on lane 29 -- so do I need to continually expand the mask (finally to the full mask in this case) or do I just use 0xE0000000 and let the undefined result be returned for lane 29 (it doesn't need the result anyway)?


Solution

  • The way a __shfl works is that every (partaking) thread presents its own value first and after that every threads takes a value from the pool. So, only the source is given, the dest, is always the asking thread.

    You can model this using shared memory, I find it easier to picture it that way:

    int shared_shfl(int old, int source) {
        //note shfl uses neither shared memory nor registers
        __shared__ int warpdata[32]; //one for every thread in the warp
        //step one: every thread presents its data
        warpdata[threadIdx.x] = old; 
        //step 2 synchronize
        __syncwarp(); 
        //step 3: get your data. 
        //Every thread can choose a different source, dest is always self.
        const auto new = warpdata[source]; 
        return new;
    }
     
    

    OK, let's list your requirements

    29->29  //dest 29, offset = 0
    29->30  //dest 30, offset = 1
    30->31  //dest 31, offset = 1
    

    It is perfectly valid for different threads to have a different offset.

    The code will look like this:

    #include <stdio.h>
    #include <cuda.h>
    
    
    __global__ void shfl_29() {
        auto dummydata = threadIdx.x % 32;
        const auto activemask = 0b111u << 29;
        const auto laneid = threadIdx.x % 32;
        const auto offset = int(laneid > 29); //29 = 0, else 1
        //the if statement must match the activemask, or your code will A: deadlock or B: not work.
        if (laneid >= 29) { 
            dummydata = __shfl_up_sync(activemask, dummydata, offset);
            printf("old: %i, new: %i\n", laneid, dummydata);
        }
        
    }
    
    int main() {
        shfl_29<<<1,256>>>();
        cudaDeviceSynchronize();   
    }
    
    

    You can see it in action on Godbolt: https://cuda.godbolt.org/z/6E9bGbx1f

    However, you need to be aware that using anything other than -1 as the activemask makes the shfl (and really any sync instruction) much slower.

    A faster way to do this would be:

    __global__ void shfl_29() {
        auto dummydata = threadIdx.x % 32;
        constexpr auto all = -1u;
        const auto laneid = threadIdx.x % 32;
        const auto offset = int(laneid > 29); //<=29? = 0, else 1
        dummydata = __shfl_up_sync(all, dummydata, offset);
        printf("old: %i, new: %i\n", laneid, dummydata);
    }
    
    

    If you time this, you'll see that the latter version runs about 10x faster, see: https://cuda.godbolt.org/z/c6q66P94h