
Compare in CUDA without branching


I am trying to implement the following function in CUDA:

int compare(unsigned a, unsigned b) {
    if (a == b) {
        return 0;
    } else {
        if (a < b) return -1;
        else return 1;
    }
}

I am currently using a fairly naive macro:

#define CMP(X, Y) (((X) == (Y)) ? 0 : (((X) < (Y)) ? -1 : 1))

but I am wondering if it's causing divergence due to the branching. Is there any better way to implement this function in CUDA?


Solution

  • You could use a branch-less equivalent, that is:

    (a > b) - (a < b)
    

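    Each comparison evaluates to 0 or 1, so the difference is -1, 0, or 1 with no conditional control flow. This avoids potential warp divergence.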

    With your macro, the nvcc compiler may eliminate the divergence anyway by using branch predication. Even with predication, though, some threads in a warp may be inactive. You can check this in the Thread Execution Efficiency column of the Nsight Visual Studio Edition profiler for the particular statement in your code.
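
    As a rough sketch, the branch-less expression can be wrapped in a function next to the original version for comparison. The names `HD`, `cmp_branchless`, and `cmp_reference` are made up for this illustration and are not from the question; the `__CUDACC__` guard is only there on the assumption that you want the same definition to build as plain C and under nvcc.

```c
/* HD is a hypothetical helper macro: under nvcc it marks the functions as
   callable from both host and device code; in plain C it expands to nothing. */
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

/* Branch-less three-way compare: each relational operator yields 0 or 1,
   so the result is 1 - 0, 0 - 0, or 0 - 1. */
HD static inline int cmp_branchless(unsigned a, unsigned b)
{
    return (a > b) - (a < b);
}

/* The version from the question, kept as a reference to check against. */
HD static inline int cmp_reference(unsigned a, unsigned b)
{
    if (a == b) return 0;
    return (a < b) ? -1 : 1;
}
```

    Both functions should return the same value for every pair of inputs. Whether the branch-less form is actually faster depends on what the compiler generates for each, so it is worth profiling both.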