cudashuffledouble-precision

CUDA's warp shuffle for double-precision data


A CUDA program should do reduction for double-precision data, I use Julien Demouth's slides named "Shuffle: Tips and Tricks"

the shuffle function is below:

/*for shuffle of double-precision point */
__device__ __inline__ double shfl(double x, int lane)
{
    int warpSize = 32;
    // Split the double number into 2 32b registers.
    int lo, hi;
    asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
    // Shuffle the two 32b registers.
    lo = __shfl_xor(lo,lane,warpSize);
    hi = __shfl_xor(hi,lane,warpSize);
    // Recreate the 64b number.
    asm volatile("mov.b64 %0,{%1,%2};":"=d"(x):"r"(lo),"r"(hi));
    return x;
}

At present, I got the errors below while compiling the program.

ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 71; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 271; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 287; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 302; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 317; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 332; error   : Arguments mismatch for instruction 'mov'
ptxas fatal   : Ptx assembly aborted due to errors
make: *** [csr_double] error 255

Could someone give some advice?


Solution

  • There is a syntax error in the inline assembly instruction for the load of the double argument to 32 bit registers. This:

    asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
    

    should be:

    asm volatile("mov.b64 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
    

    Using a "d" (ie 64 bit floating point register) as the source in a 32 bit load is illegal (and a mov.b32 makes no sense here, the code must load 64 bits to two 32 bit registers).