cudanvidiakepler

Is the nvidia kepler shuffle "destructive"?


I'm using the implementation of the parallel reduction on CUDA using new kepler's shuffle instructions, similar to this: http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/

I was searching for the minima of rows in a given matrix, and in the end of the kernel I had the following code:

my_register = min(my_register, __shfl_down(my_register,8,16));
my_register = min(my_register, __shfl_down(my_register,4,16));
my_register = min(my_register, __shfl_down(my_register,2,16));
my_register = min(my_register, __shfl_down(my_register,1,16));

My blocks are 16*16, so everything worked fine, with that code I was getting minima in two sub-rows in the very same kernel.

Now I also need to return the indices of the smallest elements in every row of my matrix, so I was going to replace "min" with the "if" statement and handle these indices in a similar fashion, I got stuck at this code:

if (my_reg > __shfl_down(my_reg,8,16)){my_reg = __shfl_down(my_reg,8,16);};
if (my_reg > __shfl_down(my_reg,4,16)){my_reg = __shfl_down(my_reg,4,16);};
if (my_reg > __shfl_down(my_reg,2,16)){my_reg = __shfl_down(my_reg,2,16);};
if (my_reg > __shfl_down(my_reg,1,16)){my_reg = __shfl_down(my_reg,1,16);};

No cudaErrors whatsoever, but kernel returns trash now. Nevertheless I have fix for that:

myreg_tmp = __shfl_down(myreg,8,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,4,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,2,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,1,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};

So, allocating new tmp variable to sneak into neighboring registers saves everything for me. Now the question: Are the kepler shuffle instructions destructive ? in a sense that invoking same instruction twice doesn't issue the same result. I haven't assigned anything to those registers saying "my_reg > __shfl_down(my_reg,8,16)" - this adds up to my confusion. Can anyone explain me what is the problem with invoking shuffle twice? I'm pretty much a newbie in CUDA, so detailed explanation for dummies is welcomed


Solution

  • warp shuffle is not destructive. The operation, if repeated under the exact same conditions, will return the same result each time. The var value (myreg in your example) does not get modified by the warp shuffle function itself.

    The problem you are experiencing is due to the fact that the number of participating threads on the second invocation of __shfl_down() in your first method is different than the other invocations, in either method.

    First, let's remind ourselves of a key point in the documentation:

    Threads may only read data from another thread which is actively participating in the __shfl() command. If the target thread is inactive, the retrieved value is undefined.

    Now let's take a look at your first "broken" method:

    if (my_reg > __shfl_down(my_reg,8,16)){my_reg = __shfl_down(my_reg,8,16);};
    

    The first time you call __shfl_down() above (within the if-clause), all threads are participating. Therefore all values returned by __shfl_down() will be what you expect. However, once the if clause is complete, only threads that satisfied the if-clause will participate in the body of the if-statement. Therefore, on the second invocation of __shfl_down() within the if-statement body, only threads for which their my_reg value was greater than the my_reg value of the thread 8 lanes above them will participate. This means that some of these assignment statements probably will not return the value you expect, because the other thread may not be participating. (The participation of the thread 8 lanes above would be dependent on the result of the if comparison done by that thread, which may or may not be true.)

    The second method you propose has no such issue, and works correctly according to your statements. All threads participate in each invocation of __shfl_down().