parallel-processing gpgpu hlsl compute-shader directcompute

Warp threads not SIMD synchronous


I'm working through the parallel reduction example from NVIDIA. If tid < 32, the threads are all meant to be in the same warp, so the instructions are supposed to be SIMD synchronous. We should therefore be able to assume that sdata[tid] += sdata[tid + 32]; completes for all threads before sdata[tid] += sdata[tid + 16]; starts, and so on. But this is not happening for me.

// Barrier-synchronized reduction down to the last 64 elements.
for (unsigned int s = groupDim_x / 2; s > 32; s >>= 1) 
{ 
    if (tid < s) sdata[tid] += sdata[tid + s]; 
    GroupMemoryBarrierWithGroupSync(); 
}
// Last 32 elements: no barriers, relying on the warp executing in lockstep.
if (tid < 32)
{ 
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8]; 
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1]; 
}

A solution to the same problem in CUDA has already been posted (see), but it uses pointers and the volatile keyword. DirectCompute doesn't have pointers and doesn't allow the volatile keyword on global memory.


Solution

  • DirectCompute doesn't have pointers and doesn't allow the volatile keyword on global memory.

    Indeed, but it exposes comparable functionality through intrinsic functions. Replace the += in your loop with the InterlockedAdd intrinsic and see what happens. Note, however, that InterlockedAdd only works on integers.
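
    For illustration, here is a minimal sketch of that suggestion, assuming integer data and a 256-thread group; THREADS, BufIn, BufOut, and CSMain are placeholder names, not from the original post:

    #define THREADS 256

    StructuredBuffer<int>   BufIn  : register(t0);
    RWStructuredBuffer<int> BufOut : register(u0);

    groupshared int sdata[THREADS];

    [numthreads(THREADS, 1, 1)]
    void CSMain(uint3 Gid  : SV_GroupID,
                uint  tid  : SV_GroupIndex,
                uint3 DTid : SV_DispatchThreadID)
    {
        // Load one element per thread into group shared memory.
        sdata[tid] = BufIn[DTid.x];
        GroupMemoryBarrierWithGroupSync();

        // Barrier-synchronized reduction down to the last 64 elements.
        for (uint s = THREADS / 2; s > 32; s >>= 1)
        {
            if (tid < s) sdata[tid] += sdata[tid + s];
            GroupMemoryBarrierWithGroupSync();
        }

        // Final steps: InterlockedAdd commits each partial sum straight to
        // group shared memory, so the compiler cannot keep sdata[tid]
        // cached in a register across the unrolled steps.
        if (tid < 32)
        {
            InterlockedAdd(sdata[tid], sdata[tid + 32]);
            InterlockedAdd(sdata[tid], sdata[tid + 16]);
            InterlockedAdd(sdata[tid], sdata[tid +  8]);
            InterlockedAdd(sdata[tid], sdata[tid +  4]);
            InterlockedAdd(sdata[tid], sdata[tid +  2]);
            InterlockedAdd(sdata[tid], sdata[tid +  1]);
        }

        // Thread 0 writes this group's partial sum.
        if (tid == 0) BufOut[Gid.x] = sdata[0];
    }

    If your data is float, InterlockedAdd is not an option; the straightforward fallback is to let the barriered loop run all the way down to s = 1 (or insert GroupMemoryBarrierWithGroupSync between the unrolled steps), trading extra barriers for correctness.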