Following these slides https://www.nvidia.com/content/GTC-2010/pdfs/2260_GTC2010.pdf, when doing a parallel reduction you can skip GroupMemoryBarrierWithGroupSync while reducing the last 2*WaveGetLaneCount() elements, since those threads all execute within a single wave:
for (unsigned int s = groupDim_x / 2; s > 32; s >>= 1)
{
    if (tid < s) sdata[tid] += sdata[tid + s];
    GroupMemoryBarrierWithGroupSync();
}
if (tid < 32)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
But the problem with this is that the compiler doesn't write the intermediate sums back to shared memory; it keeps them in a register, so the other threads in the wave never see the partial results. It essentially compiles to:
if (tid < 32)
{
    float ldata = sdata[tid];
    ldata += sdata[tid + 32];
    ldata += sdata[tid + 16];
    ldata += sdata[tid + 8];
    ldata += sdata[tid + 4];
    ldata += sdata[tid + 2];
    ldata += sdata[tid + 1];
    sdata[tid] = ldata;
}
https://hlsl.godbolt.org/z/9o4r4nvnc
How can this be fixed? One way is to prefix every step with an if (tid < s) guard. Is there anything less hacky?
With Shader Model 6, you should use the wave intrinsics instead: WaveActiveSum, together with WaveReadLaneAt and WaveGetLaneIndex, lets you finish the reduction without touching shared memory at all.
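For example, here is a minimal sketch of a reduction that finishes with WaveActiveSum instead of the barrier-free tail. It assumes a hypothetical group size of 256, that WaveGetLaneCount() divides the group size, and that threads are packed into waves in SV_GroupIndex order (typical in practice, though not strictly guaranteed by the spec):

#define GROUP_SIZE 256
groupshared float sdata[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint tid : SV_GroupIndex)
{
    // ... fill sdata[tid], then reduce with barriers until one wave's
    // worth of partial sums remains ...
    uint lanes = WaveGetLaneCount();
    for (uint s = GROUP_SIZE / 2; s >= lanes; s >>= 1)
    {
        if (tid < s) sdata[tid] += sdata[tid + s];
        GroupMemoryBarrierWithGroupSync();
    }

    // The first wave finishes the reduction in registers: WaveActiveSum
    // adds the value across all active lanes of the wave, so no further
    // shared-memory traffic or barriers are needed.
    // (Assumes tid 0..lanes-1 form one wave; see the caveat above.)
    if (tid < lanes)
    {
        float total = WaveActiveSum(sdata[tid]);
        if (WaveGetLaneIndex() == 0)
            sdata[0] = total;
    }
}

Because the tail runs entirely in registers, the ordering of shared-memory accesses no longer matters for those last steps.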
The simple, but sad, answer is that this GTC talk is simply outdated and relied on undefined behaviour.
The HLSL specs state:
Threadgroup Memory is denoted in HLSL with the groupshared keyword. [...] Reads and writes to Threadgroup Memory, may occur in any order except as restricted by synchronization intrinsics or other memory annotations.
(https://microsoft.github.io/hlsl-specs/specs/hlsl.html)
So without synchronisation through group barriers or other memory annotations, the compiler is free to perform this sort of optimisation. Older dxc versions might not have done it; however, even if dxc doesn't, the optimisation might still be applied by the driver when creating the pipeline (the AMD driver in particular likes to do this).
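If you would rather not rely on wave intrinsics, the straightforward well-defined fix is to keep the barrier in every iteration, all the way down to s = 1. A sketch reusing the question's sdata, tid and groupDim_x:

// Fully synchronised loop: one barrier per iteration, but well-defined,
// since every shared-memory read is ordered by a preceding barrier.
for (unsigned int s = groupDim_x / 2; s > 0; s >>= 1)
{
    if (tid < s) sdata[tid] += sdata[tid + s];
    GroupMemoryBarrierWithGroupSync();
}
// sdata[0] now holds the sum for the whole group.

This costs a handful of extra barriers per group, which is usually a small price compared to depending on undefined behaviour.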