Following these slides https://www.nvidia.com/content/GTC-2010/pdfs/2260_GTC2010.pdf, when doing a parallel reduction you should avoid GroupMemoryBarrierWithGroupSync when reducing the last 2*WaveGetLaneCount() elements:
for (unsigned int s = groupDim_x / 2; s > 32; s >>= 1)
{
    if (tid < s) sdata[tid] += sdata[tid + s];
    GroupMemoryBarrierWithGroupSync();
}
if (tid < 32)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
But the problem with this is that the compiler doesn't write the intermediate results back to shared memory; it keeps the running sum in a register and essentially does:
if (tid < 32)
{
    float ldata = sdata[tid];
    ldata += sdata[tid + 32];
    ldata += sdata[tid + 16];
    ldata += sdata[tid + 8];
    ldata += sdata[tid + 4];
    ldata += sdata[tid + 2];
    ldata += sdata[tid + 1];
    sdata[tid] = ldata;
}
https://hlsl.godbolt.org/z/9o4r4nvnc
How to fix? One way is to prefix everything with if (tid < s), as in the sketch below. Is there anything less hacky?
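For the unrolled part that would look something like this (a sketch, with the literal offsets standing in for s):

// Guarded version of the last-wave reduction: each step only lets the
// threads that still participate update sdata, so each thread's final
// store can no longer be folded into a register-only accumulation.
if (tid < 32) sdata[tid] += sdata[tid + 32];
if (tid < 16) sdata[tid] += sdata[tid + 16];
if (tid < 8)  sdata[tid] += sdata[tid + 8];
if (tid < 4)  sdata[tid] += sdata[tid + 4];
if (tid < 2)  sdata[tid] += sdata[tid + 2];
if (tid < 1)  sdata[tid] += sdata[tid + 1];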
With Shader Model 6, you should use the wave intrinsics instead, such as WaveReadLaneAt, WaveGetLaneIndex, and WaveActiveSum.
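For example, the whole group reduction can be written so that only the cross-wave step touches groupshared memory and needs a barrier. This is only a minimal sketch; the thread count, buffer names (gInput, gOutput), and the cross-wave combine step are assumptions, not part of the question:

#define THREADS 256

// Wave size is at least 4 in SM6, so THREADS / 4 partial sums always suffice.
groupshared float gPartialSums[THREADS / 4];

StructuredBuffer<float>   gInput  : register(t0);
RWStructuredBuffer<float> gOutput : register(u0);

[numthreads(THREADS, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID,
            uint  gi   : SV_GroupIndex,
            uint3 gid  : SV_GroupID)
{
    // Each wave reduces its own values without touching groupshared memory.
    float waveSum = WaveActiveSum(gInput[dtid.x]);

    // The first lane of every wave publishes the wave's partial sum.
    uint waveIndex = gi / WaveGetLaneCount();
    if (WaveGetLaneIndex() == 0)
        gPartialSums[waveIndex] = waveSum;

    // The only barrier left: make the partial sums visible across waves.
    GroupMemoryBarrierWithGroupSync();

    // Thread 0 combines the per-wave partial sums and writes the group result.
    if (gi == 0)
    {
        uint numWaves = THREADS / WaveGetLaneCount();
        float total = 0.0f;
        for (uint i = 0; i < numWaves; ++i)
            total += gPartialSums[i];
        gOutput[gid.x] = total;
    }
}

WaveActiveSum already does the intra-wave combining that the unrolled sdata code was doing by hand, so there is no unsynchronised groupshared traffic left to be optimised away.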
The simple but sad answer is that this GTC talk is simply outdated and relied on undefined behaviour.
The HLSL specs state:

"threadgroup memory is denoted in HLSL with the groupshared keyword. [...] Reads and writes to threadgroup memory may occur in any order except as restricted by synchronization intrinsics or other memory annotations."

(https://microsoft.github.io/hlsl-specs/specs/hlsl.html)
So without synchronisation through group barriers or other memory annotations, the compiler is free to do this sort of optimisation. Older dxc versions might not have done it; however, even if dxc doesn't perform this kind of optimisation, the driver might still do it when compiling the pipeline (the AMD driver in particular likes to).
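If you want to stay with plain groupshared memory, the defined-behaviour version of the question's code simply keeps the barrier for every step instead of special-casing the last wave (a sketch reusing the question's tid, sdata and groupDim_x):

for (unsigned int s = groupDim_x / 2; s > 0; s >>= 1)
{
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    // Executed by the whole group every iteration, so the stores above are
    // guaranteed to be visible to the reads of the next iteration.
    GroupMemoryBarrierWithGroupSync();
}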