Following these slides https://www.nvidia.com/content/GTC-2010/pdfs/2260_GTC2010.pdf, when doing a parallel reduction you can skip GroupMemoryBarrierWithGroupSync while reducing the last 2*WaveGetLaneCount() elements, since those threads all execute within a single wave:
for (unsigned int s = groupDim_x / 2; s > 32; s >>= 1)
{
    if (tid < s) sdata[tid] += sdata[tid + s];
    GroupMemoryBarrierWithGroupSync();
}
if (tid < 32)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
But the problem with this is that the compiler doesn't write the intermediate sums back to shared memory; it keeps them in a register, so the other threads in the wave never see the partial results. It essentially compiles to:
if (tid < 32)
{
    float ldata = sdata[tid];
    ldata += sdata[tid + 32];
    ldata += sdata[tid + 16];
    ldata += sdata[tid + 8];
    ldata += sdata[tid + 4];
    ldata += sdata[tid + 2];
    ldata += sdata[tid + 1];
    sdata[tid] = ldata;
}
https://hlsl.godbolt.org/z/9o4r4nvnc
How can this be fixed? One way is to prefix every step with an if (tid < s) guard. Is there anything less hacky?
With Shader Model 6, you should use the wave intrinsics instead: WaveActiveSum, together with WaveReadLaneAt and WaveGetLaneIndex, lets you finish the reduction without touching shared memory at all.
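For example, here is a minimal sketch of a reduction that finishes with WaveActiveSum instead of the barrier-free tail. It assumes a hypothetical group size of 256, that WaveGetLaneCount() divides the group size, and that threads are packed into waves in SV_GroupIndex order (typical in practice, though not strictly guaranteed by the spec):

#define GROUP_SIZE 256
groupshared float sdata[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint tid : SV_GroupIndex)
{
    // ... fill sdata[tid], then reduce with barriers until one wave's
    // worth of partial sums remains ...
    uint lanes = WaveGetLaneCount();
    for (uint s = GROUP_SIZE / 2; s >= lanes; s >>= 1)
    {
        if (tid < s) sdata[tid] += sdata[tid + s];
        GroupMemoryBarrierWithGroupSync();
    }

    // The first wave finishes the reduction in registers: WaveActiveSum
    // adds the value across all active lanes of the wave, so no further
    // shared-memory traffic or barriers are needed.
    // (Assumes tid 0..lanes-1 form one wave; see the caveat above.)
    if (tid < lanes)
    {
        float total = WaveActiveSum(sdata[tid]);
        if (WaveGetLaneIndex() == 0)
            sdata[0] = total;
    }
}

Because the tail runs entirely in registers, the ordering of shared-memory accesses no longer matters for those last steps.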
The simple, but sad, answer is that this GTC talk is simply outdated and relied on undefined behaviour.
The HLSL specs state:
Threadgroup Memory is denoted in HLSL with the groupshared keyword. [...] Reads and writes to Threadgroup Memory, may occur in any order except as restricted by synchronization intrinsics or other memory annotations.
(https://microsoft.github.io/hlsl-specs/specs/hlsl.html)
So without synchronisation through group barriers or other memory annotations, the compiler is free to perform this sort of optimisation. Older dxc versions might not have done it; however, even if dxc doesn't, the optimisation might still be applied by the driver when creating the pipeline (the AMD driver in particular likes to do this).
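If you would rather not rely on wave intrinsics, the straightforward well-defined fix is to keep the barrier in every iteration, all the way down to s = 1. A sketch reusing the question's sdata, tid and groupDim_x:

// Fully synchronised loop: one barrier per iteration, but well-defined,
// since every shared-memory read is ordered by a preceding barrier.
for (unsigned int s = groupDim_x / 2; s > 0; s >>= 1)
{
    if (tid < s) sdata[tid] += sdata[tid + s];
    GroupMemoryBarrierWithGroupSync();
}
// sdata[0] now holds the sum for the whole group.

This costs a handful of extra barriers per group, which is usually a small price compared to depending on undefined behaviour.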