I've been a good boy and parallelized my compute shader to execute 955 threads for 20 iterations
[numthreads(955, 1, 1)]
void main( uint3 pos : SV_DispatchThreadID )
{
...
for (uint i = 0; i < 20; i++)
{
GroupMemoryBarrierWithGroupSync();
//read from and write to groupshared memory
}
}
But this isn't going to work out (the parallelization introduces a real-time delay), so I have to do it in a less parallel way. The easy approach is to have 20 threads doing 955 iterations each:
[numthreads(20, 1, 1)]
void main( uint3 pos : SV_DispatchThreadID )
{
...
for (uint i = 0; i < 955; i++)
{
GroupMemoryBarrierWithGroupSync();
//read from and write to groupshared memory
}
}
However, I can't reason about how this is going to perform (probably terribly).
I understand that under this new approach I must keep the number of iterations the same, but I can trade off how often I call the compute shader against the number of threads. That gives me two options:

20 -> 32 to have a full warp (a sketch of this variant is below)
20 -> 32 * n to have several warps running in parallel

Maybe accessing groupshared memory is very cheap and so I don't have a performance problem in the first place.
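For the first option (a full warp), a minimal sketch of what I have in mind. gChannelCount, sharedData and the placeholder work are made up here; only the numthreads value and the loop/barrier structure are the point:

cbuffer Params
{
    uint gChannelCount; // how many of the 32 lanes carry real work this dispatch
};

groupshared float sharedData[32];

[numthreads(32, 1, 1)]
void main( uint3 pos : SV_DispatchThreadID )
{
    bool active = pos.x < gChannelCount;

    if (active)
        sharedData[pos.x] = 0.0f;   // placeholder initialisation

    for (uint i = 0; i < 955; i++)
    {
        // every lane must reach the barrier, including the idle ones,
        // which is why the barrier sits outside the divergent branch
        GroupMemoryBarrierWithGroupSync();

        if (active)
        {
            // read from and write to groupshared memory (placeholder work)
            sharedData[pos.x] += 1.0f;
        }
    }
}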
Maybe I should try to optimize this on the CPU instead (I've already tried an unoptimized CPU version and the performance was less than desired).
Someone commented on this answer:
To be specific, a single-thread group will generally cap utilization to around 3-6%. Dispatching only one group compounds the issue, capping utilization to well under 1%. Sticking to 256 threads with power-of-two dimension sizes is a good rule of thumb, and you should dispatch at least 2048 or so threads total to keep the hardware busy.
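For reference, the sizing that comment describes would look roughly like this; gOutput and the per-thread work are placeholders I made up, and the Dispatch call lives on the CPU side:

// 256 threads per group; dispatch at least 8 groups, e.g.
// context->Dispatch(8, 1, 1), so that 2048+ threads are in flight.
RWStructuredBuffer<float> gOutput;

[numthreads(256, 1, 1)]
void main( uint3 pos : SV_DispatchThreadID )
{
    // 8 groups x 256 threads = 2048 threads; pos.x covers 0..2047.
    // Each group has its own groupshared memory, so a barrier only
    // synchronizes the 256 threads inside one group.
    gOutput[pos.x] = (float)pos.x;   // placeholder work
}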
and I decided that doing this work on the GPU is a stupid thing to do. It's always best to look for robust solutions.
The robust solution for my problem is to use SIMD, which I will now have to learn the hard way.