gpgpu hlsl compute-shader directcompute

Num Threads trade-off in non-parallelizable work


I've been a good boy and parallelized my compute shader to execute 955 threads for 20 iterations

[numthreads(955, 1, 1)]
void main( uint3 pos : SV_DispatchThreadID )
{
    ...
    for (uint i = 0; i < 20; i++)
    {
        GroupMemoryBarrierWithGroupSync();
        //read from and write to groupshared memory
    }
}

But this isn't going to work out (because the parallelization introduces a realtime delay), so I have to do it in a less parallel way. The easy way to approach the problem is to have 20 threads doing 955 iterations each:

[numthreads(20, 1, 1)]
void main( uint3 pos : SV_DispatchThreadID )
{
    ...
    for (uint i = 0; i < 955; i++)
    {
        GroupMemoryBarrierWithGroupSync();
        //read from and write to groupshared memory
    }
}

However, I can't reason about how this is going to perform (probably terribly).

I understand that under this new approach I must keep the number of iterations the same, but I can trade off the frequency with which I call the compute shader against the number of threads. Which gives me two options:

  • Maybe accessing groupshared memory is very cheap, and so I don't have a performance problem in the first place.

  • Maybe I should try to optimize this on the CPU instead (I've already tried an unoptimized version and the performance was less than desired).
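One way to read that trade-off is to split the 955 sequential iterations across several smaller dispatches, carrying the intermediate state in a UAV between them. The sketch below is only an illustration of that idea, not part of the original shader; the `IterationWindow` constant buffer and the `gState` buffer are names I made up for the example.

```hlsl
// Hypothetical sketch: instead of one dispatch in which 20 threads run all
// 955 iterations, the host issues several smaller dispatches, and the
// intermediate state lives in a RWStructuredBuffer between them.

cbuffer IterationWindow : register(b0)
{
    uint gFirstIteration; // set by the host before each Dispatch call
    uint gIterationCount; // e.g. 191, so 5 dispatches cover all 955
};

RWStructuredBuffer<float> gState : register(u0); // persists across dispatches

groupshared float sScratch[20];

[numthreads(20, 1, 1)]
void main( uint3 pos : SV_DispatchThreadID )
{
    // reload the state this thread left behind in the previous dispatch
    sScratch[pos.x] = gState[pos.x];

    for (uint i = gFirstIteration; i < gFirstIteration + gIterationCount; i++)
    {
        GroupMemoryBarrierWithGroupSync();
        // read from and write to groupshared memory as before
    }

    // write the intermediate result back out for the next dispatch
    gState[pos.x] = sScratch[pos.x];
}
```

On the host side this would mean calling `Dispatch(1, 1, 1)` several times per frame, updating `gFirstIteration` before each call. Whether this actually helps depends on whether the extra dispatch and UAV round-trip overhead is cheaper than keeping one long-running group resident.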


Solution

  • Someone commented on this answer:

    To be specific, a single-thread group will generally cap utilization to around 3-6%. Dispatching only one group compounds the issue, capping utilization to well under 1%. Sticking to 256 threads with power-of-two dimension sizes is a good rule of thumb, and you should dispatch at least 2048 or so threads total to keep the hardware busy.

    and I decided that doing this work on the GPU is a stupid thing to do. It's always best to look for robust solutions.

    The robust solution for my problem is to use SIMD, which I will now have to learn the hard way.