opencl

What is the formula for choosing global_work_size for maximum parallelism in OpenCL for a 1D workload?


Relevant information from clinfo:

  Max compute units                               10
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple (kernel)     32

Say I want to "brute force" trying a bunch of random values. I'm talking about basically dispatching in batches:

loop {
  clEnqueueNDRangeKernel(...)
}
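
Each iteration boils down to a single 1D dispatch. Here's a minimal sketch of one batch, assuming `queue`, `kernel` and the two sizes are created elsewhere (the helper name is just for illustration):

#include <CL/cl.h>

/* One batch: a single 1D dispatch followed by a blocking wait.
   `queue`, `kernel` and the two sizes are set up elsewhere. */
cl_int run_batch(cl_command_queue queue, cl_kernel kernel,
                 size_t global_work_size, size_t local_work_size) {
  cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                      1,     /* work_dim: 1D */
                                      NULL,  /* no global offset */
                                      &global_work_size,
                                      &local_work_size,
                                      0, NULL, NULL);
  if (err != CL_SUCCESS) return err;
  return clFinish(queue); /* wait so the host can prepare the next batch */
}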

Is there some kind of formula that can figure out, per hardware configuration, the ideal global_work_size and local_work_size for 1D workloads? Is there some rule of thumb?

let max_compute_units = 10;
// One max-size workgroup per compute unit.
let global_work_size = 256 * max_compute_units;
let local_work_size = 256;
// The global range must divide evenly into workgroups,
// and the workgroup size should be a multiple of the preferred multiple (32).
assert!(global_work_size % local_work_size == 0);
assert!(local_work_size % 32 == 0);

To me, this seems to leave a lot of performance on the table.

I've tried various combinations, but I can't figure out the right balance of how the host should dispatch work to the GPU for maximum parallelism in a 1D workload.


Solution

  • The local workgroup size should be 32 or a multiple of 32 here, otherwise you won't use the hardware fully and some cores per CU will remain idle. Using workgroup size 16 on an Nvidia Pascal GPU, for example, leaves half of the CUDA cores idle at any time, so you only get half of the compute throughput. The most compatible, one-size-fits-all workgroup size across all devices is 64, as AMD GPUs need at least 64 or a multiple thereof.

    If your OpenCL kernel doesn't use local memory, the performance difference between workgroup sizes 32, 64, 128, and 256 is usually negligible. (A sketch of picking the local size from the limits the runtime reports follows after this list.)

  • The 1D global range should be much, much larger than the GPU core count for full hardware saturation. The exact value is not important, but it has to be a clean multiple of the workgroup size, usually obtained by rounding the intended global range up to the next multiple (see the rounding sketch below). It's not unusual to have a global range in the millions or even billions; GPUs handle this just fine.

    Example: a GPU with 320 cores computes 320 threads at a time. If you have a global range of 512, the GPU needs 2 iterations of your kernel, and the second iteration only partially saturates it. On top of that comes ~10 µs of dispatch latency from the PCIe bus. Performance will be < 512/(2*320) = 80% of peak. But if you choose a global range >> GPU core count, say 1000x higher, the incomplete last iteration and the dispatch latency become negligible compared to the several milliseconds of kernel execution time, and performance approaches 100% of peak.
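
For the local size, here is a minimal sketch of deriving it from what the runtime reports, assuming you already have the device and kernel handles (the fallback of 64 is the rule of thumb above, not something the API mandates):

#include <CL/cl.h>

/* Pick a 1D local size: start from 64 (works well on Nvidia, AMD and Intel),
   round it up to the kernel's preferred multiple if needed, and cap it at
   the device's maximum workgroup size. Error checks omitted for brevity. */
size_t pick_local_size(cl_device_id device, cl_kernel kernel) {
  size_t preferred = 0, max_wg = 0;
  clGetKernelWorkGroupInfo(kernel, device,
                           CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                           sizeof(preferred), &preferred, NULL);
  clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                  sizeof(max_wg), &max_wg, NULL);
  size_t local = 64;
  if (preferred > 0 && local % preferred != 0)
    local = ((local + preferred - 1) / preferred) * preferred;
  if (local > max_wg) local = max_wg;
  return local;
}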
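
For the global range, round the intended problem size up to the next multiple of the local size and simply make it big; the 10,000,000 below is just an illustrative batch size, not a magic number:

#include <stddef.h>

/* Round n up to the next multiple of local_work_size so the NDRange
   divides evenly into workgroups. */
size_t round_up_global(size_t n, size_t local_work_size) {
  return ((n + local_work_size - 1) / local_work_size) * local_work_size;
}

/* Example: rather than 256 * 10 CUs = 2560 work items per dispatch,
   test millions of random candidates per batch:
   size_t global_work_size = round_up_global(10000000, 256);  // -> 10000128 */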