I'm a bit confused about compute units. I have an Nvidia GTX 1650 Ti graphics card. When I query max_compute_units, it returns 16, and max_work_group_size is 1024. But when I execute this kernel:
int i = get_global_id(0);
result[i] = get_local_id(0);
I get repeating local IDs in the range 0 to 255. How does this relate to the max_compute_units value returned by the graphics card? Is the max_compute_units value wrong, and does the GPU actually have more compute units than it reports? Or does OpenCL's get_local_id have its own distribution logic that is not tied to the hardware? Thanks!
OpenCL compute units correspond to streaming multiprocessors (SMs) on Nvidia GPUs or compute units (CUs) on AMD GPUs. Each Nvidia SM contains 128 CUDA cores (Pascal and earlier) or 64 CUDA cores (Turing/Volta); each AMD CU contains 64 stream processors. These are hardware properties: the more SMs/CUs, the faster the GPU (within the same microarchitecture). Your GTX 1650 Ti is a Turing card, so its 16 compute units hold 16 × 64 = 1024 CUDA cores in total.
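If you want to verify this on the host side, here is a minimal sketch (it assumes you already have a valid cl_device_id from clGetDeviceIDs, and that the 64-cores-per-SM figure for Turing applies; other architectures differ):

#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

// device: a valid cl_device_id obtained earlier via clGetDeviceIDs
void print_compute_units(cl_device_id device) {
    cl_uint compute_units = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    // 64 CUDA cores per SM is Turing-specific
    printf("compute units (SMs): %u\n", compute_units);
    printf("estimated CUDA cores: %u\n", compute_units * 64);
}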
The work group size / local ID refers to how you group threads in software into so-called thread blocks. Thread blocks are useful for matrix multiplication, for example, because within a thread block the threads can communicate via shared memory. Thread blocks can have different sizes (sort of an optimization parameter: 32, 64, 128, 256, 512 or 1024 (max_work_group_size)); depending on your GPU, some intermediate values may also work. On the hardware (at least for Nvidia), thread blocks are executed as so-called warps (groups of 32 threads) on the SMs. On Turing, one SM can compute 2 warps simultaneously. If you choose a thread block size of 16, each warp only computes 16 threads while the other 16 are idle, so you only get half the performance.
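To illustrate the shared-memory point, here is a hypothetical kernel in which the threads of one work group cooperate via __local memory to sum their inputs (it assumes a work group size of 256):

__kernel void group_sum(__global const int* in, __global int* out) {
    __local int scratch[256];               // shared by the 256 threads of one work group
    const int lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);           // all threads of the group sync here
    for (int offset = 128; offset > 0; offset >>= 1) {  // tree reduction in shared memory
        if (lid < offset) scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) out[get_group_id(0)] = scratch[0];    // one partial sum per work group
}

Each thread writes its value into scratch, the barriers make those writes visible across the whole group, and thread 0 writes out the group's result.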
In your example, the local ID (the index within the thread block) runs between 0 and 255, so your thread block size is 256. You define the thread block size in the kernel call as the "local range". max_work_group_size does not correlate with max_compute_units in any way; both are hardware / driver limitations.
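For reference, a minimal host-side sketch of such a kernel call (queue, kernel and the global size of 4096 are just placeholder names/values):

size_t global_size = 4096;   // total number of work items (hypothetical)
size_t local_size = 256;     // the "local range": get_local_id(0) then runs from 0 to 255
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

If you pass NULL instead of &local_size, the driver picks a work group size for you, which would explain the 0 to 255 range if you never set it explicitly.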