opencl

OpenCL group size meaning


What does the local work-group size mean in OpenCL? If the work-group size is 1, does it mean that only one thread is running at the same time?

I found that we can set NULL in the cl_ndrange to let the implementation automatically select the group size. How can we know which group size was selected?


Solution

  • In an OpenCL kernel:

    get_local_size(0);
    

    gives you "8" for an 8x16 work-group-sized kernel, while

    get_local_size(1);
    

    gives "16" for the same kernel execution.

    If it is a 3D kernel, then "2" as the parameter gives the third dimension's size. To query the number of dimensions (1, 2, or 3):

    get_work_dim();
    
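    When you pass NULL for the local size and let the runtime choose, one way to see what was picked is to have the kernel itself report it. A minimal sketch (the kernel and buffer names here are hypothetical; read `out` back on the host after the kernel finishes):

```c
// OpenCL C kernel (illustrative): writes the implementation-chosen
// local size and the dimension count into a host-readable buffer.
__kernel void report_sizes(__global int *out)
{
    if (get_global_id(0) == 0) {
        out[0] = (int)get_local_size(0); // local size chosen for dimension 0
        out[1] = (int)get_work_dim();    // number of NDRange dimensions
    }
}
```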

    The local work-group size is the number of work-items (threads) that form one hardware synchronization scope: they can synchronize with each other through barriers and share local memory. For AMD GPUs the maximum is 256, while Nvidia GPUs and CPU devices can go up to 1024. When you give "1" as this work-group size, only a single work-item runs per work-group (if not restricted to a larger minimum by drivers or hardware), and each work-item gets the maximum share of local memory, since local memory is allocated per work-group.
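    Rather than assuming those limits, the host can query them. A hedged sketch using the standard clGetDeviceInfo and clGetKernelWorkGroupInfo calls (it assumes `device` and `kernel` are valid objects created elsewhere; this fragment needs an OpenCL SDK to build):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: print the device-wide and per-kernel work-group size limits.
   `device` and `kernel` are assumed to have been created already. */
void print_group_size_limits(cl_device_id device, cl_kernel kernel)
{
    size_t dev_max, kernel_max;

    /* Device-wide upper bound on work-group size (e.g. 256 or 1024). */
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(dev_max), &dev_max, NULL);

    /* Per-kernel bound, which may be lower than the device limit
       because of register or local-memory usage. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_max), &kernel_max, NULL);

    printf("device max: %zu, kernel max: %zu\n", dev_max, kernel_max);
}
```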

    At the same time? No: the whole NDRange is still distributed over the compute units, and each compute unit keeps multiple work-groups resident at once, which enables latency hiding and increases occupancy, especially when the work-group size is small and there is enough local memory. For example, if L=1, the hardware maximum number of work-groups can fold onto the same compute unit; if a compute unit can hold 1024 resident threads, L=64 lets only 16 work-groups fold onto it, and L=256 only 4. So yes, many threads run at the same time, just not within the same synchronization scope.

    Some vendors' models can also execute kernels concurrently to keep the pipelines fully occupied, so a single work-item may not be alone on the hardware: it can run alongside a work-item from another kernel on another command queue.

    For example, the upcoming R9 390X GPU will have 64 compute units, each with maybe 4 vector units of 16 arithmetic/FPU lanes, totaling 4096 cores. An AMD compute unit thus has 64 cores in total, and each core is time-shared by around 4 waves, giving roughly 256 resident threads per compute unit. This threading model can differ from a CPU's: context switching may be faster, and cache contention is kept to a minimum because threads are switched in groups.