Tags: c++, gpu, openacc, pgi

Difference between the "kernels" and "parallel" directives in the OpenACC standard


I have been using the PGI compiler with OpenACC support to run code on the GPU for about 3 years, but I still cannot understand the difference between the terms "kernels" and "parallel". The OpenACC Getting Started Guide says:

Parallel Construct

Defines the region of the program that should be compiled for parallel execution on the accelerator device.

Kernels Construct

Defines the region of the program that should be compiled into a sequence of kernels for execution on the accelerator device.

I do not understand the difference between "parallel execution on the accelerator device" and "compiled into a sequence of kernels for execution on the accelerator device". If the accelerator device is a GPU, then all the code is compiled into CUDA kernels of some size (I mean the CUDA grid and blocks), and these kernels are executed on the GPU in CUDA threads, aren't they? What is a "sequence" of kernels? Does "parallel" produce one kernel, while "kernels" can produce a sequence of kernels from the same piece of code?

Also, I use only the "parallel loop" directive everywhere. For example, to parallelize a for loop for execution on the GPU, I write:

#pragma acc parallel loop gang vector copy(...) present(...)
  for(int i=0; i<N; ++i)
  {
    ...
  }

Is this correct? When should "kernels" be used? Or is it a synonym for "parallel" that is now deprecated and no longer used?


Solution

  • The best way to think of the difference is that with "parallel", you, the programmer, are defining which loops to parallelize and how; basically, you are telling the compiler to parallelize particular loops. With "kernels", you are defining a region of code that may be parallelized, and it is then the compiler's job to determine which loops to parallelize and how.

    For "parallel", all code within the region is offloaded as one CUDA kernel. If you have multiple outer loops within the "parallel" region, they will still be offloaded in one CUDA kernel. Since the compiler can discover the parallelize with "kernels", multiple loops within this region may be split into a sequence of separate CUDA kernel launches.

    Full details can be found at: https://www.pgroup.com/lit/articles/insider/v4n2a1.htm

    Note that access to the article does require you to have a PGI Web User account.
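    To illustrate, here is a minimal sketch of how the two directives treat a region containing two independent loops. The array names and sizes are hypothetical, not taken from the question:

      // Hypothetical data: assume float arrays a, b, c, d of length n.

      // With "kernels", the compiler analyzes the region and is free to
      // emit a sequence of CUDA kernels, typically one per loop:
      #pragma acc kernels copyin(b[0:n], d[0:n]) copyout(a[0:n], c[0:n])
      {
        for(int i=0; i<n; ++i)
          a[i] = 2.0f * b[i];

        for(int i=0; i<n; ++i)
          c[i] = d[i] + 1.0f;
      }

      // With "parallel", the whole region becomes a single CUDA kernel,
      // and the programmer marks each loop to be work-shared:
      #pragma acc parallel copyin(b[0:n], d[0:n]) copyout(a[0:n], c[0:n])
      {
        #pragma acc loop
        for(int i=0; i<n; ++i)
          a[i] = 2.0f * b[i];

        #pragma acc loop
        for(int i=0; i<n; ++i)
          c[i] = d[i] + 1.0f;
      }

    Running the program under a profiler such as nvprof typically shows two kernel launches for the "kernels" version and one for the "parallel" version, which is the "sequence of kernels" the guide refers to.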