cuda gpu nvidia openacc dynamic-parallelism

Nested Directives in OpenACC

I'm trying to use nested feature of OpenACC to active dynamic parallelism of my gpu card. I've Tesla 40c and my OpenACC compiler is PGI version 15.7.

My code is so simple. When I try to compile following code compiler returns me these messages

PGCC-S-0155-Illegal context for pragma: acc  parallel loop (test.cpp: 158)
PGCC/x86 Linux 15.7-0: compilation completed with severe errors

My code structure:

#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
  // << computation >>

  int ss = A[tid].start;
  int ee = A[tid].end;

  #pragma acc parallel loop
  for(j = ss; j< ( ee + ss); j++)
  {
    // << computation >>
  }

I've also tried to change my code to use routine directives. But I couldn't compile again

#pragma acc routine workers
foo(...)
{

  #pragma acc parallel loop
  for(j = ss; j< ( ee + ss); j++)
  {
    // << computation >>
  }
}

#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
  // << computation >>

  int ss = A[tid].start;
  int ee = A[tid].end;

  foo(...);

}

I've tried of course only with routine (seq,worker,gang) without inner parallel loop directive. It has been compiler but dynamic parallelism hasn't been activated.

    37, Generating acc routine worker
         Generating Tesla code
         42, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
         Loop is parallelizable

How am I supposed to use dynamic parallelism in OpenACC?

Solution

How am I supposed to use dynamic parallelism in OpenACC?

Although nested regions (which would presumably use dynamic parallelism) is a new feature in the OpenACC 2.0 specification, I don't believe it is implemented yet in PGI 15.7. PGI 15.7 represents a partial implementation of the OpenACC 2.0 specification.

This limitation is documented in the PGI 15.7 release notes that should ship with your PGI 15.7 compiler (pgirn157.pdf) in section 2.7 (those release notes are currently available here):

OpenACC 2.0 Missing Features

‣ The declare link directive for global data is not implemented.

‣ Nested parallelism (parallel and kernels constructs within a parallel or kernels region) is not implemented.

Based on the comments, there is some concern about #pragma acc routine worker, so here is a fully worked example with PGI 15.7 of that:

$ cat t1.c
#include <stdio.h>
#include <stdlib.h>
#define D1 4096
#define D2 4096
#define OFFS 2

#pragma acc routine worker
void my_set(int *d, int len, int val){
  int i;
  for (i = 0; i < len; i++) d[i] += val+OFFS;
}

int main(){


  int i,*data;
  data = (int *)malloc(D1*D2*sizeof(int));
  for (i = 0; i < D1*D2; i++) data[i] = 1;

#pragma acc kernels copy(data[0:D1*D2])
  for (i = 0; i < D1; i++)
    my_set(data+(i*D2), D2, 1);

  printf("%d\n", data[0]);

  return 0;
}
$ pgcc -acc -ta=tesla -Minfo=accel t1.c -o t1
my_set:
      8, Generating acc routine worker
         Generating Tesla code
         10, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
         Loop is parallelizable
main:
     20, Generating copy(data[:16777216])
     21, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         21, #pragma acc loop gang /* blockIdx.x */
$ ./t1
4
$

Note that the gang parallelism has been performed at the outer loop, and the worker parallelism has been performed in the inner (routine) loop.

This method does not depend on dynamic parallelism (instead, it relies on a partitioning of parallelism between worker at the routine level and gang at the caller level) and will not invoke dynamic parallelism.

The native use of dynamic parallelism (CDP) is not supported currently in PGI 15.7. It should be possible to call (i.e. interoperate with) other functions (e.g. CUDA, or libraries) that make use of CDP from OpenACC code, but currently, natively it is not used (and not supported) in PGI 15.7