openacc

How to have the same routine executed sometimes by the CPU and sometimes by the GPU with OpenACC?


I'm dealing with a routine that I want executed by the CPU the first time it is called and by the GPU every time after that. The routine contains this loop:

for (k = kb; k <= ke; k++) {
  for (j = jb; j <= je; j++) {
    for (i = ib; i <= ie; i++) {
      ...
    }
  }
}

I tried adding #pragma acc loop collapse(3) to the loop and #pragma acc routine(routine) vector just before the calls where I want the GPU to execute the routine. -Minfo=accel reports no messages, and Nsight Systems shows that the routine is always executed by the CPU, so this approach doesn't work. Why is the compiler ignoring both pragmas?


Solution

  • To follow up on Thomas' answer, here's an example of using the "if" clause:

    % cat test.c
    #include <stdlib.h>
    #include <stdio.h>
    
    void compute(int * Arr, int size, int use_gpu) {
    
    #pragma acc parallel loop copyout(Arr[:size]) if(use_gpu)
        for (int i=0; i < size; ++i) {
            Arr[i] = i;
        }
    }
    
    int main() {
    
       int *Arr;
       int size;
       int use_gpu;
    
       size=1024;
       Arr = (int*) malloc(sizeof(int)*size);
    
    // Run on the host
       use_gpu=0;
       compute(Arr,size,use_gpu);
    
    // Run on the GPU
       use_gpu=1;
       compute(Arr,size,use_gpu);
    
       free(Arr);
    
    }
    % nvc -acc -Minfo=accel test.c
    compute:
          4, Generating copyout(Arr[:size]) [if not already present]
             Generating NVIDIA GPU code
              7, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    % setenv NV_ACC_TIME 1
    % a.out
    
    Accelerator Kernel Timing data
    test.c
      compute  NVIDIA  devicenum=0
        time(us): 48
        4: compute region reached 1 time
            4: kernel launched 1 time
                grid: [8]  block: [128]
                 device time(us): total=5 max=5 min=5 avg=5
                elapsed time(us): total=331 max=331 min=331 avg=331
        4: data region reached 2 times
            9: data copyout transfers: 1
                 device time(us): total=43 max=43 min=43 avg=43
    

    I'm compiling with nvc and enabling its runtime profiler (NV_ACC_TIME=1) to show that the kernel is launched only once, even though compute is called twice.
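The same idea can be applied to the triply nested loop from the question. On why the original attempt had no effect: a routine directive only marks a function as callable from device code and does not offload anything by itself, while a loop directive needs an enclosing compute construct (for example parallel) to generate a kernel. The following is only a sketch under assumptions: the function name `routine`, the flat array `a`, its dimensions `nx`, `ny`, `nz`, the zero-based bounds, and the loop body are all hypothetical stand-ins, since the question does not show them.

```c
/* Hypothetical stand-in for the routine in the question. It fills a 3D
   block stored contiguously in a (nx*ny*nz elements). The real loop
   bounds (kb..ke, jb..je, ib..ie) and body are not shown in the question. */
void routine(double *a, int nx, int ny, int nz, int use_gpu)
{
    /* if(use_gpu): offload the region when use_gpu is nonzero,
       otherwise run the same loop nest on the host.
       collapse(3): treat the three tightly nested loops as one
       iteration space. */
    #pragma acc parallel loop collapse(3) copyout(a[0:nx*ny*nz]) if(use_gpu)
    for (int k = 0; k < nz; k++) {
        for (int j = 0; j < ny; j++) {
            for (int i = 0; i < nx; i++) {
                a[(k * ny + j) * nx + i] = i + j + k;
            }
        }
    }
}
```

The caller would pass 0 on the first call and 1 on later calls, as in the test.c example above. When built without OpenACC enabled, the pragma is ignored and the loop always runs on the host.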