I'm dealing with a routine which I want the first time to be executed by the CPU and every other time by the GPU. This routine contains the loop:
for (k = kb; k <= ke; k++){
for (j = jb; j <= je; j++){
for (i = ib; i <= ie; i++){
...
}}}
I tried with adding #pragma acc loop collapse(3)
to the loop and #pragma acc routine(routine) vector
just before the calls where I want the GPU to execute the routine. -Minfo=accel
doesn't report any message and with Nsight-System I see that the routine is always executed by the CPU so in this way it doesn't work.
Why the compiler is reading neither of the two #pragma
?
To follow on to Thomas' answer, here's an example of using the "if" clause:
% cat test.c
#include <stdlib.h>
#include <stdio.h>
void compute(int * Arr, int size, int use_gpu) {
#pragma acc parallel loop copyout(Arr[:size]) if(use_gpu)
for (int i=0; i < size; ++i) {
Arr[i] = i;
}
}
int main() {
int *Arr;
int size;
int use_gpu;
size=1024;
Arr = (int*) malloc(sizeof(int)*size);
// Run on the host
use_gpu=0;
compute(Arr,size,use_gpu);
// Run on the GPU
use_gpu=1;
compute(Arr,size,use_gpu);
free(Arr);
}
% nvc -acc -Minfo=accel test.c
compute:
4, Generating copyout(Arr[:size]) [if not already present]
Generating NVIDIA GPU code
7, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
% setenv NV_ACC_TIME 1
% a.out
Accelerator Kernel Timing data
test.c
compute NVIDIA devicenum=0
time(us): 48
4: compute region reached 1 time
4: kernel launched 1 time
grid: [8] block: [128]
device time(us): total=5 max=5 min=5 avg=5
elapsed time(us): total=331 max=331 min=331 avg=331
4: data region reached 2 times
9: data copyout transfers: 1
device time(us): total=43 max=43 min=43 avg=43
I'm using nvc and set the compiler's runtime profiler (NV_ACC_TIME=1) to show that the kernel is launched only once.