I would like to clearly understand a situation I have often faced when accelerating an application with OpenACC. Let's say I have this loop:
#pragma acc parallel loop collapse(4)
for (k = KBEG; k <= KEND; k++){
  for (j = JBEG; j <= JEND; j++){
    for (i = IBEG; i <= IEND; i++){
      for (nv = 0; nv < NVAR; nv++) A0[k][j][i][nv] = data->A[k][j][i][nv];
    }
  }
}
where data is a variable of a structure type:
typedef struct Data_{
double ****A;
double ****B;
} Data;
I noticed that, with or without Unified Memory (-ta=tesla:managed), I get an error at run time: error 700: Illegal address during kernel execution.
I attributed the problem to the deep copy issue I have read about in the literature: the implicit copy done by the compiler is a shallow copy of A, which holds an address in host memory, and not a copy of the data it points to. The device cannot read that host address, and this generates the error.
Is the deep copy problem the correct interpretation of my error?
Moreover, if I'm using Unified Memory and it is indeed a deep copy problem, shouldn't the device be able to read the address, since A resides, at least virtually, in the unified memory and address space?
I can easily resolve the error by adding the directive:
#pragma acc enter data copyin(data)
and adding present(data)
to the parallel pragma. Note that I don't need to copy A and B manually.
I would like to understand the reason of both the problem and the solution.
Unified memory is only available for dynamically allocated (heap) memory. I'm assuming that "data" itself is not dynamically allocated? In that case, you do need to include it in a data region, and you should add the "present" clause so the compiler doesn't try to implicitly copy it.
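As a rough sketch of what that could look like, here is a simplified version of the case above. It rests on several assumptions that are not from the original post: the struct members are flat double* arrays rather than double****, the struct is a stack variable named d, and copy_on_device is a made-up helper. The member array is also copied explicitly so that the sketch does not depend on managed memory; with -ta=tesla:managed the heap allocations are already reachable from the device, so that part should be redundant.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for the Data struct in the question:
   one level of indirection instead of four (an assumption for brevity). */
typedef struct Data_ {
    double *A;
    double *B;
} Data;

/* Hypothetical helper: copies d.A into A0 inside an OpenACC kernel and
   returns the sum of A0 so the result can be checked on the host.
   Returns -1.0 if an allocation fails. */
double copy_on_device(int n)
{
    assert(n > 0);

    Data d;                                  /* the struct itself is on the stack, not the heap */
    d.A = malloc((size_t)n * sizeof(double));
    d.B = NULL;                              /* unused in this sketch */
    double *A0 = malloc((size_t)n * sizeof(double));
    if (d.A == NULL || A0 == NULL) {
        free(d.A);
        free(A0);
        return -1.0;
    }

    for (int i = 0; i < n; i++) d.A[i] = (double)i;

    /* Put the struct in an unstructured data region (shallow copy of d). */
    #pragma acc enter data copyin(d)
    /* Copy the member array and attach it to the device copy of d.
       Needed without managed memory; assumed harmless with it. */
    #pragma acc enter data copyin(d.A[0:n])

    /* present(d) tells the compiler not to copy d implicitly. */
    #pragma acc parallel loop present(d) copyout(A0[0:n])
    for (int i = 0; i < n; i++) A0[i] = d.A[i];

    /* Release the member first, then the struct that points to it. */
    #pragma acc exit data delete(d.A[0:n])
    #pragma acc exit data delete(d)

    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += A0[i];

    free(d.A);
    free(A0);
    return sum;
}
```

Calling copy_on_device(4) should return 6.0 (0 + 1 + 2 + 3). If the file is built without OpenACC enabled, the pragmas are ignored and the same code simply runs on the host.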