nvidia openacc pgi

Data Clauses (output is zero when I use OpenACC)


I want to reduce the runtime of my code by using OpenACC, but unfortunately when I use OpenACC the output becomes zero.

sajad.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <assert.h>
#include <openacc.h>
#include<time.h>
#include <string.h>
#include <malloc.h>

#define NX 201
#define NY 101
#define NZ 201
int main(void)
{
    int  i, j, k, l, m;
    static double   tr, w;
    static double  dt = 9.5e-9, t;
    static double cu[NZ];
    static double AA[NX][NY][NZ] , CC[NX][NY][NZ] , BB[NX][NY][NZ] ;
    static double A[NX][NY][NZ] , B[NX][NY][NZ] , C[NX][NY][NZ] ;
    FILE *file;
    file = fopen("BB-and-A.csv", "w");
    t = 0.;
    #pragma acc  data  copyin( tr, w,dt, t),copy(B ,A , C,AA , CC,BB,cu )
    {
        for (l = 1; l < 65; l++) {
            #pragma acc kernels loop private(i, j,k)
            for (i = 1; i < NX - 1; i++) {
                for (j = 0; j < NY - 1; j++) {
                    for (k = 1; k < NZ - 1; k++) {
                        A[i][j][k] = A[i][j][k]
                        + 1. * (B[i][j][k] - AA[i][j][k - 1]);
                    }
                }
            }
            #pragma acc kernels loop private(i, j,k)
            for (i = 1; i < NX - 1; i++) { /* BB */
                for (j = 1; j < NY - 1; j++) {
                    for (k = 0; k < NZ - 1; k++) {
                        B[i][j][k] =  B[i][j][k]
                        + 1.* (BB[i][j][k] - A[i - 1][j][k]);

                    }
                }
            }
            #pragma acc kernels
            for (m = 1; m < NZ - 1; m++) {
                tr = t - (double)(m)*5 / 1.5e8;
                if (tr <= 0.)
                    cu[m] = 0.;
                else {
                    w = (tr / 0.25e-6)*(tr / 0.25e-6);
                    cu[m] =1666*w / (w + 1.)*exp(-tr / 2.5e-6) ;
                    cu[m] = 2*cu[m];
                }
                A[10][60][m] = -cu[m];
            }
            #pragma acc update self(B)
            fprintf(file, "%e, %e \n", t*1e6,  -B[22][60][10] );
            t = t + dt;
        }
    }
    fclose(file);
}

Solution

  • The problem here is the "copyin( tr, w,dt, t)" clause, and in particular the "t" variable. By putting these scalars in a data clause, you become responsible for synchronizing the host and device copies. Hence, when you update the variable on the host (i.e. "t = t + dt;"), you then need to update the device copy with the new value. Since that never happens, the device copy of "t" stays at its initial value of 0, so "tr" is never positive, "cu[m]" is always set to 0, and the output is zero.

    Also, there's a potential race condition on "tr" (and likewise on "w") since the device code will now use the shared device variable instead of a private copy.

    The easiest thing to do, though, is to simply not put these scalars in a data clause. By default, OpenACC privatizes scalars so there's no need to manage them yourself. In t's case, its value will be passed as an argument to the CUDA kernel.

    To fix your code, change:

    #pragma acc  data  copyin( tr, w,dt, t),copy(B ,A , C,AA , CC,BB,cu ) 
    

    to:

    #pragma acc  data  copy(B ,A , C,AA , CC,BB,cu )  
    

    Note that there's no need to put the loop indices in a private clause since they are implicitly private.
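
The two options discussed above can be sketched on a reduced example. This is an illustration only, not the original program: the names `N`, `run_scalars_implicit`, `run_with_update` and `run_host_reference` are hypothetical, and the source term is simplified to `tr * 1e6`. The first function leaves the scalars out of the data clause (the recommended fix); the second keeps "t" in a data clause and, as an assumption about how one would do the manual synchronization, pushes each new value with `update device(t)`. Both should match the plain host loop.

```c
#include <assert.h>
#include <stddef.h>

#define N 8       /* hypothetical small problem size */
#define STEPS 64  /* same number of time steps as in the question */

/* Recommended form: only the array is in the data clause. t and dt stay
   ordinary host variables, so each kernel launch sees the current t. */
void run_scalars_implicit(double *out)
{
    double dt = 9.5e-9, t = 0.0;
    assert(out != NULL);
    #pragma acc data copy(out[0:N])
    {
        for (int l = 0; l < STEPS; l++) {
            #pragma acc kernels loop
            for (int m = 0; m < N; m++) {
                /* declared inside the loop, so each iteration has its own tr */
                double tr = t - (double)m * 5 / 1.5e8;
                out[m] = (tr <= 0.0) ? 0.0 : tr * 1e6;
            }
            t = t + dt;
        }
    }
}

/* Alternative: t is in a data clause, so its device copy must be
   refreshed by hand every time the host value changes. */
void run_with_update(double *out)
{
    double dt = 9.5e-9, t = 0.0;
    assert(out != NULL);
    #pragma acc data copyin(t) copy(out[0:N])
    {
        for (int l = 0; l < STEPS; l++) {
            #pragma acc kernels loop
            for (int m = 0; m < N; m++) {
                double tr = t - (double)m * 5 / 1.5e8;
                out[m] = (tr <= 0.0) ? 0.0 : tr * 1e6;
            }
            t = t + dt;
            /* without this line the device keeps computing with t == 0 */
            #pragma acc update device(t)
        }
    }
}

/* Plain serial loop used as the reference result. */
void run_host_reference(double *out)
{
    double dt = 9.5e-9, t = 0.0;
    assert(out != NULL);
    for (int l = 0; l < STEPS; l++) {
        for (int m = 0; m < N; m++) {
            double tr = t - (double)m * 5 / 1.5e8;
            out[m] = (tr <= 0.0) ? 0.0 : tr * 1e6;
        }
        t = t + dt;
    }
}
```

Removing the `update device(t)` line from the second function should reproduce the all-zero output from the question when built with an OpenACC compiler, which is why the first form is the less error-prone choice.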