OpenACC Declare Construct

While working through the OpenACC 2.6 features supported by the PGI compilers, I ran into an issue with memory management between the CPU and the GPU.

The following Fortran code is a modified version of an example from the official documentation:

module data
  integer, parameter :: maxl = 100000
  real, dimension(maxl) :: xstat
  real, dimension(:), allocatable :: yalloc
  !$acc declare create(xstat,yalloc)
end module

module useit
  use data
contains
  subroutine compute(n)
     integer :: n
     integer :: i
     !$acc parallel loop present(yalloc)
     do i = 1, n
        yalloc(i) = iprocess(i)
     enddo
  end subroutine
  real function iprocess(i)
     !$acc routine seq
     integer :: i
     iprocess = yalloc(i) + 2*xstat(i)
  end function
end module

program main

  use data
  use useit

  implicit none

  integer :: nSize = 100
  !---------------------------------------------------------------------------

  call allocit(nSize)
  call initialize

  call compute(nSize)

  !$acc update self(yalloc) 
  write(*,*) "yalloc(10)=",yalloc(10) ! should be 3

  call finalize
  
contains
  subroutine allocit(n)
    integer :: n
    allocate(yalloc(n))
  end subroutine allocit
  
  subroutine initialize
    xstat = 1.0
    yalloc = 1.0
    !$acc enter data copyin(xstat,yalloc)
  end subroutine initialize

  subroutine finalize

    deallocate(yalloc)
    
  end subroutine finalize
  
end program main

This code can be compiled with nvfortran:

nvfortran -Minfo test.f90

and it shows the expected value on CPU:

yalloc(10)=    3.000000

However, when compiled with OpenACC:

nvfortran -acc -Minfo test.f90

the code does not show the correct output:

upload CUDA data  device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data  device=0 threadid=1 variable=.attach. bytes=8
upload CUDA data  file=/home/yang/GPU-Collection/openacc/basics/globalArray.f90 function=initialize line=55 device=0 threadid=1 variable=.attach. bytes=8
launch CUDA kernel  file=/home/yang/GPU-Collection/openacc/basics/globalArray.f90 function=compute line=14 device=0 threadid=1 num_gangs=1 num_workers=1 vector_length=128 grid=1 block=128
download CUDA data  file=/home/yang/GPU-Collection/openacc/basics/globalArray.f90 function=main line=41 device=0 threadid=1 variable=yalloc bytes=400
 yalloc(10)=    0.000000

I have tried adding explicit data movement in several places, but nothing helped. This is really confusing to me.

Solution

The problem is in your initialize routine. Replace the "enter data copyin" directive with an update directive:

  subroutine initialize
    xstat = 1.0
    yalloc = 1.0
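    ! Disabled (sentinel changed to a plain comment): copyin does not
    ! transfer data that is already present on the device.
    !acc enter data copyin(xstat,yalloc)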
    !$acc update device(xstat,yalloc)
  end subroutine initialize

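Since xstat and yalloc are already present on the device because of the declare create directive (for the allocatable yalloc, the device copy is created when the array is allocated), the "enter data copyin" directive finds the data present and performs no transfer; it only updates the reference counter. The device copies therefore never receive the values assigned on the host. Instead, you need to use an update directive to synchronize the data.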

With this change, the code produces the correct answer:

% nvfortran test.f90 -acc -Minfo=accel; a.out
compute:
     14, Generating Tesla code
         15, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
iprocess:
     19, Generating acc routine seq
         Generating Tesla code
main:
     41, Generating update self(yalloc(:))
initialize:
     56, Generating update device(yalloc(:),xstat(:))
 yalloc(10)=    3.000000
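
To see the same pattern in isolation, here is a minimal sketch, not taken from the original code. The module name declare_demo, the array y, and the size n = 8 are made up for illustration. It assumes a compiler that creates the device copy of a declare create allocatable when the array is allocated, as described above, so the only host-to-device transfer needed is the explicit update device.

```fortran
module declare_demo
  implicit none
  ! Hypothetical module-level array, kept on the device via declare create
  real, dimension(:), allocatable :: y
  !$acc declare create(y)
end module declare_demo

program demo
  use declare_demo
  implicit none
  integer, parameter :: n = 8
  integer :: i

  allocate(y(n))           ! with declare create, this also creates the device copy
  y = 1.0                  ! host-only assignment; the device copy is not updated
  !$acc update device(y)   ! explicit host-to-device transfer

  !$acc parallel loop present(y)
  do i = 1, n
     y(i) = y(i) + 2.0
  end do

  !$acc update self(y)     ! bring the results back to the host

  ! Every element should now be 1.0 + 2.0
  if (any(abs(y - 3.0) > 1.0e-6)) error stop "unexpected result"
  write(*,*) "y(1)=", y(1)

  deallocate(y)
end program demo
```

Built without -acc, the directives are treated as comments and the program runs entirely on the host, so it should pass its own check either way. Built with -acc, removing the update device line should reproduce the kind of wrong result shown in the question, because the kernel would then read a device copy that was never initialized.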