I went through the OpenACC 2.6 supported features with PGI compilers, and encountered an issue with the memory management between CPU and GPU.
The following Fortran code is a modified version of an example from the official documentation:
module data
  integer, parameter :: maxl = 100000
  real, dimension(maxl) :: xstat
  real, dimension(:), allocatable :: yalloc
  !$acc declare create(xstat,yalloc)
end module

module useit
  use data
contains
  subroutine compute(n)
    integer :: n
    integer :: i
    !$acc parallel loop present(yalloc)
    do i = 1, n
      yalloc(i) = iprocess(i)
    enddo
  end subroutine

  real function iprocess(i)
    !$acc routine seq
    integer :: i
    iprocess = yalloc(i) + 2*xstat(i)
  end function
end module

program main
  use data
  use useit
  implicit none
  integer :: nSize = 100
  !---------------------------------------------------------------------------
  call allocit(nSize)
  call initialize
  call compute(nSize)
  !$acc update self(yalloc)
  write(*,*) "yalloc(10)=",yalloc(10) ! should be 3
  call finalize
contains
  subroutine allocit(n)
    integer :: n
    allocate(yalloc(n))
  end subroutine allocit

  subroutine initialize
    xstat = 1.0
    yalloc = 1.0
    !$acc enter data copyin(xstat,yalloc)
  end subroutine initialize

  subroutine finalize
    deallocate(yalloc)
  end subroutine finalize
end program main
This code compiles with nvfortran:
nvfortran -Minfo test.f90
and it shows the expected value on CPU:
yalloc(10)= 3.000000
However, when compiled with OpenACC:
nvfortran -acc -Minfo test.f90
the code does not show the correct output:
upload CUDA data device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data device=0 threadid=1 variable=.attach. bytes=8
upload CUDA data file=/home/yang/GPU-Collection/openacc/basics/globalArray.f90 function=initialize line=55 device=0 threadid=1 variable=.attach. bytes=8
launch CUDA kernel file=/home/yang/GPU-Collection/openacc/basics/globalArray.f90 function=compute line=14 device=0 threadid=1 num_gangs=1 num_workers=1 vector_length=128 grid=1 block=128
download CUDA data file=/home/yang/GPU-Collection/openacc/basics/globalArray.f90 function=main line=41 device=0 threadid=1 variable=yalloc bytes=400
yalloc(10)= 0.000000
I have tried to add some explicit memory movement in several places, but nothing helps. This is really confusing to me.
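The problem is in your initialize routine. Replace the enter data directive with an update directive:
subroutine initialize
  xstat = 1.0
  yalloc = 1.0
  ! Disabled: both arrays are already present because of "declare create"
  ! !$acc enter data copyin(xstat,yalloc)
  !$acc update device(xstat,yalloc)
end subroutine initialize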
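Because xstat and yalloc are already present on the device (the declare create directive places them in a data region), the copyin clause on the later enter data directive transfers nothing. A copyin clause only copies data that is not yet present, so here it merely increments the reference counter. The device copies therefore never receive the values assigned on the host. To synchronize data that is already present, use an update directive instead.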
With this change, the code produces the correct answer:
% nvfortran test.f90 -acc -Minfo=accel; a.out
compute:
     14, Generating Tesla code
         15, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
iprocess:
     19, Generating acc routine seq
         Generating Tesla code
main:
     41, Generating update self(yalloc(:))
initialize:
     56, Generating update device(yalloc(:),xstat(:))
yalloc(10)= 3.000000
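The pattern can be condensed into a standalone sketch. The names demo_data, update_demo, and a are hypothetical and chosen only for illustration. It assumes, as in the code above, that an allocatable module array listed in declare create gets its device copy when the allocate statement runs, so host values have to be pushed with update device and pulled back with update self.

```fortran
module demo_data
  implicit none
  ! Hypothetical array; "declare create" makes it present on the device
  real, dimension(:), allocatable :: a
  !$acc declare create(a)
end module demo_data

program update_demo
  use demo_data
  implicit none
  integer :: i

  allocate(a(4))          ! assumed to create the (uninitialized) device copy as well
  a = 1.0                 ! host-only assignment
  !$acc update device(a)  ! push host values to the already-present device copy

  !$acc parallel loop present(a)
  do i = 1, size(a)
    a(i) = a(i) + 2.0
  end do

  !$acc update self(a)    ! pull the results back to the host
  write(*,*) "a(1)=", a(1)
  deallocate(a)
end program update_demo
```

Built with or without -acc, a(1) should come out as 3.0, since each element starts at 1.0 and the loop adds 2.0. Removing the update device line reproduces the original symptom when offloading, because the kernel then reads device memory that was never initialized.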