I'm attempting to parallelize some Fortran 90 code using OpenACC, where a parallelized loop calls a sequential routine. When I attempt to run the code using the PGI Fortran compiler (2020.4), I obtain an error message saying that reference argument passing prevents parallelization.
My understanding is that this is likely because one routine exists on the Host while the other is on the Device, but I'm unclear on where I might be missing a pragma that would lead to this outcome.
The basic structure of the calling routine is:
subroutine OuterRoutine(F,G,X,Y)
    real(wp), dimension(:,:), intent(IN) :: X
    real(wp), dimension(:,:), intent(IN) :: Y
    real(wp), dimension(1,PT), intent(OUT) :: F
    real(wp), dimension(N_p,PT), intent(OUT) :: G
    ! Local Variables
    integer :: t, i, j
    !$acc data copyin(X,Y), copyout(F,G)
    !$acc parallel loop
    do t = 1,PT,1
        !$acc loop collapse(2) reduction(+:intr)
        do i = 1,N_int-1,1
            do j = 1,N_int-1,1
                G(i,j) = intgrdJ2(X(i,j),X(j,i),Y(i,j),Y(j,i),t)
            end do
        end do
    end do
    !$acc end parallel loop
    !$acc end data
end subroutine OuterRoutine
And the function being called is:
function intgrdJ2(z,mu,p,q,t)
    !$acc routine seq
    real(wp), intent(IN) :: z, mu, p, q
    integer, intent(IN) :: t
    real(wp) :: intgrdJ2
    ! Local Variables
    real(wp) :: mu2
    real(wp), dimension(N_p) :: nu_m2, psi_m2
    integer :: i
    mu2 = (mu*fh_pdf(z,mu,p))/f_pdf(z,mu,p)
    do i = 1,N_p,1
        nu_m2(i) = interpValue(mu2,mugrid,nu_knots(:,i,t))
        psi_m2(i) = interpValue(mu2,mugrid,psi_knots(:,i,t))
    end do
    intgrdJ2 = nu_m2(i)*psi_m2(i)
end function intgrdJ2
The routines interpValue, fh_pdf, and f_pdf are all contained in a used module, and each is marked with !$acc routine seq. The variables mugrid, nu_knots, and psi_knots are all module-level variables, which are copied in to the Device prior to calling OuterRoutine.
When I run the code, I get this sort of output from the compiler:
intgrdj2:
576, Generating acc routine seq
Generating Tesla code
593, Reference argument passing prevents parallelization: mu2
Where 593 refers to the "nu_m2(i) = ..." line.
My understanding is that since the variable mu2 is a scalar declared inside of the sequential routine, each thread should have its own copy of the variable, and I don't need to explicitly declare it to be private when I declare the data region. From reading this post it seems that the problem may be related to where the routines are located (Host vs Device). However, it seems as though all of the relevant pieces should be on the device because I'm specifying that the routines are sequential.
As a first-time OpenACC user, any explanations about what I might be overlooking would be greatly appreciated!
"My understanding is that since the variable mu2 is a scalar declared inside of the sequential routine, each thread should have its own copy of the variable, and I don't need to explicitly declare it to be private when I declare the data region."
This is true in most cases. What's likely happening here is that, because Fortran passes arguments by reference by default, the compiler must assume that the called routine could store a reference to mu2 in a module variable, in which case mu2 would no longer be private to each thread. Unlikely, but possible.
The typical way to fix this is to pass the scalar by value, i.e. add the "value" attribute to the corresponding dummy argument declaration in "interpValue". Alternatively, you can explicitly privatize "mu2" by adding "!$acc loop seq private(mu2)" on the "i" loop.
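To illustrate the "value" approach, here is a minimal sketch. I don't have your actual "interpValue", so the body below is a hypothetical piecewise-linear interpolation, and the module name "interp_sketch", the argument names, and the test grid are all assumptions made for the example. The only part that matters is the declaration of the first dummy argument. Note that "value" is a Fortran 2003 attribute and needs an explicit interface, which you already have since the routine lives in a module.

```fortran
module interp_sketch
    implicit none
    integer, parameter :: wp = kind(1.0d0)
contains
    ! Hypothetical stand-in for interpValue (the real body is not shown
    ! in the question): piecewise-linear interpolation on an ascending grid.
    function interpValue(x, grid, knots) result(y)
        !$acc routine seq
        real(wp), value      :: x         ! passed by value, not by reference
        real(wp), intent(IN) :: grid(:)   ! assumed sorted, size >= 2
        real(wp), intent(IN) :: knots(:)  ! values at the grid points
        real(wp) :: y
        integer  :: k, n
        n = size(grid)
        ! Find the interval [grid(k), grid(k+1)] that brackets x.
        k = 1
        do while (k < n - 1 .and. x > grid(k+1))
            k = k + 1
        end do
        y = knots(k) + (knots(k+1) - knots(k)) * (x - grid(k)) &
                       / (grid(k+1) - grid(k))
    end function interpValue
end module interp_sketch

program demo
    use interp_sketch
    implicit none
    real(wp) :: grid(3), knots(3), mu2
    grid  = [0.0_wp, 1.0_wp, 2.0_wp]
    knots = [0.0_wp, 10.0_wp, 20.0_wp]
    mu2   = 0.5_wp
    ! The call site is unchanged; only the callee's declaration differs.
    print *, interpValue(mu2, grid, knots)
end program demo
```

With "value", the callee works on its own copy of the scalar, so there is no reference to "mu2" for the compiler to worry about. Without OpenACC enabled, the "!$acc" line is just a comment, so the sketch also builds as plain Fortran.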
That said, the message may just be indicating that the compiler can't auto-parallelize this loop. Since the loop is in a sequential routine, that wouldn't matter, and you can safely ignore the message. I don't have the full context, though, so I can't be 100% certain of this.