[SOLVED] A problem in calling several gpu subroutines sequentially: OpenACC

A problem in calling several gpu subroutines sequentially: OpenACC - Fortran

I have the following problem. I have a main subroutine, let us call it main_function (for 3D B-splines). It takes several tensors as input.

This function contains only IF-conditions. If a condition is satisfied, another function is called. Let us call these functions function_a, function_b, and function_c; all of them are parallelizable.

The structure is as follows:

subroutine main_function(paras)
  if (1) then
    call function_a
  else if (2) then
    call function_b
  else if (3) then
    call function_c
  end if
end subroutine main_function

with

subroutine function_a(paras)
  !$acc parallel loop present(....)
  do
    ! heavy parallel calcs
  end do
  ! output: eta
end subroutine function_a

subroutine function_b(paras)
  !$acc parallel loop present(....)
  do
    ! heavy parallel calcs
  end do
  ! output: eta
end subroutine function_b

subroutine function_c(paras)
  !$acc parallel loop present(....)
  do
    ! heavy parallel calcs
  end do
  ! output: eta
end subroutine function_c

The subroutines function_a, function_b, and function_c have a B-spline tensor (eta) as an output calculated on GPU. I don't want to move this tensor to the host since it is not needed there. However, after calculating eta on GPU using main_function, an interpolation subroutine interpolate3D is called to interpolate the function. The definition of interpolate3D is something like

subroutine interpolate3D(eta, x, y, z, fAtxyz)
  !$acc routine seq
  ! interpolate ...
end subroutine interpolate3D

To summarize, the pseudo-code is something like

call main_function(paras) 

!$acc parallel loop present(x, y, z, eta, fAtxyz)
do i = 1, N
  call interpolate3D(eta, x(i), y(i), z(i), fAtxyz(i))
end do

My problems and questions are:

1)- When I don't use '!$acc update self (eta)' before the loop, the results are completely wrong. Does this mean that the 'present' clause does not correctly find eta, which main_function calculated on the GPU, so that one needs to update the host and then copy eta back to the GPU?

2)- How can I ensure that interpolate3D runs on the GPU? For example, if I don't have the above loop, is adding '!$acc routine seq' alone enough to make it run on the GPU and look for its data there?

3)- In fact, when there is no loop, adding '!$acc update self (eta)' is required to get correct results. Does this mean that in this case the subroutine is executed on the CPU?

4)- To summarize: if I have two subroutines, where the first chooses between different subroutines based on if-conditions to calculate a vector or tensor and keeps it on the GPU (I don't want to update the host), and the second uses this vector to perform some calculations on the GPU, how do I do this correctly with OpenACC?

Sorry for the long post, and thank you very much for your help.

In fact, I have tried different strategies. However, all of them require copying eta to the host before interpolating, even though it is only calculated on the device. There is something I don't understand, since I'm also new to OpenACC.

Solution

Cross-posted on NVIDIA's Forum: https://forums.developer.nvidia.com/t/b-splines-on-gpus-openacc-fortran/233053

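The issue was an error in the user's code: a "parallel loop" directive was missing, so the loop was being run on the host rather than on the device. That is why the results were only correct after updating the host copy of eta.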
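
For reference, here is a minimal sketch of the pattern the question asks about: eta is created inside a data region, filled on the device by whichever subroutine main_function selects, and then consumed by a device routine called from a "parallel loop", with no '!$acc update self (eta)' in between. This is not the user's actual code. The module name eta_pipeline, the driver run_demo, the integer mode argument, the formulas used to fill eta, and the 1D linear interpolation routine interpolate1d (a stand-in for interpolate3D) are all assumptions made for illustration.

```fortran
module eta_pipeline
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Stand-in for function_a: fills eta on the device (formula is hypothetical).
  subroutine function_a(eta, n)
    integer, intent(in) :: n
    real(dp), intent(inout) :: eta(n)
    integer :: i
    !$acc parallel loop present(eta)
    do i = 1, n
      eta(i) = 2.0_dp * real(i, dp)
    end do
  end subroutine function_a

  ! Stand-in for function_b: a different fill, also on the device (hypothetical).
  subroutine function_b(eta, n)
    integer, intent(in) :: n
    real(dp), intent(inout) :: eta(n)
    integer :: i
    !$acc parallel loop present(eta)
    do i = 1, n
      eta(i) = -real(i, dp)
    end do
  end subroutine function_b

  ! Host-side dispatcher: only IF-conditions, no compute region of its own.
  subroutine main_function(mode, eta, n)
    integer, intent(in) :: mode, n
    real(dp), intent(inout) :: eta(n)
    if (mode == 1) then
      call function_a(eta, n)
    else
      call function_b(eta, n)
    end if
  end subroutine main_function

  ! Stand-in for interpolate3D: 1D linear interpolation (an assumption).
  ! 'routine seq' only makes a device version available; the routine runs
  ! on the device only when it is called from inside a compute region.
  subroutine interpolate1d(eta, n, x, f)
    !$acc routine seq
    integer, intent(in) :: n
    real(dp), intent(in) :: eta(n), x
    real(dp), intent(out) :: f
    integer :: k
    real(dp) :: t
    k = min(max(int(x), 1), n - 1)
    t = x - real(k, dp)
    f = (1.0_dp - t) * eta(k) + t * eta(k + 1)
  end subroutine interpolate1d

  ! Hypothetical driver: eta is never copied to the host.
  subroutine run_demo(max_err)
    real(dp), intent(out) :: max_err
    integer, parameter :: n = 64, m = 100
    real(dp) :: eta(n), x(m), f(m)
    integer :: i
    do i = 1, m
      x(i) = 1.0_dp + 0.5_dp * real(i - 1, dp)
    end do
    !$acc data create(eta) copyin(x) copyout(f)
    call main_function(1, eta, n)
    ! Without this directive the loop below would run on the host and read
    ! the host copy of eta, which was never filled.
    !$acc parallel loop present(eta, x, f)
    do i = 1, m
      call interpolate1d(eta, n, x(i), f(i))
    end do
    !$acc end data
    ! With eta(i) = 2*i, linear interpolation should reproduce 2*x.
    max_err = maxval(abs(f - 2.0_dp * x))
  end subroutine run_demo

end module eta_pipeline
```

Calling run_demo from a small program and checking that max_err is close to zero exercises the whole chain. Without OpenACC enabled the directives are treated as comments and everything runs on the host, so the result should be the same. With the NVIDIA compiler, building with -acc -Minfo=accel should report whether each loop was actually turned into a device kernel.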