I am accelerating a Fortran code with OpenACC. When I profiled the program with NVIDIA Nsight, I noticed that the first call of a kernel with a copyout clause exhibited a long call to cuMemToHostAlloc.
Here is a trivial example illustrating this. The program launches the same kernel 10 times in succession; each launch computes an array test and copies it back to the host:
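    ! Program renamed from "test": the Fortran standard does not allow the
    ! program name to match a local variable (here the array "test").
    program test_copyout
      implicit none
      real, allocatable :: test(:)
      integer :: i, j, n, m

      n = 1000
      m = 10
      allocate(test(n))

      ! Launch the same kernel m times; each launch copies test back to the host.
      do j = 1, m
        !$acc kernels copyout(test)
        !$acc loop independent
        do i = 1, n
          test(i) = real(i)
        end do
        !$acc end kernels
      end do

      deallocate(test)
    end program test_copyout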
The code is compiled with NVHPC 22.7 without any optimization flags (adding such flags made no difference). The profiling of the code gives:
Compared to the actual memory transfer time, as seen for the 9 other calls, the call to cuMemToHostAlloc is ridiculously long.
If I remove the copyout clause, the call to cuMemToHostAlloc disappears, so this is related to copying back data from the device, but I do not understand why it only happens once and for so long.
Also, the test array is already allocated in host memory.
Am I missing something?
It's the call that creates the pinned memory buffers used to transfer data between the host and the device. DMA transfers must use non-swappable, i.e. pinned, memory.
We use a double-buffering system: while one buffer is being filled from virtual (pageable) memory, the second buffer is transferred asynchronously to the device. This effectively hides much of the cost of the virtual-to-pinned memory copy.
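To make the buffer bookkeeping concrete, here is a minimal host-only sketch of the double-buffering idea. It is not the runtime's actual implementation: the array and chunk sizes are made up, `staging` stands in for the two pinned buffers, and `device` is an ordinary host array standing in for device memory. The two steps run one after the other here, whereas the real runtime overlaps them.

```fortran
program double_buffer_sketch
  implicit none
  integer, parameter :: n = 10, chunk = 4   ! hypothetical sizes
  real :: pageable(n)        ! source data in ordinary (pageable) host memory
  real :: device(n)          ! host stand-in for device memory (assumption)
  real :: staging(chunk, 2)  ! stand-ins for the two pinned buffers
  integer :: i, lo, hi, len, cur, prev_lo, prev_len

  pageable = [(real(i), i = 1, n)]
  device = 0.0
  cur = 1
  prev_lo = 1
  prev_len = 0

  do lo = 1, n, chunk
    hi = min(lo + chunk - 1, n)
    len = hi - lo + 1

    ! Step A: fill the current staging buffer from pageable memory.
    staging(1:len, cur) = pageable(lo:hi)

    ! Step B: "transfer" the buffer filled in the previous iteration.
    ! In a real runtime this DMA would be in flight while step A runs.
    if (prev_len > 0) then
      device(prev_lo:prev_lo + prev_len - 1) = staging(1:prev_len, 3 - cur)
    end if

    prev_lo = lo
    prev_len = len
    cur = 3 - cur            ! swap buffers: 1 -> 2 -> 1 -> ...
  end do

  ! Drain: the last filled buffer has not been transferred yet.
  if (prev_len > 0) then
    device(prev_lo:prev_lo + prev_len - 1) = staging(1:prev_len, 3 - cur)
  end if

  ! Check that every element arrived.
  print *, all(device == pageable)
end program double_buffer_sketch
```

Because one buffer is always free to be filled while the other is being sent, two pinned buffers suffice regardless of the array size, which is why they only need to be allocated once.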
The host pinned memory allocation is relatively expensive, but it only occurs once, when the runtime first encounters a data region, so the cost is amortized over the rest of the run.
Note that by removing the copyout, you remove the need to transfer the data, and hence the need for the buffers.
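If you want to see the one-time cost separately from the steady-state transfers, one option is to time each launch and put a small warm-up transfer ahead of the loop. The sketch below is only an illustration: the `dummy` array and the warm-up region are hypothetical additions, and whether such a tiny copyout is enough to trigger the buffer allocation is an assumption you would need to confirm in the profiler. The warm-up will also absorb the general device initialization cost.

```fortran
program warmup_sketch
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  real, allocatable :: test(:)
  real :: dummy(1)                  ! hypothetical array used only for warm-up
  integer :: i, j, n, m
  integer(i8) :: t0, t1, rate

  n = 1000
  m = 10
  allocate(test(n))

  ! Warm-up (assumption): a first data region with a copyout, so that any
  ! one-time runtime setup happens here rather than in the loop below.
  call system_clock(t0, rate)
  !$acc kernels copyout(dummy)
  dummy(1) = 0.0
  !$acc end kernels
  call system_clock(t1)
  print '(a,f10.6,a)', 'warm-up: ', real(t1 - t0) / real(rate), ' s'

  ! Same kernel as in the question, timed per launch.
  do j = 1, m
    call system_clock(t0)
    !$acc kernels copyout(test)
    !$acc loop independent
    do i = 1, n
      test(i) = real(i)
    end do
    !$acc end kernels
    call system_clock(t1)
    print '(a,i0,a,f10.6,a)', 'call ', j, ': ', real(t1 - t0) / real(rate), ' s'
  end do

  deallocate(test)
end program warmup_sketch
```

In a real application the allocation happens once per run anyway, so a warm-up like this mainly helps when you want clean per-kernel timings; it does not reduce the total cost.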