In a Fortran program, I have a large loop with several dot_product
calls on small vectors generated within the loop:
program test
  implicit none
  real :: array1(2, 2), array2(2, 2), res(2)
  real :: subarray1(2), subarray2(2)
  integer :: i
  array1 = 1
  array2 = 2
  !$acc data copyin(array1, array2) copyout(res)
  !$acc kernels
  !$acc loop independent private(subarray1, subarray2)
  do i = 1, 2
    subarray1(:) = array1(:, i)
    subarray2(:) = array2(:, i)
    res(i) = dot_product(subarray1, subarray2)
  enddo
  !$acc end kernels
  !$acc end data
  print "(2(g0, x))", res
endprogram
When compiled with the PGI compiler, it seems that the accelerated implementation of dot_product
uses accelerated loops of its own, which prevents the main loop from being parallelized across both gang and vector:
test:
     11, Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     14, Loop is parallelizable
         Generating Tesla code
         14, !$acc loop gang ! blockidx%x
         15, !$acc loop vector(32) ! threadidx%x
         17, !$acc loop vector(32) ! threadidx%x
             Generating implicit reduction(+:subarray1$r)
     14, CUDA shared memory used for subarray2,subarray1
     15, Loop is parallelizable
     17, Loop is parallelizable
As the compiler feedback shows, it generates an implicit reduction and places the loop-private vectors in CUDA shared memory.
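One way to read this feedback is sketched below with explicit loops. This is only an interpretation of the messages, not the code the compiler actually generates. The names test_expanded, j and total are hypothetical, and fusing the two array assignments into a single inner loop is an assumption based on the feedback listing only one vector loop for them.

```fortran
program test_expanded
  implicit none
  real :: array1(2, 2), array2(2, 2), res(2)
  real :: subarray1(2), subarray2(2)
  real :: total        ! hypothetical accumulator standing in for dot_product
  integer :: i, j      ! j is a hypothetical inner-loop index
  array1 = 1
  array2 = 2
  !$acc data copyin(array1, array2) copyout(res)
  !$acc kernels
  ! The outer loop only gets gang parallelism ...
  !$acc loop gang private(subarray1, subarray2, total)
  do i = 1, 2
    ! ... because the array assignments become a vector loop of length 2
    ! (assumed here to be fused into one loop)
    !$acc loop vector
    do j = 1, 2
      subarray1(j) = array1(j, i)
      subarray2(j) = array2(j, i)
    enddo
    ! ... and dot_product becomes a vector loop with a sum reduction
    total = 0.0
    !$acc loop vector reduction(+:total)
    do j = 1, 2
      total = total + subarray1(j) * subarray2(j)
    enddo
    res(i) = total
  enddo
  !$acc end kernels
  !$acc end data
  print "(2(g0, 1x))", res
endprogram test_expanded
```

With vectors of length 2, the inner vector loops leave most of the 32 vector lanes idle, which is why this schedule is a poor fit here. Each element of res should come out as 4, since 1*2 + 1*2 = 4.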
Is there a way to force dot_product to run sequentially?
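As long as you don't mind the array-syntax assignments running sequentially as well, just add "gang vector" to the loop directive. The outer loop is then scheduled across both gangs and vectors, while the inner loops generated for the array assignments and dot_product run sequentially within each thread: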
% cat test.f90
program test
  implicit none
  real :: array1(2, 2), array2(2, 2), res(2)
  real :: subarray1(2), subarray2(2)
  integer :: i
  array1 = 1
  array2 = 2
  !$acc data copyin(array1, array2) copyout(res)
  !$acc kernels loop gang vector private(subarray1, subarray2)
  do i = 1, 2
    subarray1(:) = array1(:, i)
    subarray2(:) = array2(:, i)
    res(i) = dot_product(subarray1, subarray2)
  enddo
  !$acc end data
  print "(2(g0, x))", res
endprogram
% nvfortran -acc -Minfo=accel test.f90
test:
     11, Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     13, Loop is parallelizable
         Generating Tesla code
         13, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
         14, !$acc loop seq
         16, !$acc loop seq
     13, Local memory used for subarray2,subarray1
     14, Loop is parallelizable
     16, Loop is parallelizable
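If the temporary vectors are not needed for anything else, a possible alternative (a sketch only, not something suggested above or checked against the compiler feedback) is to write the dot product as an explicit inner loop marked "loop seq" with a scalar accumulator, which avoids the private arrays entirely. The names test_seq, j and total are hypothetical.

```fortran
program test_seq
  implicit none
  real :: array1(2, 2), array2(2, 2), res(2)
  real :: total        ! hypothetical per-iteration accumulator
  integer :: i, j      ! j is a hypothetical inner-loop index
  array1 = 1
  array2 = 2
  !$acc data copyin(array1, array2) copyout(res)
  ! Outer loop is spread over gangs and vectors, as in the answer above
  !$acc kernels loop gang vector private(total)
  do i = 1, 2
    total = 0.0
    ! Explicitly request sequential execution of the short inner loop
    !$acc loop seq
    do j = 1, 2
      total = total + array1(j, i) * array2(j, i)
    enddo
    res(i) = total
  enddo
  !$acc end data
  print "(2(g0, 1x))", res
endprogram test_seq
```

Each element of res should again be 4. Whether this is any faster than the "gang vector" version with private arrays would need to be measured on the real loop.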