fortran openacc pgi-accelerator

Sequential dot_product in OpenACC Fortran loop


In a Fortran program, I have a large loop with several dot_product calls on small vectors generated within the loop:

program test
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: subarray1(2), subarray2(2)
        integer :: i

        array1 = 1
        array2 = 2

        !$acc data copyin(array1, array2) copyout(res)
        !$acc kernels
        !$acc loop independent private(subarray1, subarray2)
        do i = 1, 2
                subarray1(:) = array1(:, i)
                subarray2(:) = array2(:, i)
                res(i) = dot_product(subarray1, subarray2)
        enddo
        !$acc end kernels
        !$acc end data

        print "(2(g0, x))", res
endprogram

When compiled with the PGI compiler, the accelerated implementation of dot_product appears to use accelerated loops of its own, which prevents the main loop from being scheduled across both gang and vector:

test:
     11, Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     14, Loop is parallelizable
         Generating Tesla code
         14, !$acc loop gang ! blockidx%x
         15, !$acc loop vector(32) ! threadidx%x
         17, !$acc loop vector(32) ! threadidx%x
             Generating implicit reduction(+:subarray1$r)
     14, CUDA shared memory used for subarray2,subarray1
     15, Loop is parallelizable
     17, Loop is parallelizable

As the log shows, the compiler generates an implicit reduction and places the loop-private vectors in CUDA shared memory.
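To make the log easier to read, here is a sketch of the same loop body with the array syntax and dot_product written out as explicit loops. This is only an interpretation of the compiler feedback, not compiler output; the names `j` and `total` are introduced for illustration. The directives are left out so the correspondence with the log stays visible, and the final check is there only to confirm the rewrite computes the same values.

```fortran
program test_explicit
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: subarray1(2), subarray2(2)
        real :: total
        integer :: i, j

        array1 = 1
        array2 = 2

        do i = 1, 2
                ! The array assignments form one implicit element loop
                ! (reported at source line 15 in the log above).
                do j = 1, 2
                        subarray1(j) = array1(j, i)
                        subarray2(j) = array2(j, i)
                enddo
                ! dot_product is a second implicit loop, a sum reduction
                ! (reported at source line 17 in the log above).
                total = 0
                do j = 1, 2
                        total = total + subarray1(j) * subarray2(j)
                enddo
                res(i) = total
        enddo

        ! Each column gives 1*2 + 1*2 = 4.
        if (any(res /= 4.0)) error stop "unexpected result"
        print "(2(g0, 1x))", res
endprogram
```

Seen this way, the compiler gave the vector level to the two inner loops, which left only gang for the loop over `i`.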

Is there a way to force dot_product to run sequentially?


Solution

  • Is there a way to force dot_product to run sequentially?

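    So long as you don't mind the array syntax being run sequentially as well, just add "gang vector" to the loop directive. The outer loop then takes both levels of parallelism, so the compiler schedules the implicit inner loops as "seq" (lines 14 and 16 in the compiler output below) and keeps the private arrays in local memory instead of shared memory.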

    % cat test.f90
    program test
            implicit none
    
            real :: array1(2, 2), array2(2, 2), res(2)
            real :: subarray1(2), subarray2(2)
            integer :: i
    
            array1 = 1
            array2 = 2
    
            !$acc data copyin(array1, array2) copyout(res)
            !$acc kernels loop gang vector private(subarray1, subarray2)
            do i = 1, 2
                    subarray1(:) = array1(:, i)
                    subarray2(:) = array2(:, i)
                    res(i) = dot_product(subarray1, subarray2)
            enddo
            !$acc end data
    
            print "(2(g0, x))", res
    endprogram
    % nvfortran -acc -Minfo=accel test.f90
    test:
         11, Generating copyin(array1(:,:)) [if not already present]
             Generating copyout(res(:)) [if not already present]
             Generating copyin(array2(:,:)) [if not already present]
         13, Loop is parallelizable
             Generating Tesla code
             13, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             14, !$acc loop seq
             16, !$acc loop seq
         13, Local memory used for subarray2,subarray1
         14, Loop is parallelizable
         16, Loop is parallelizable
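
If the private work arrays exist only to feed dot_product, another option is to write the dot product as an explicit inner loop marked `!$acc loop seq`, accumulating into a private scalar. This is a minimal sketch under that assumption, not part of the tested listing above; the names `total` and `j` are hypothetical, and the compiler feedback for this variant has not been checked here. Without `-acc` the directives are treated as comments, so the program also runs as plain Fortran.

```fortran
program test_seq_dot
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: total
        integer :: i, j

        array1 = 1
        array2 = 2

        !$acc data copyin(array1, array2) copyout(res)
        !$acc kernels loop gang vector private(total)
        do i = 1, 2
                ! A scalar accumulator replaces the private work arrays.
                total = 0
                ! Explicitly sequential inner loop: each iteration of i
                ! computes one complete dot product on its own.
                !$acc loop seq
                do j = 1, 2
                        total = total + array1(j, i) * array2(j, i)
                enddo
                res(i) = total
        enddo
        !$acc end data

        ! Each column gives 1*2 + 1*2 = 4.
        if (any(res /= 4.0)) error stop "unexpected result"
        print "(2(g0, 1x))", res
endprogram
```

The trade-off is readability: the intrinsic states the intent more directly, while the explicit loop makes the sequential schedule part of the source rather than something inferred by the compiler.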