
Sequential dot_product in OpenACC Fortran loop

In a Fortran program, I have a large loop with several dot_product calls on small vectors generated within the loop:

program test
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: subarray1(2), subarray2(2)
        integer :: i

        array1 = 1
        array2 = 2

        !$acc data copyin(array1, array2) copyout(res)
        !$acc kernels
        !$acc loop independent private(subarray1, subarray2)
        do i = 1, 2
                subarray1(:) = array1(:, i)
                subarray2(:) = array2(:, i)
                res(i) = dot_product(subarray1, subarray2)
        end do
        !$acc end kernels
        !$acc end data

        print "(2(g0, x))", res
end program test

When compiled with the PGI compiler, it seems that the accelerated implementation of dot_product uses accelerated loops of its own, which prevents the main loop from being parallelized across both gang and vector:

     11, Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     14, Loop is parallelizable
         Generating Tesla code
         14, !$acc loop gang ! blockidx%x
         15, !$acc loop vector(32) ! threadidx%x
         17, !$acc loop vector(32) ! threadidx%x
             Generating implicit reduction(+:subarray1$r)
     14, CUDA shared memory used for subarray2,subarray1
     15, Loop is parallelizable
     17, Loop is parallelizable

As the log shows, the compiler generates an implicit reduction for dot_product and places the loop-private vectors in CUDA shared memory.

Is there a way to force dot_product to run sequentially?


  • Is there a way to force dot_product to run sequentially?

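    So long as you don't mind the array syntax being run sequentially as well, just add "gang vector" to the loop directive. With both levels of parallelism assigned to the outer loop, the compiler runs the implicit loops (the array assignments and dot_product) sequentially and keeps the private arrays in local memory, as the "loop seq" and "Local memory used" lines in the output show.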

    % cat test.f90
    program test
            implicit none

            real :: array1(2, 2), array2(2, 2), res(2)
            real :: subarray1(2), subarray2(2)
            integer :: i

            array1 = 1
            array2 = 2

            !$acc data copyin(array1, array2) copyout(res)
            !$acc kernels loop gang vector private(subarray1, subarray2)
            do i = 1, 2
                    subarray1(:) = array1(:, i)
                    subarray2(:) = array2(:, i)
                    res(i) = dot_product(subarray1, subarray2)
            end do
            !$acc end data

            print "(2(g0, x))", res
    end program test
    % nvfortran -acc -Minfo=accel test.f90
         11, Generating copyin(array1(:,:)) [if not already present]
             Generating copyout(res(:)) [if not already present]
             Generating copyin(array2(:,:)) [if not already present]
         13, Loop is parallelizable
             Generating Tesla code
             13, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             14, !$acc loop seq
             16, !$acc loop seq
         13, Local memory used for subarray2,subarray1
         14, Loop is parallelizable
         16, Loop is parallelizable
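
If you would rather not depend on how the compiler schedules the intrinsic, one alternative is to write the dot product as an explicit inner loop and mark it "seq" yourself. The following is only a minimal sketch under the same assumptions as the question (2x2 arrays, one dot product per column). The program name test_seq, the accumulator acc_sum, and the index j are names introduced here for illustration and are not part of the original code. I have not checked what -Minfo reports for this variant, so treat it as something to try rather than a guaranteed result.

```fortran
program test_seq
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: acc_sum          ! hypothetical scalar accumulator (not in the original)
        integer :: i, j

        array1 = 1
        array2 = 2

        !$acc data copyin(array1, array2) copyout(res)
        ! The outer loop is given both gang and vector parallelism.
        !$acc parallel loop gang vector private(acc_sum)
        do i = 1, 2
                acc_sum = 0.0
                ! The inner loop is explicitly sequential within each thread.
                !$acc loop seq
                do j = 1, 2
                        acc_sum = acc_sum + array1(j, i) * array2(j, i)
                end do
                res(i) = acc_sum
        end do
        !$acc end data

        print "(2(g0, 1x))", res
end program test_seq
```

Because the accumulator is a scalar, this version also avoids the private temporary arrays altogether. The directives are plain comments to a compiler without OpenACC enabled, so the same source can be built and checked on the host first.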