Tags: parallel-processing, fortran, openmp, do-loops

How to properly choose which variables should be private/shared/reduced in nested do-loops in Fortran with OpenMP?


I have the following subroutine, which is the stepper function in a bigger DEM program. It computes every interaction between particles i and j, then updates the forces. As a first effort, I am trying to parallelize this brute-force O(N^2) loop before moving on to fancier search algorithms.

But I can't find a way to make it work. I know it is due to variables being modified by different threads, but I don't know how to deal with that properly. One possibility would be to redefine all my arrays as matrices, but I still don't know if this is the correct way.

Also, note that I use include files for my variables, since I have a lot of variables that need to be accessed and modified by many subroutines, and it felt like the easy way to make that work; I am open to suggestions.

Here is my subroutine:

subroutine stepper (tstep)

    use omp_lib

    implicit none

    include "parameter.h"
    include "CB_variables.h"
    include "CB_const.h"
    include "CB_bond.h"
    include "CB_forcings.h"

    integer :: i, j
    integer, intent(in) :: tstep

    ! reinitialize force arrays for contact and bonds
    do i = 1, n
        mc(i)  = 0d0
        mb(i)  = 0d0
        fcx(i) = 0d0
        fcy(i) = 0d0
        fbx(i) = 0d0
        fby(i) = 0d0
    end do
    
    ! place yourself in the reference frame of the ith particle
    ! loop through all j particles and compute interactions

    !$omp parallel do schedule(dynamic) &
    !$omp private(i,j) &
    !$omp reduction(+:tfx,tfy,fcx,fcy,fbx,fby,m,mc,mb)
    do i = 1, n
        do j = i + 1, n

            ! compute relative position and velocity
            call rel_pos_vel (i, j)

            ! bond initialization
            if ( tstep .eq. 1 ) then
                if ( -deltan(i, j) .le. 5d-1 * r(i)) then ! can be fancier
                    !bond (i, j) = 1
                end if
                call bond_properties (i, j)
            end if

            ! verify if two particles are colliding
            if ( deltan(i,j) .gt. 0 ) then

                call contact_forces (i, j)
                !call bond_creation (i, j) ! to implement

                ! update contact force on particle i by particle j
                fcx(i) = fcx(i) - fcn(i,j) * cosa(i,j)
                fcy(i) = fcy(i) - fcn(i,j) * sina(i,j)

                ! update moment on particle i by particle j due to tangent contact
                mc(i) = mc(i) - r(i) * fct(i,j) - mcc(i,j)

                ! Newton's third law
                ! update contact force on particle j by particle i
                fcx(j) = fcx(j) + fcn(i,j) * cosa(i,j)
                fcy(j) = fcy(j) + fcn(i,j) * sina(i,j)

                ! update moment on particle j by particle i due to tangent contact
                mc(j) = mc(j) - r(j) * fct(i,j) + mcc(i,j)

            end if

            ! compute forces from bonds between particle i and j
            if ( bond (i, j) .eq. 1 ) then

                call bond_forces (i, j)
                !call bond_breaking (i, j)

                ! update force on particle i by particle j due to bond
                fbx(i) = fbx(i) - fbn(i,j) * cosa(i,j) +    &
                                    fbt(i,j) * sina(i,j)
                fby(i) = fby(i) - fbn(i,j) * sina(i,j) -    &
                                    fbt(i,j) * cosa(i,j)

                ! update moment on particle i by particle j due to bond
                mb(i) = mb(i) - r(i) * fbt(i,j) - mbb(i, j)

                ! Newton's third law
                ! update force on particle j by particle i due to bond
                fbx(j) = fbx(j) + fbn(i,j) * cosa(i,j) -    &
                                    fbt(i,j) * sina(i,j)
                fby(j) = fby(j) + fbn(i,j) * sina(i,j) +    &
                                    fbt(i,j) * cosa(i,j)
 
                
                ! update moment on particle j by particle i due to bond
                mb(j) = mb(j) - r(i) * fbt(i,j) + mbb(j, i)

            end if

            ! compute sheltering height for particle j on particle i for air and water drag
            call sheltering(i, j)

        end do

        ! compute the total forcing from winds, currents and coriolis on particle i
        call forcing (i)
        !call coriolis(i)

        ! reinitialize total force arrays before summing everything
        m(i)   = 0d0
        tfx(i) = 0d0
        tfy(i) = 0d0

        ! sum all forces together on particle i
        tfx(i) = fcx(i) + fbx(i) + fax(i) + fwx(i)
        tfy(i) = fcy(i) + fby(i) + fay(i) + fwy(i)

        ! sum all moments on particle i together
        m(i) =  mc(i) + mb(i) + ma(i) + mw(i)
    end do
    !$omp end parallel do

    ! forces on side particles for experiments
    call experiment_forces

    ! integration in time
    call velocity
    call euler

end subroutine stepper

I tried to put the different variables (the ones that are being changed: tfx, tfy, m, etc.) in the private clause, but without success. I also don't think I understand the reduction clause well. Also, I know that some constant arrays, like the radius r(i), are accessed by multiple threads, but I don't know how to deal with that.

Actually, when printing the number of threads in the parallel region (using omp_get_num_threads()), the output is 1, indicating that no parallelization is taking place.

Also, when trying to parallelize the first initialization loop as a first step towards understanding:

!$omp do private(i) schedule(dynamic) num_threads(10)
print*, omp_get_num_threads()
do i = 1, n
    mc(i)  = 0d0
    mb(i)  = 0d0
    fcx(i) = 0d0
    fcy(i) = 0d0
    fbx(i) = 0d0
    fby(i) = 0d0
end do
!$omp parallel do

it is still not working: the output of omp_get_num_threads() is 1, even though the number of threads is explicitly set to 10.
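
My understanding is that a worksharing !$omp do directive only distributes iterations when it is enclosed in a parallel region (and that num_threads is a clause of the parallel construct, not of do), so on its own the loop runs serially and omp_get_num_threads() returns 1. A self-contained version of this loop that I would expect to work is sketched below (assuming n and the arrays come from the include files as above):

!$omp parallel num_threads(10)
print*, omp_get_num_threads() ! printed once per thread; should report 10
!$omp do schedule(static)
do i = 1, n
    mc(i)  = 0d0
    mb(i)  = 0d0
    fcx(i) = 0d0
    fcy(i) = 0d0
    fbx(i) = 0d0
    fby(i) = 0d0
end do
!$omp end do
!$omp end parallel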

I am compiling with gfortran, using the flags -ffast-math -O3 -fopenmp.


Solution

  • As I wrote in a comment: before the end of the i do loop you are updating the tfx and tfy arrays, but only at the i index. Consequently, everything that has been accumulated at the j index does not contribute here. So you should update them either inside the j do loop (and also at the j index), or in a separate loop after the parallel loop.

    Here is a demonstration code:

    program foo
    implicit none
    
    integer, parameter :: N = 40
    integer :: s(N), t(N), i, j
    
    s(:) = 0
    t(:) = 0
    !$ call omp_set_num_threads(2)
    !$OMP PARALLEL DO PRIVATE(j) REDUCTION(+:s,t) SCHEDULE(dynamic,1)
    do i = 1, N
        do j = i+1, N
            s(i) = s(i)+1
            s(j) = s(j)+1
        end do
        t(i) = s(i)
    end do
    !$OMP END PARALLEL DO
    
    write(*,*) "s array:"
    write(*,"(10I5)") s(:)
    write(*,*) "t array:"
    write(*,"(10I5)") t(:)
    end program
    

    If I run it without OpenMP (or optionally with a single thread) the output is as expected:

    s array:
       39   39   39   39   39   39   39   39   39   39
       39   39   39   39   39   39   39   39   39   39
       39   39   39   39   39   39   39   39   39   39
       39   39   39   39   39   39   39   39   39   39
    t array:
       39   39   39   39   39   39   39   39   39   39
       39   39   39   39   39   39   39   39   39   39
       39   39   39   39   39   39   39   39   39   39
       39   39   39   39   39   39   39   39   39   39
    

    Now if I run it with OpenMP I get garbage in t(:):

     s array:
       39   39   39   39   39   39   39   39   39   39
       39   39   39   39   39   39   39   39   39   39
       39   39   39   39   39   39   39   39   39   39
       39   39   39   39   39   39   39   39   39   39
     t array:
       39   38   38   37   37   36   36   35   35   34
       34   33   33   32   32   31   31   30   30   29
       29   28   28   27   27   26   26   25   25   24
       24   23   23   22   22   21   21   20   20   19
    

    The solution is then obviously to update t only after the parallel loop:

    program foo
    implicit none
    
    integer, parameter :: N = 40
    integer :: s(N), t(N), i, j
    
    s(:) = 0
    
    !$ call omp_set_num_threads(2)
    !$OMP PARALLEL DO PRIVATE(j) REDUCTION(+:s) SCHEDULE(dynamic,1)
    do i = 1, N
        do j = i+1, N
            s(i) = s(i)+1
            s(j) = s(j)+1
        end do
    end do
    !$OMP END PARALLEL DO
    
    t(:) = s(:)
    
    write(*,*) "s array:"
    write(*,"(10I5)") s(:)
    write(*,*) "t array:"
    write(*,"(10I5)") t(:)
    end program
    

    Explanation:

    In the example above (with OpenMP and reduction on t), let's assume that thread 0 processes the odd i iterations and thread 1 the even i iterations. Because of the reduction, s and t are implicitly private in the OpenMP region: let's denote by s0, t0, s1, t1 the private versions.

    ...

    At the end of the parallel region, the reduction consists of s = s0 + s1 and t = t0 + t1. You can see that this works for s, but not for t: each private copy t0(i) or t1(i) is set from a partial sum of s only, so the reduced t misses the contributions accumulated by the other thread.
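
    Translated back to the stepper routine, this suggests keeping the reduction only for the arrays that are updated at both the i and the j index (fcx, fcy, fbx, fby, mc, mb), and computing the per-particle totals tfx, tfy and m in a separate loop after the parallel one. Here is a minimal sketch of that structure, assuming the pair routines (rel_pos_vel, contact_forces, ...) only write the (i,j) entries of their work arrays, and that forcing(i) only writes the i-th elements of fax, fay, fwx, fwy, ma and mw:

    !$omp parallel do schedule(dynamic) private(i,j) &
    !$omp reduction(+:fcx,fcy,fbx,fby,mc,mb)
    do i = 1, n
        do j = i + 1, n
            call rel_pos_vel (i, j)
            ! ... contact and bond updates on indices i and j,
            ! exactly as in the original loop body ...
            call sheltering(i, j)
        end do
    end do
    !$omp end parallel do

    ! per-particle terms and totals, computed only after the pair loop,
    ! once the reduced fcx, fcy, fbx, fby, mc, mb are complete
    do i = 1, n
        call forcing (i)
        tfx(i) = fcx(i) + fbx(i) + fax(i) + fwx(i)
        tfy(i) = fcy(i) + fby(i) + fay(i) + fwy(i)
        m(i)   = mc(i) + mb(i) + ma(i) + mw(i)
    end do

    Read-only arrays such as r can simply remain shared, which is the default for variables that are not listed in any clause.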