I have the following subroutine, which is the stepper function in a bigger DEM program. It computes every interaction between particles i and j, then updates the forces. As a first effort, I am trying to parallelize this O(N^2) double loop before moving on to fancier neighbour-search algorithms.

But I can't find a way to make it work. I know it is due to variables being modified by different threads, but I don't know how to deal with that properly. One possibility would be to redefine all my arrays as matrices, but I don't know whether that is the correct way.

Also, note that I use include for my variables, since I have a lot of variables that need to be accessed and modified by many subroutines, and it felt like the easy way to make that work. I am open to suggestions.
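(For illustration, my understanding is that the usual modern alternative to an include file with common blocks is a module; something like the sketch below, where all names and sizes are placeholders, not my actual declarations:)

module cb_variables   ! hypothetical replacement for CB_variables.h
    implicit none
    integer, parameter :: nmax = 10000          ! placeholder maximum particle count
    double precision :: fcx(nmax), fcy(nmax)    ! contact forces
    double precision :: fbx(nmax), fby(nmax)    ! bond forces
    double precision :: mc(nmax), mb(nmax)      ! contact and bond moments
end module cb_variables

! each subroutine would then start with
!     use cb_variables
! instead of
!     include "CB_variables.h"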
Here is my subroutine:
subroutine stepper (tstep)

    use omp_lib

    implicit none

    include "parameter.h"
    include "CB_variables.h"
    include "CB_const.h"
    include "CB_bond.h"
    include "CB_forcings.h"

    integer :: i, j
    integer, intent(in) :: tstep

    ! reinitialize force arrays for contacts and bonds
    do i = 1, n
        mc(i)  = 0d0
        mb(i)  = 0d0
        fcx(i) = 0d0
        fcy(i) = 0d0
        fbx(i) = 0d0
        fby(i) = 0d0
    end do

    ! put yourself in the referential of the ith particle
    ! loop through all j particles and compute interactions
    !$omp parallel do schedule(dynamic) &
    !$omp private(i,j) &
    !$omp reduction(+:tfx,tfy,fcx,fcy,fbx,fby,m,mc,mb)
    do i = 1, n
        do j = i + 1, n

            ! compute relative position and velocity
            call rel_pos_vel (i, j)

            ! bond initialization
            if ( tstep .eq. 1 ) then
                if ( -deltan(i, j) .le. 5d-1 * r(i) ) then ! can be fancier
                    !bond (i, j) = 1
                end if
                call bond_properties (i, j)
            end if

            ! verify if two particles are colliding
            if ( deltan(i,j) .gt. 0 ) then

                call contact_forces (i, j)
                !call bond_creation (i, j) ! to implement

                ! update contact force on particle i by particle j
                fcx(i) = fcx(i) - fcn(i,j) * cosa(i,j)
                fcy(i) = fcy(i) - fcn(i,j) * sina(i,j)

                ! update moment on particle i by particle j due to tangent contact
                mc(i) = mc(i) - r(i) * fct(i,j) - mcc(i,j)

                ! Newton's third law
                ! update contact force on particle j by particle i
                fcx(j) = fcx(j) + fcn(i,j) * cosa(i,j)
                fcy(j) = fcy(j) + fcn(i,j) * sina(i,j)

                ! update moment on particle j by particle i due to tangent contact
                mc(j) = mc(j) - r(j) * fct(i,j) + mcc(i,j)

            end if

            ! compute forces from bonds between particles i and j
            if ( bond (i, j) .eq. 1 ) then

                call bond_forces (i, j)
                !call bond_breaking (i, j)

                ! update force on particle i by particle j due to bond
                fbx(i) = fbx(i) - fbn(i,j) * cosa(i,j) + &
                                  fbt(i,j) * sina(i,j)
                fby(i) = fby(i) - fbn(i,j) * sina(i,j) - &
                                  fbt(i,j) * cosa(i,j)

                ! update moment on particle i by particle j due to bond
                mb(i) = mb(i) - r(i) * fbt(i,j) - mbb(i, j)

                ! Newton's third law
                ! update force on particle j by particle i due to bond
                fbx(j) = fbx(j) + fbn(i,j) * cosa(i,j) - &
                                  fbt(i,j) * sina(i,j)
                fby(j) = fby(j) + fbn(i,j) * sina(i,j) + &
                                  fbt(i,j) * cosa(i,j)

                ! update moment on particle j by particle i due to bond
                mb(j) = mb(j) - r(i) * fbt(i,j) + mbb(j, i)

            end if

            ! compute sheltering height of particle j on particle i for air and water drag
            call sheltering(i, j)

        end do

        ! compute the total forcing from winds, currents and Coriolis on particle i
        call forcing (i)
        !call coriolis(i)

        ! reinitialize total force arrays before summing everything
        m(i)   = 0d0
        tfx(i) = 0d0
        tfy(i) = 0d0

        ! sum all forces on particle i together
        tfx(i) = fcx(i) + fbx(i) + fax(i) + fwx(i)
        tfy(i) = fcy(i) + fby(i) + fay(i) + fwy(i)

        ! sum all moments on particle i together
        m(i) = mc(i) + mb(i) + ma(i) + mw(i)

    end do
    !$omp end parallel do

    ! forces on side particles for experiments
    call experiment_forces

    ! integration in time
    call velocity
    call euler

end subroutine stepper
- rel_pos_vel(i, j): computes the relative positions, velocities, angles, etc. between particles i and j. Everything inside is a 2D array from common blocks.
- bond_properties(i, j): same thing here.
- contact_forces(i, j): some local variables as well as some arrays from common blocks.
- bond_forces(i, j): same thing.
- sheltering(i, j) and forcing(i): same.
I tried to put the variables that are being modified (tfx, tfy, m, etc.) in the private clause, but without success, and I don't think I understand the reduction clause well either. Also, I know that some constant arrays, like the radius r(i), are accessed by multiple threads, but I don't know how to deal with that.
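(If I understand correctly, the pattern should be something like this simplified sketch, where r is only ever read, so it can stay shared, while fcx receives contributions to both the i and j entries from different iterations and therefore needs the reduction:)

!$omp parallel do private(i, j) reduction(+:fcx)
do i = 1, n
    do j = i + 1, n
        ! r is only read here: safe to leave shared
        fcx(i) = fcx(i) - r(j)
        fcx(j) = fcx(j) + r(i)
    end do
end do
!$omp end parallel do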
Actually, when printing the number of threads in the parallel region (using omp_get_num_threads()), the output is 1, indicating that no parallelization is taking place.
Also, when trying to parallelize the first initialization loop as a first step towards understanding:
!$omp do private(i) schedule(dynamic) num_threads(10)
print*, omp_get_num_threads()
do i = 1, n
    mc(i)  = 0d0
    mb(i)  = 0d0
    fcx(i) = 0d0
    fcy(i) = 0d0
    fbx(i) = 0d0
    fby(i) = 0d0
end do
!$omp parallel do
it is still not working: the output of omp_get_num_threads() is 1, even though the number of threads is explicitly given as 10.
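(For what it's worth, a minimal self-contained probe, assuming compilation with -fopenmp, shows where the count has to be taken: omp_get_num_threads() returns 1 whenever it is called outside an active parallel region:)

program check_threads
    use omp_lib
    implicit none
    ! outside any parallel region this always prints 1
    print *, "outside: ", omp_get_num_threads()
    !$omp parallel num_threads(10)
    !$omp single
    ! inside an active parallel region this prints the team size (10 here)
    print *, "inside:  ", omp_get_num_threads()
    !$omp end single
    !$omp end parallel
end program check_threads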
I am compiling with gfortran, using the flags -ffast-math, -O3, and -fopenmp.
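(The full command is something like the following; the source and output file names are just placeholders:)

gfortran -O3 -ffast-math -fopenmp stepper.f90 -o dem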
As I wrote in a comment: before the end of the i do loop you are updating the tfx and tfy arrays, but only with the i index. Consequently, everything that has been accumulated through the j index does not participate here. So you should update them either inside the j do loop (and also for the j index), or in a separate loop after the parallel loop.

Here is a demonstration code:
program foo
    implicit none
    integer, parameter :: N = 40
    integer :: s(N), t(N), i, j

    s(:) = 0
    t(:) = 0

    !$ call omp_set_num_threads(2)
    !$OMP PARALLEL DO PRIVATE(j) REDUCTION(+:s,t) SCHEDULE(dynamic,1)
    do i = 1, N
        do j = i+1, N
            s(i) = s(i)+1
            s(j) = s(j)+1
        end do
        t(i) = s(i)
    end do
    !$OMP END PARALLEL DO

    write(*,*) "s array:"
    write(*,"(10I5)") s(:)
    write(*,*) "t array:"
    write(*,"(10I5)") t(:)
end program
If I run it without OpenMP (or optionally with a single thread) the output is as expected:
s array:
39 39 39 39 39 39 39 39 39 39
39 39 39 39 39 39 39 39 39 39
39 39 39 39 39 39 39 39 39 39
39 39 39 39 39 39 39 39 39 39
t array:
39 39 39 39 39 39 39 39 39 39
39 39 39 39 39 39 39 39 39 39
39 39 39 39 39 39 39 39 39 39
39 39 39 39 39 39 39 39 39 39
Now if I run it with OpenMP I get garbage in t(:):
s array:
39 39 39 39 39 39 39 39 39 39
39 39 39 39 39 39 39 39 39 39
39 39 39 39 39 39 39 39 39 39
39 39 39 39 39 39 39 39 39 39
t array:
39 38 38 37 37 36 36 35 35 34
34 33 33 32 32 31 31 30 30 29
29 28 28 27 27 26 26 25 25 24
24 23 23 22 22 21 21 20 20 19
The solution is then obviously to update t only after the parallel loop:
program foo
    implicit none
    integer, parameter :: N = 40
    integer :: s(N), t(N), i, j

    s(:) = 0

    !$ call omp_set_num_threads(2)
    !$OMP PARALLEL DO PRIVATE(j) REDUCTION(+:s) SCHEDULE(dynamic,1)
    do i = 1, N
        do j = i+1, N
            s(i) = s(i)+1
            s(j) = s(j)+1
        end do
    end do
    !$OMP END PARALLEL DO

    t(:) = s(:)

    write(*,*) "s array:"
    write(*,"(10I5)") s(:)
    write(*,*) "t array:"
    write(*,"(10I5)") t(:)
end program
In the first example above (with OpenMP and the reduction on t), let's assume that thread 0 processes the odd i iterations and thread 1 the even i iterations. Because of the reduction, s and t are implicitly private inside the OpenMP region: let's denote the private versions s0, t0 (on thread 0) and s1, t1 (on thread 1).
thread 0, i=1, after the loop on j:
    s0(:)=[N-1,1,1,...,1]
    t0(:)=[N-1,0,0,...,0]
thread 1, i=2, after the loop on j:
    s1(:)=[0,N-2,1,...,1]
    t1(:)=[0,N-2,0,...,0]
thread 0, i=3, after the loop on j:
    s0(:)=[N-1,1,N-2,2,2,...,2]
    t0(:)=[N-1,0,N-2,0,0,...,0]
thread 1, i=4, after the loop on j:
    s1(:)=[0,N-2,1,N-3,1,...,1]
    t1(:)=[0,N-2,0,N-3,0,...,0]
...
thread 0, i=N-1, after the loop on j:
    s0(:)=[N-1,1,N-2,2,N-3,3,N-4,4,...,N/2-1,N/2]
    t0(:)=[N-1,0,N-2,0,N-3,0,N-4,0,...,N/2-1,0]
thread 1, i=N, after the loop on j:
    s1(:)=[0,N-2,1,N-3,1,N-4,...,N/2,N/2-1]
    t1(:)=[0,N-2,0,N-3,0,N-4,...,0,N/2]
At the end of the parallel region, the reduction consists of s = s0 + s1 and t = t0 + t1. You can see that this works out for s, but not for t.
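Applied to the stepper of the question, this means keeping only the pair-accumulated arrays (fcx, fcy, fbx, fby, mc, mb) in the reduction and computing the totals in a separate serial loop once the parallel region has ended. A sketch, reusing the names from the question:

!$omp parallel do schedule(dynamic) private(i, j) &
!$omp reduction(+:fcx,fcy,fbx,fby,mc,mb)
do i = 1, n
    do j = i + 1, n
        ! pair interactions exactly as before:
        ! rel_pos_vel, bond_properties, contact_forces, bond_forces, sheltering, ...
    end do
end do
!$omp end parallel do

! totals only once every pair contribution has been reduced
do i = 1, n
    call forcing (i)
    tfx(i) = fcx(i) + fbx(i) + fax(i) + fwx(i)
    tfy(i) = fcy(i) + fby(i) + fay(i) + fwy(i)
    m(i)   = mc(i)  + mb(i)  + ma(i)  + mw(i)
end do

This second loop is O(N), so running it serially after the parallel region costs essentially nothing compared to the O(N^2) pair loop.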