I am running the same Fortran code on my local Windows machine and on a Linux cluster.
A minimal working example (MWE) of the code is:
program mwe
    use omp_lib
    implicit none
    integer :: indx
    integer :: indy
    integer :: indz
    integer :: total_z = 1000000
    integer :: total_x = 200
    integer :: total_y = 100
    real, allocatable :: output(:,:), grid_x(:), grid_y(:)
    real :: startTime, midTime, totalTime
    allocate(output(total_x, total_y))
    allocate(grid_x(total_x))
    allocate(grid_y(total_y))
    do indx = 1, total_x
        grid_x(indx) = indx * 2
    end do
    do indy = 1, total_y
        grid_y(indy) = indy / 3
    end do
    output = 0.0
    startTime = omp_get_wtime()
    do indz = 1,total_z
        do indy = 1,total_y
            !$omp parallel do default(private) shared(indy, total_y, indz, total_z, total_x, output, startTime)
            do indx = 1,total_x
                output(indx,indy) = output(indx,indy) + grid_x(indx)/grid_y(indy)
                if ((mod(indz, 100000) .eq. 0) .and. (indx .eq. 1) .and. (indy .eq. 1)) then
                    midTime = omp_get_wtime()
                    write(*,"(A,F15.1,A)") "Total time elapsed",midTime-startTime," seconds."
                end if
            end do
            !$omp end parallel do
        end do
    end do
end program
On Windows I compile and run the code with:
ifort /O2 /c /Qopenmp main.f90
ifort /o main.out /O2 /Qopenmp main.f90 /link /STACK:999999999,999999999
main.out
And the output is:
Total time elapsed 119.4 seconds.
Total time elapsed 237.8 seconds.
Total time elapsed 357.7 seconds.
Total time elapsed 474.1 seconds.
Total time elapsed 588.7 seconds.
Total time elapsed 730.1 seconds.
Total time elapsed 873.0 seconds.
Total time elapsed 1015.4 seconds.
Total time elapsed 1159.6 seconds.
On the cluster (Linux) I compile and run the code with this batch script:
#!/bin/bash
#SBATCH -e error_file.err
#SBATCH -p common
#SBATCH -c 40
#SBATCH --job-name==mwe
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_STACKSIZE=512mb
source /opt/apps/rhel8/intel-2020/compilers_and_libraries/linux/bin/compilervars.sh intel64
ifort -O2 -c -qopenmp main.f90
ifort -o main.out -O2 -qopenmp main.f90
./main.out
and the output is:
Total time elapsed 0.0 seconds.
Total time elapsed 0.0 seconds.
Total time elapsed 128.0 seconds.
Total time elapsed 128.0 seconds.
Total time elapsed 128.0 seconds.
Total time elapsed 256.0 seconds.
Total time elapsed 256.0 seconds.
Total time elapsed 384.0 seconds.
Total time elapsed 384.0 seconds.
Total time elapsed 384.0 seconds.
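You are computing the difference of two large numbers, but the default real does not have sufficient precision to hold them. omp_get_wtime() returns a double precision value, and assigning it to a default real rounds it to about 7 significant digits. For values between 2^30 and 2^31 (roughly 1.07e9 to 2.15e9, which includes current Unix epoch times), adjacent single-precision values are 128 apart, which is consistent with the 128-second steps in your cluster output.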
Compare
Total time elapsed 1717144796.8 1717144796.7 seconds.
Total time elapsed 1717144797.0 1717144796.7 seconds.
Total time elapsed 1717144797.1 1717144796.7 seconds.
Total time elapsed 1717144797.2 1717144796.7 seconds.
Total time elapsed 1717144797.4 1717144796.7 seconds.
Total time elapsed 1717144797.5 1717144796.7 seconds.
Total time elapsed 1717144797.6 1717144796.7 seconds.
Total time elapsed 1717144797.8 1717144796.7 seconds.
Total time elapsed 1717144797.9 1717144796.7 seconds.
Total time elapsed 1717144798.1 1717144796.7 seconds.
Total time elapsed 1717144798.2 1717144796.7 seconds.
Total time elapsed 1717144798.3 1717144796.7 seconds.
Total time elapsed 1717144798.5 1717144796.7 seconds.
Total time elapsed 1717144798.6 1717144796.7 seconds.
Total time elapsed 1717144798.7 1717144796.7 seconds.
Total time elapsed 1717144798.9 1717144796.7 seconds.
with double precision, and
Total time elapsed 1717144832.0 1717144832.0 seconds.
Total time elapsed 1717144832.0 1717144832.0 seconds.
Total time elapsed 1717144832.0 1717144832.0 seconds.
Total time elapsed 1717144832.0 1717144832.0 seconds.
Total time elapsed 1717144832.0 1717144832.0 seconds.
Total time elapsed 1717144832.0 1717144832.0 seconds.
with real, for code modified to print both times:
write(*,"(A,F15.1,F15.1,A)") "Total time elapsed",midTime,startTime," seconds."
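The rounding behind those two listings can be reproduced without any timer. The following is a minimal sketch, assuming the default real is IEEE single precision (the usual case for ifort and gfortran unless a flag such as -r8 changes it). The program name and the sample value are only illustrative; the value is taken from the listing above. It uses the standard spacing intrinsic to show the distance between adjacent representable numbers at that magnitude.

```fortran
program real_spacing_demo
    implicit none
    integer, parameter :: sp = kind(1.0)    ! default real (assumed 32-bit IEEE)
    integer, parameter :: dp = kind(1.0d0)  ! double precision
    real(dp) :: t_dp
    real(sp) :: t_sp

    ! A timer value of the same magnitude as in the listings above
    t_dp = 1717144796.7_dp
    ! Converting to default real rounds to the nearest representable value
    t_sp = real(t_dp, sp)

    write(*,"(A,F15.1)")  "stored as double precision: ", t_dp
    write(*,"(A,F15.1)")  "stored as default real:     ", t_sp
    ! spacing(x) is the gap between x and the next representable number
    write(*,"(A,ES12.4)") "spacing of double precision:", spacing(t_dp)
    write(*,"(A,F12.1)")  "spacing of default real:    ", spacing(t_sp)
end program real_spacing_demo
```

Under that assumption the default real spacing comes out as 128.0, so a real variable holding such a time can only change in jumps of 128 seconds, while the double precision spacing is well below a microsecond.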
You need sufficient precision to store the floating-point times. The timers are specific to the compiler suite and operating system, so the values will differ; I get different results with Intel and GCC. If the timer starts from a larger or smaller value, you will see a coarser or finer effective resolution when using real, which is presumably why your Windows run shows plausible elapsed times.
The actual resolution of the timer is given by omp_get_wtick(), but it is only achievable when the times are stored in a real of high enough kind.
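A minimal sketch of the fix, assuming the only change needed is the kind of the time variables: declare them with the same kind that omp_get_wtime() returns (double precision) and print omp_get_wtick() alongside. The program name, the placeholder loop and its iteration count are hypothetical and stand in for the real workload.

```fortran
program timing_sketch
    use omp_lib
    implicit none
    integer, parameter :: dp = kind(1.0d0)  ! kind of omp_get_wtime's result
    real(dp) :: startTime, midTime, acc
    integer :: i

    startTime = omp_get_wtime()

    ! Placeholder workload, only here so that some time passes
    acc = 0.0_dp
    do i = 1, 50000000
        acc = acc + sqrt(real(i, dp))
    end do

    midTime = omp_get_wtime()

    ! The difference is now taken between two double precision values
    write(*,"(A,F15.6,A)")  "Total time elapsed", midTime - startTime, " seconds."
    write(*,"(A,ES12.4,A)") "Timer resolution  ", omp_get_wtick(), " seconds."
    ! Printed so the compiler cannot discard the loop
    write(*,"(A,ES12.4)")   "Workload result   ", acc
end program timing_sketch
```

This needs the same OpenMP flag as the original code (-qopenmp or /Qopenmp with ifort, -fopenmp with gfortran). In the MWE, the equivalent change would be declaring startTime and midTime as double precision instead of real.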