I have a working code compiled with gcc/gfortran-14 (installed via Homebrew). The vast majority of the runtime is spent doing FFTs via FFTW. In the critical part of the code I have this:
```fortran
!$omp parallel
!$omp sections
!$omp section
call fft(xi3d,xo3d,.false.,xip,layered_p)
!$omp section
call fft(yi3d,yo3d,.false.,yip,layered_p)
!$omp section
call fft(zi3d,zo3d,.false.,zip,layered_p)
!$omp end sections
!$omp sections
!$omp section
x3d=n3d(:,:,:,1,1)*xo3d+n3d(:,:,:,2,1)*yo3d+n3d(:,:,:,3,1)*zo3d
!$omp section
y3d=n3d(:,:,:,4,1)*xo3d+n3d(:,:,:,5,1)*yo3d+n3d(:,:,:,6,1)*zo3d
!$omp section
z3d=n3d(:,:,:,7,1)*xo3d+n3d(:,:,:,8,1)*yo3d+n3d(:,:,:,9,1)*zo3d
!$omp end sections
!$omp sections
!$omp section
call fft(rx3d,x3d,.true.,xp,layered_p)
!$omp section
call fft(ry3d,y3d,.true.,yp,layered_p)
!$omp section
call fft(rz3d,z3d,.true.,zp,layered_p)
!$omp end sections
!$omp end parallel
```
where `x/y/zi3d` and `rx/y/z3d` are real 3D arrays and all the others are complex 3D arrays (all created with `fftw_alloc_real`/`fftw_alloc_complex`). The `fft` subroutine simply calls `fftw_plan_dft_c2r_3d` or `fftw_plan_dft_r2c_3d` depending on the third parameter (i.e. `.true.` or `.false.`).
I usually run with three threads, so the FFTs run in parallel. This works really well: the code sees an almost 3x performance improvement.
Now I'm trying to use OpenMP within FFTW itself, that is, tell FFTW to use 2 or 3 threads in each `fft` instead of one. So at the start of the code I do the required initialization:
```fortran
if (fftw_init_threads().eq.0) then
   print *,'fftw has problem: fftw_init_threads'
else
   call fftw_plan_with_nthreads(max(1,omp_get_max_threads()/3))
   print *,'each fftw will use ',max(1,omp_get_max_threads()/3),' threads'
   ! call omp_set_nested(.true.)
endif
```
However, when I run the code (even after setting `export OMP_NUM_THREADS=6` or `export OMP_NUM_THREADS=3,2`), it runs slower. With the setting of 6, `top` tells me only three processors are busy. The other setting shows about 4.5 processors busy, but in both cases the code runs slower than just using 3 threads without FFTW's OpenMP support.
The size of the arrays I'm using is 512x512x256. I'm on a MacBook Pro with an M1 Pro and 32GB of memory.
Do I need the `omp_set_nested(.true.)` or not? What can I do to see all 6 processors pegged while doing the FFTW work (which, in this code, is over 95% of the time!)?
Any suggestions on what to try?
From an OpenMP perspective, you want to execute two nested parallel regions: 3 threads at the outer level, each of which then executes n threads at the inner level. Starting with OpenMP 5.0, `nested-var` and its API were deprecated; nesting is now controlled either by setting `max-active-levels-var`, or by the elements of the `nthreads-var` list (`3,n` in your case). `omp_get_max_threads()` is defined to return the head value of the `nthreads-var` list, so from serial context it will give you 3.
In your code, you therefore either tell FFTW to plan for execution with 2 threads (`OMP_NUM_THREADS=6`) but then execute with a single thread, or you tell FFTW to plan for execution with 1 thread (`OMP_NUM_THREADS=3,2`) and then execute with 2 threads.
To plan properly for FFTW execution, the tricky part is to access `n` from the list. You can spawn a parallel region to emulate your 3-sections region and pop the 3 from the `nthreads-var` list. At this point you can access the second value using `omp_get_max_threads()`.
```fortran
if (fftw_init_threads().eq.0) then
   print *,'fftw has problem: fftw_init_threads'
else
   !$omp parallel
   !$omp single
   call fftw_plan_with_nthreads(max(1,omp_get_max_threads()))
   print *,'each fftw will use ',max(1,omp_get_max_threads()),' threads'
   !$omp end single
   !$omp end parallel
endif
```
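With the planning fixed, the run configuration is then just the environment settings; a sketch of how one might launch (the binary name is a placeholder):

```shell
# nthreads-var list: 3 threads for the outer sections region,
# 2 threads for each nested FFTW region.
export OMP_NUM_THREADS=3,2
# Make two active levels explicit (optional: a multi-element
# OMP_NUM_THREADS list already raises max-active-levels-var).
export OMP_MAX_ACTIVE_LEVELS=2
./your_code   # placeholder binary name
```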
GCC has implemented nesting based on the `nthreads-var` list since at least version 9, so you don't need to additionally call `omp_set_nested`.