I am using a machine with two Xeon CPUs, 16 cores each. There are 2 NUMA domains, one per CPU.
I have an intensive computation that also uses a lot of memory, and everything is multithreaded. The global structure of the code is:
!$OMP PARALLEL DO
do i = 1, N
!$OMP CRITICAL
! allocate memory and populate it
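! (with Linux's default first-touch policy, the pages end up in the NUMA domain of the thread that populates them)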
!$OMP END CRITICAL
end do
!$OMP END PARALLEL DO
...
...
!$OMP PARALLEL DO
do i = 1, N
! main computations part 1
end do
!$OMP END PARALLEL DO
...
...
!$OMP PARALLEL DO
do i = 1, N
! main computations part 2
end do
!$OMP END PARALLEL DO
...
...
N is typically ~10000, and each iteration requires a few seconds.
About 50% of the data read/written in the computations at iteration #i is in the memory previously allocated at iteration #i; the remaining 50% is in memory allocated at other iterations (but ones that tend to be close to #i).
Using the same static scheduling for all the loops ensures that, at a given iteration #i, 50% of the memory accessed during the computations has been allocated by the same thread as the one processing the iteration, and is hence in the same NUMA domain.
Moreover, binding the threads with OMP_PROC_BIND and OMP_PLACES (threads 0-15 on CPU #0 and threads 16-31 on CPU #1) ensures that the memory allocated at adjacent iterations is likely in the same NUMA domain.
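For instance, with a bash-like shell, and assuming cores are numbered socket by socket (a common but not universal convention):

export OMP_NUM_THREADS=32
export OMP_PLACES=cores        # one place per physical core
export OMP_PROC_BIND=close     # pack threads in order: 0-15 on CPU #0, 16-31 on CPU #1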
So far so good...
The only issue is that the computational workload is not well balanced between the iterations. It's not too bad, but it can vary by up to +/-20%... Usually, some dynamic scheduling at the computation stages would help, but here it would defeat the whole strategy of having the same thread allocate and then compute iteration #i.
At the very least, I would like iterations 1...N/2 to be processed by threads 0-15 and iterations N/2+1...N by threads 16-31: a first level of static chunking (2 chunks of size N/2), and a second level of dynamic scheduling inside each chunk. This would at least ensure that each thread accesses memory mostly in its own NUMA domain.
But I can't see how to do that cleanly with OpenMP... Is it possible?
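The closest I can sketch is with nested parallel regions (assuming N is even, nesting enabled with OMP_MAX_ACTIVE_LEVELS=2, and use omp_lib for omp_get_thread_num), but nesting brings its own overhead and binding complexity, which is why I am asking for a better way:

! outer level: one thread per NUMA domain
!$omp parallel num_threads(2) proc_bind(spread)
block
   integer :: lo, hi
   lo = omp_get_thread_num()*(N/2) + 1   ! first half -> CPU #0, second half -> CPU #1
   hi = lo + N/2 - 1
   ! inner level: dynamic scheduling within each half
   !$omp parallel do schedule(dynamic) num_threads(16) proc_bind(close)
   do i = lo, hi
      ! main computations
   end do
   !$omp end parallel do
end block
!$omp end parallel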
EDIT: schedule(nonmonotonic:dynamic) could have been a solution here, but on the HPC cluster I am using I am stuck with compiler versions (Intel compiler 2021 at best) that do not implement this scheduling.
That specific Intel compiler version should support "static stealing". To enable it, you need to use schedule(runtime) with the parallel do directive, like so:
!$omp parallel do schedule(runtime)
When running the application, set OMP_SCHEDULE=static_steal in the environment before you start it, e.g., for bash-like shells:
export OMP_SCHEDULE=static_steal
or scoped to a single invocation:
OMP_SCHEDULE=static_steal ./my-application
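Putting it together, a minimal self-contained sketch (the toy loop body is hypothetical) lets you check which thread runs which iteration:

program steal_demo
   use omp_lib
   implicit none
   integer, parameter :: n = 64
   integer :: i
   real :: work(n)
   ! schedule(runtime) defers the actual schedule to OMP_SCHEDULE
   !$omp parallel do schedule(runtime)
   do i = 1, n
      work(i) = sqrt(real(i))   ! placeholder workload
      print *, 'iteration', i, 'ran on thread', omp_get_thread_num()
   end do
   !$omp end parallel do
end program steal_demo

Run it with, e.g., OMP_SCHEDULE=static_steal OMP_NUM_THREADS=32 ./steal_demo.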
The loop is then partitioned statically at first, but when threads run out of work, they can steal iterations from other threads. Does that solve your problem?