I am trying to execute a large loop in C++ with #pragma omp parallel for. I need to keep track of which index values i have already been tested (checkpointing), because the job will need to be resumed if it gets interrupted (killed). (Redoing some of the work is fine.) The code below works well on my Intel-based Linux machine, but on Apple Silicon the OpenMP schedule is all over the place (see the outputs below). How can this be fixed? Or is there a more advanced method of checkpointing in this context?
#include <cstdio>   // puts
#include <string>   // to_string
#include <thread>
#include <random>
#include <chrono>
#include <omp.h>

using namespace std;

int main() {
    omp_set_num_threads(3);
    #pragma omp parallel for schedule(dynamic, 100)
    for (int i = 0; i < 20000; ++i) {
        int id = omp_get_thread_num();
        // using puts since it is thread-safe
        if (!(i % 100)) puts((to_string(i) + ": id:" + to_string(id)).c_str());
        // instead of the actual job, create a tiny random delay
        mt19937_64 eng{random_device{}()}; // per-iteration seed
        uniform_int_distribution<> dist{10, 100};
        this_thread::sleep_for(chrono::milliseconds{dist(eng)});
    }
    return 0;
}
Linux with Intel:
0: id:2
100: id:1
200: id:0
300: id:1
400: id:0
500: id:2
....
Mac + M3 processor (with the macOS OpenMP library, libomp, and the Clang compiler):
0: id:0
6700: id:1
13400: id:2
100: id:0
6800: id:1
13500: id:2
200: id:0
6900: id:1
This schedule is useless for checkpointing. I am trying to avoid a static schedule since it is not efficient for this workload.
It seems that the compiler + OpenMP library on your Mac uses the nonmonotonic modifier of the dynamic schedule. This modifier is relatively recent (it appeared in OpenMP 4.5) and is not yet implemented in all OpenMP libraries (it was not in GCC 13; I don't know about GCC 14).
On the Mac, try forcing the legacy monotonic behavior with: schedule(monotonic:dynamic, 100)
The OpenMP 5.2 specification describes both modifiers. In practice, with a schedule(monotonic:dynamic, chunksize) schedule, each idle thread is given the next unprocessed chunk of iterations, in order. With a schedule(nonmonotonic:dynamic, chunksize) schedule, the iterations are pre-assigned to the threads as if schedule(static) were used (regardless of chunksize), and once a thread has finished its pre-assigned work, it is given any of the unprocessed chunks (that is, it "steals" the chunks that were initially assigned to the other threads). The nonmonotonic:dynamic schedule is meant to perform at worst about as well as the static schedule when the workload is balanced across iterations, and as well as the legacy monotonic:dynamic schedule when the workload is unbalanced.
See also this answer: https://stackoverflow.com/a/77799465/14778592