I need some hints on how to achieve the core and memory bindings I have in mind, using Open MPI 4.0.1. This is for a single compute node with 8 NUMA nodes and 64 cores, namely 2x AMD Epyc 7551, SMT disabled.
The cores on this system are numbered according to the following scheme:
Now I have 3 different binding policies in mind; let's call them "close", "spread" and "scatter". I'll give an example of each for 6, 16 and 48 ranks to (hopefully) make my idea clear. But I need methods that work for an arbitrary number of MPI ranks (processes) between 1 and 64.
1: "close" The idea here is to keep the ranks as close together as possible, i.e. minimising core-to-core latency.
2: "spread" The idea here is to distribute the ranks evenly across the NUMA nodes, so as to make use of all available memory bandwidth.
3: "scatter" The idea behind this is that each NUMA node is itself divided into 2 groups of 4 cores, where each group has its own L3 cache. Compared to "spread", this policy should maximise the amount of L3 cache available to each rank.
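To make the intended core assignments concrete, here is a small Python sketch of the three policies. It assumes cores are numbered consecutively per NUMA node (0-7 on NUMA node 0, 8-15 on node 1, and so on), with each NUMA node split into two 4-core groups sharing an L3 cache; this numbering is an assumption, not something I have verified on the actual machine.

```python
# Sketch of the intended policies on an assumed topology:
# 64 cores, 8 NUMA nodes of 8 cores, 16 L3 groups of 4 cores.
# Core numbering per NUMA node is an assumption for illustration.

CORES = 64
NUMA_NODES = 8
CORES_PER_NUMA = CORES // NUMA_NODES      # 8
L3_GROUPS = 16                            # 2 per NUMA node
CORES_PER_L3 = CORES // L3_GROUPS         # 4

def close(n):
    """Fill consecutive cores: rank i -> core i."""
    return list(range(n))

def spread(n):
    """Round-robin over NUMA nodes, then over cores within a node."""
    cores = [numa * CORES_PER_NUMA + c
             for c in range(CORES_PER_NUMA)
             for numa in range(NUMA_NODES)]
    return cores[:n]

def scatter(n):
    """Round-robin over the 16 L3 cache groups (4-core groups)."""
    cores = [grp * CORES_PER_L3 + c
             for c in range((CORES_PER_L3))
             for grp in range(L3_GROUPS)]
    return cores[:n]

print(close(6))    # [0, 1, 2, 3, 4, 5]
print(spread(6))   # [0, 8, 16, 24, 32, 40]
print(scatter(6))  # [0, 4, 8, 12, 16, 20]
```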
Which arguments do I need to pass to mpirun in order to achieve each of these 3 policies? Or any other method, if this cannot be achieved without the help of e.g. machinefiles.
I do not have the hardware to test it, so I cannot guarantee this is a correct answer, but the mappings should be:
"close":   mpirun --bind-to core --rank-by core --map-by core ...
"spread":  mpirun --bind-to core --rank-by core --map-by numa ...
"scatter": mpirun --bind-to core --rank-by core --map-by l3cache ...
You can also add --report-bindings to the mpirun command line in order to see how the MPI tasks were pinned by Open MPI.
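If the built-in mapping policies turn out not to match what you want exactly, Open MPI also supports explicit rankfiles, where each rank is pinned to a slot by hand. A sketch for the 6-rank "scatter" case, assuming the core numbering from the question (the hostname node01 is a placeholder):

```
rank 0=node01 slot=0
rank 1=node01 slot=4
rank 2=node01 slot=8
rank 3=node01 slot=12
rank 4=node01 slot=16
rank 5=node01 slot=20
```

which would be passed to Open MPI via mpirun --rankfile myrankfile ...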