[SOLVED] Open MPI 4.0 core and memory binding

I need some hints on how to achieve the core and memory bindings I have in mind, using Open MPI 4.0.1. This is for a single compute node with 8 NUMA nodes and 64 cores, namely 2x AMD Epyc 7551, SMT disabled.

The cores on this system are numbered according to the following scheme:

I have 3 different binding policies in mind; let's call them "close", "spread" and "scatter". I'll give 3 examples for each one, with 6, 16 and 48 threads, to make the idea clear (hopefully). But I need methods that work for any number of MPI threads (ranks) between 1 and 64.

1: "close" The idea here is to keep the threads as close as possible, i.e. minimising core-core latency.

2: "spread" The idea here is to distribute the threads evenly across the NUMA nodes, so that all available memory bandwidth is used.

3: "scatter" The idea behind this is that each NUMA node is divided again into 2 groups of 4 cores, where each group has its own L3 cache. Compared to "spread", this policy should maximise the amount of L3 cache available to each thread.

Which arguments do I need to pass to mpirun to achieve each of these 3 policies? If this cannot be done with mpirun arguments alone, any other method (e.g. machinefiles) is welcome.

Solution

I do not have the hardware to test this, so I cannot guarantee that the answer is correct.

"close" mpirun --bind-to core --rank-by core --map-by core ...
"spread" mpirun --bind-to core --rank-by core --map-by numa ...
"scatter" mpirun --bind-to core --rank-by core --map-by l3cache ...
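
To switch between the three policies without retyping the options, they can be wrapped in a small shell helper. This is only a sketch: policy_flags is a made-up function name, ./my_app is a placeholder for the real MPI program, and the loop just prints the commands (dry run) instead of launching anything. Like the commands above, it is untested on the actual hardware.

```shell
#!/bin/sh
# policy_flags NAME -> prints the mpirun options suggested above for that policy.
# (Hypothetical helper, not part of Open MPI.)
policy_flags() {
    case "$1" in
        close)   map=core ;;
        spread)  map=numa ;;
        scatter) map=l3cache ;;
        *) echo "unknown policy: $1" >&2; return 1 ;;
    esac
    echo "--bind-to core --rank-by core --map-by ${map}"
}

# Dry run: print the full command for each policy with 16 ranks.
# ./my_app is a placeholder for the actual MPI program.
for p in close spread scatter; do
    echo "mpirun -np 16 $(policy_flags "$p") --report-bindings ./my_app"
done
```

Replace 16 with any rank count between 1 and 64, and drop the echo to actually launch the job.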

You can also add --report-bindings to the mpirun command line to see how Open MPI pinned the MPI tasks.
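
As a cross-check that does not rely on Open MPI's own report, each rank can print the CPU list the kernel allows it to run on. A minimal sketch, assuming Linux (it reads /proc/self/status) and assuming the script is saved under the hypothetical name show_affinity.sh; OMPI_COMM_WORLD_RANK is the environment variable Open MPI sets for each launched process.

```shell
#!/bin/sh
# Print the CPUs the calling process is allowed to run on (Linux only).
# OMPI_COMM_WORLD_RANK is set by Open MPI's mpirun; "?" is shown if it is unset.
show_affinity() {
    rank="${OMPI_COMM_WORLD_RANK:-?}"
    # grep is a child of this shell and inherits its affinity mask,
    # so reading /proc/self/status from grep reports the same CPU list.
    cpus=$(grep Cpus_allowed_list /proc/self/status 2>/dev/null | awk '{print $2}')
    echo "rank ${rank}: cpus ${cpus}"
}

show_affinity
```

It would be launched in place of the real program, e.g. mpirun -np 6 --bind-to core --rank-by core --map-by numa sh ./show_affinity.sh, and should print one line per rank.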