
Open MPI 4.0 core and memory binding


I need some hints on how to achieve the core and memory bindings I have in mind, using Open MPI 4.0.1. This is for a single compute node with 8 NUMA nodes and 64 cores, namely 2x AMD Epyc 7551, SMT disabled.

The cores on this system are numbered according to the following scheme:

[Image: core numbering scheme]

I have 3 different binding policies in mind; let's call them "close", "spread" and "scatter". To make the idea clear, I'll give 3 examples for each one, with 6, 16 and 48 threads. However, I need methods that work with an arbitrary number of MPI threads between 1 and 64.

1: "close": The idea here is to keep the threads as close together as possible, i.e. minimising core-to-core latency.

[Images: "close" layout for 6, 16 and 48 threads]

2: "spread": The idea here is to make use of all available memory bandwidth.

[Images: "spread" layout for 6, 16 and 48 threads]

3: "scatter": The idea behind this is that each NUMA node is further divided into 2 groups of 4 cores, where each group has its own L3 cache. Compared to "spread", this policy should maximise the amount of L3 cache available to each thread.

[Images: "scatter" layout for 6, 16 and 48 threads]

Which arguments do I need to pass to mpirun in order to achieve each of these 3 policies? If command-line arguments alone are not enough, I am also open to other methods, e.g. machinefiles.


Solution

  • I do not have the hardware to test it, so I cannot guarantee this is a correct answer.

    You can also run mpirun --report-bindings ... to see how Open MPI pinned the MPI tasks.
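
As a rough, untested sketch of how the three policies might translate into mpirun options: `--map-by`, `--bind-to` and `--report-bindings` are documented in the Open MPI 4.0 mpirun man page, but the pairing of each policy name with a mapping object is my own assumption, and the resulting layout may not match the diagrams for every rank count (in particular for "scatter" with few ranks). The helper name `binding_args`, the variable `NP` and the binary `./app` are placeholders for illustration only. The script only prints the command lines; it does not launch anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): maps the policy names from the
# question to mpirun mapping/binding options. Which object to map by for each
# policy is an assumption and has not been tested on the hardware in question.
binding_args() {
    case "$1" in
        # Fill cores one after another, so neighbouring ranks stay close.
        close)   echo "--map-by core --bind-to core" ;;
        # Place ranks round-robin across NUMA nodes.
        spread)  echo "--map-by numa --bind-to core" ;;
        # Place ranks round-robin across L3 cache domains.
        scatter) echo "--map-by L3cache --bind-to core" ;;
        *)       echo "unknown policy: $1" >&2; return 1 ;;
    esac
}

# Dry run: print one command line per policy instead of executing it.
NP=16   # placeholder rank count
for policy in close spread scatter; do
    echo "mpirun -np $NP $(binding_args "$policy") --report-bindings ./app"
done
```

Running one of the printed commands on the node makes `--report-bindings` show each rank's binding, which can then be compared against the intended layout for that rank count.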