
Pass IntelMPI flag to SLURM environment


How can I pass the Intel MPI flag -print-rank-map to the srun command, or set it through an environment variable in a batch script submitted to SLURM with sbatch?


Solution

  • Instead of -print-rank-map, you can get this information by setting export I_MPI_DEBUG=4 in the batch script and combining its output with knowledge of which CPU IDs belong to which socket. For example, I can get the mapping between sockets and CPU IDs from lscpu:

    [auser@login3 ~]$ lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              72
    On-line CPU(s) list: 0-71
    Thread(s) per core:  2
    Core(s) per socket:  18
    Socket(s):           2
    NUMA node(s):        2
    Vendor ID:           GenuineIntel
    CPU family:          6
    Model:               79
    Model name:          Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
    Stepping:            1
    CPU MHz:             3091.353
    CPU max MHz:         3300.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            4199.86
    Virtualization:      VT-x
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            256K
    L3 cache:            46080K
    NUMA node0 CPU(s):   0-17,36-53
    NUMA node1 CPU(s):   18-35,54-71
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
    

    This is a two-socket Intel Broadwell system with two NUMA nodes, so each NUMA node corresponds to one socket:

    NUMA node0 CPU(s):   0-17,36-53
    NUMA node1 CPU(s):   18-35,54-71
    

    With export I_MPI_DEBUG=4 set, Intel MPI prints output like the following. The Pin cpu column shows that ranks 0-17 are pinned to CPUs 0-17 (NUMA node 0, i.e. socket 0) and ranks 18-35 are pinned to CPUs 18-35 (NUMA node 1, i.e. socket 1).

    [0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9  Build 20200923 (id: abd58e492)
    [0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation.  All rights reserved.
    [0] MPI startup(): library kind: release
    [0] MPI startup(): libfabric version: 1.10.1-impi
    [0] MPI startup(): libfabric provider: verbs;ofi_rxm
    [0] MPI startup(): Rank    Pid      Node name  Pin cpu
    [0] MPI startup(): 0       2685376  r1i7n14    0
    [0] MPI startup(): 1       2685377  r1i7n14    1
    [0] MPI startup(): 2       2685378  r1i7n14    2
    [0] MPI startup(): 3       2685379  r1i7n14    3
    [0] MPI startup(): 4       2685380  r1i7n14    4
    [0] MPI startup(): 5       2685381  r1i7n14    5
    [0] MPI startup(): 6       2685382  r1i7n14    6
    [0] MPI startup(): 7       2685383  r1i7n14    7
    [0] MPI startup(): 8       2685384  r1i7n14    8
    [0] MPI startup(): 9       2685385  r1i7n14    9
    [0] MPI startup(): 10      2685386  r1i7n14    10
    [0] MPI startup(): 11      2685387  r1i7n14    11
    [0] MPI startup(): 12      2685388  r1i7n14    12
    [0] MPI startup(): 13      2685389  r1i7n14    13
    [0] MPI startup(): 14      2685390  r1i7n14    14
    [0] MPI startup(): 15      2685391  r1i7n14    15
    [0] MPI startup(): 16      2685392  r1i7n14    16
    [0] MPI startup(): 17      2685393  r1i7n14    17
    [0] MPI startup(): 18      2685394  r1i7n14    18
    [0] MPI startup(): 19      2685395  r1i7n14    19
    [0] MPI startup(): 20      2685396  r1i7n14    20
    [0] MPI startup(): 21      2685397  r1i7n14    21
    [0] MPI startup(): 22      2685398  r1i7n14    22
    [0] MPI startup(): 23      2685399  r1i7n14    23
    [0] MPI startup(): 24      2685400  r1i7n14    24
    [0] MPI startup(): 25      2685401  r1i7n14    25
    [0] MPI startup(): 26      2685402  r1i7n14    26
    [0] MPI startup(): 27      2685403  r1i7n14    27
    [0] MPI startup(): 28      2685404  r1i7n14    28
    [0] MPI startup(): 29      2685405  r1i7n14    29
    [0] MPI startup(): 30      2685406  r1i7n14    30
    [0] MPI startup(): 31      2685407  r1i7n14    31
    [0] MPI startup(): 32      2685408  r1i7n14    32
    [0] MPI startup(): 33      2685409  r1i7n14    33
    [0] MPI startup(): 34      2685410  r1i7n14    34
    [0] MPI startup(): 35      2685411  r1i7n14    35
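Working out the rank-to-socket mapping by hand (matching each rank's Pin cpu against the "NUMA nodeN CPU(s)" lists) can be automated. Below is a minimal bash/awk sketch. It assumes each rank is pinned to a single CPU, as in the output above, and that both the lscpu output and the I_MPI_DEBUG=4 output are available as files. The function name rank_to_numa is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: map each MPI rank to the NUMA node (here, the socket) of the CPU it
# is pinned to, by combining lscpu output with the rank/pinning table that
# Intel MPI prints when I_MPI_DEBUG=4 is set.
# Assumptions: each rank is pinned to a single CPU, and both inputs are
# available as files. The function name rank_to_numa is hypothetical.

# rank_to_numa LSCPU_OUTPUT MPI_DEBUG_LOG
# Prints one line per rank: "<rank> <node name> <pin cpu> <NUMA node>".
rank_to_numa() {
    awk '
        # Pass 1 (first file, lscpu): record which NUMA node owns each CPU ID.
        FNR == NR {
            if ($1 == "NUMA" && $2 ~ /^node[0-9]+$/) {
                node = substr($2, 5)            # "node1" -> "1"
                n = split($NF, parts, ",")      # "18-35,54-71" -> two ranges
                for (i = 1; i <= n; i++) {
                    if (split(parts[i], r, "-") == 2) {
                        for (c = r[1] + 0; c <= r[2] + 0; c++) numa[c] = node
                    } else {
                        numa[parts[i] + 0] = node
                    }
                }
            }
            next
        }
        # Pass 2 (second file, I_MPI_DEBUG output): table rows look like
        # "[0] MPI startup(): <rank> <pid> <node name> <pin cpu>".
        $2 == "MPI" && $3 == "startup():" && NF == 7 &&
        $4 ~ /^[0-9]+$/ && $7 ~ /^[0-9]+$/ {
            cpu = $7 + 0
            print $4, $6, $7, ((cpu in numa) ? numa[cpu] : "?")
        }
    ' "$1" "$2"
}

# Allow direct use as a script: ./rank_to_numa.sh lscpu.txt run.log
if [ "$#" -eq 2 ]; then
    rank_to_numa "$1" "$2"
fi
```

For example, put export I_MPI_DEBUG=4 in the batch script before srun (Slurm passes the batch script's environment to srun by default). Then run rank_to_numa <(lscpu) slurm-<jobid>.out on the job's output file. Take the lscpu output from a node of the same type as the compute nodes, because CPU numbering can differ between node types. A "?" in the last column means that the pinned CPU did not appear in any NUMA node line.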