linux, profiling, bandwidth, memory-profiling, numa

How to monitor NUMA interconnect (QPI/UPI) bandwidth usage of a process in Linux?


How can I measure how much QPI/UPI bandwidth a process is using between two NUMA nodes in Linux?

Let's say my process has one thread on NUMA node 0 and another thread on NUMA node 1, each accessing its data on the other NUMA node and thereby loading the QPI/UPI links. How can I measure this bandwidth usage?

I have a machine with 2x Intel Skylake processors, which use UPI technology, but I think the solution would be the same for QPI as well (not sure!).


Solution

  • You want to measure the traffic (bandwidth) generated by memory accesses between two Non-Uniform Memory Access (NUMA) nodes (a.k.a. 'remote memory accesses' or 'NUMA accesses'). When a processor needs to access data stored in memory managed by a different processor, a point-to-point processor interconnect such as the Intel Ultra Path Interconnect (UPI) is used.
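
    To see which cores and how much memory belong to each NUMA node (and the relative access costs between nodes), it can help to dump the topology first. A quick check, assuming the numactl and hwloc packages are installed:

    # Show per-node CPUs, memory sizes and the node distance matrix
    numactl --hardware
    # Textual view of sockets, cores and NUMA nodes (from hwloc)
    lstopo --no-io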

    Collecting the UPI (or QPI) bandwidth for a specific process/thread can get tricky.

    Per-UPI-link bandwidth (CPU socket granularity)

    Processor Counter Monitor (PCM) provides a number of command-line utilities for real-time monitoring. For instance, the pcm binary displays a per-socket UPI traffic estimation. Depending on the precision you need (and on the NUMA traffic generated by other processes), this may be enough to tell whether the UPI links are saturated.

    Intel Memory Latency Checker (MLC) can be used as a workload to check how PCM behaves when maximum traffic is generated between two NUMA nodes.
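
    A minimal way to try this, assuming both tools are unpacked in the current directory (both typically need root, since PCM reads uncore counters through the msr module and MLC tweaks the hardware prefetchers):

    # Terminal 1: per-socket UPI utilisation, refreshed every second
    sudo ./pcm 1

    # Terminal 2: saturate the cross-node links with remote accesses
    sudo ./mlc --bandwidth_matrix -t15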

    For instance, using the workload generated by ./mlc --bandwidth_matrix -t15 (during a remote-access phase), PCM displays the following on my 2-socket (Intel Cascade Lake) server:

    Intel(r) UPI data traffic estimation in bytes (data traffic coming to CPU/socket through UPI links):
    
                   UPI0     UPI1     UPI2    |  UPI0   UPI1   UPI2  
    ---------------------------------------------------------------------------------------------------------------
     SKT    0       17 G     17 G      0     |   73%    73%     0%  
     SKT    1     6978 K   7184 K      0     |    0%     0%     0%  
    ---------------------------------------------------------------------------------------------------------------
    Total UPI incoming data traffic:   34 G     UPI data traffic/Memory controller traffic: 0.96
    
    Intel(r) UPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through UPI links):
    
                   UPI0     UPI1     UPI2    |  UPI0   UPI1   UPI2  
    ---------------------------------------------------------------------------------------------------------------
     SKT    0     8475 M   8471 M      0     |   35%    35%     0%  
     SKT    1       21 G     21 G      0     |   91%    91%     0%  
    ---------------------------------------------------------------------------------------------------------------
    Total UPI outgoing data and non-data traffic:   59 G
    MEM (GB)->|  READ |  WRITE | LOCAL | PMM RD | PMM WR | CPU energy | DIMM energy | LLCRDMISSLAT (ns) UncFREQ (Ghz)
    ---------------------------------------------------------------------------------------------------------------
     SKT   0     0.19     0.05   92 %      0.00      0.00      87.58      13.28         582.98 2.38
     SKT   1    36.16     0.01    0 %      0.00      0.00      66.82      21.86         9698.13 2.40
    ---------------------------------------------------------------------------------------------------------------
           *    36.35     0.06    0 %      0.00      0.00     154.40      35.14         585.67 2.39
    

    Monitoring NUMA traffic (CPU core granularity)

    PCM also displays per-core remote traffic in MB/s (i.e. NUMA traffic); see the RMB column:

    RMB : L3 cache external bandwidth satisfied by remote memory (in MBytes)

     Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI |   L3OCC |   LMB  |   RMB  | TEMP
    
       0    0     0.04   0.04   1.00    1.00    1720 K   1787 K    0.04    0.55  0.0167  0.0173      800        1      777     49
       1    0     0.04   0.04   1.00    1.00    1750 K   1816 K    0.04    0.55  0.0171  0.0177      640        5      776     50
       2    0     0.04   0.04   1.00    1.00    1739 K   1828 K    0.05    0.55  0.0169  0.0178      720        0      777     50
       3    0     0.04   0.04   1.00    1.00    1721 K   1800 K    0.04    0.55  0.0168  0.0175      240        0      784     51
    <snip>
    ---------------------------------------------------------------------------------------------------------------
     SKT    0     0.04   0.04   1.00    1.00      68 M     71 M    0.04    0.55  0.0168  0.0175    26800        8    31632     48
     SKT    1     0.02   0.88   0.03    1.00      66 K   1106 K    0.94    0.13  0.0000  0.0005    25920        4       15     52
    ---------------------------------------------------------------------------------------------------------------
     TOTAL  *     0.03   0.06   0.51    1.00      68 M     72 M    0.05    0.54  0.0107  0.0113     N/A     N/A     N/A      N/A
    

    This per-core remote traffic can be used to gather thread-level NUMA traffic.
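
    To record it over time rather than reading the live display, PCM's CSV output mode can be post-processed for the cores of interest. A sketch, assuming a PCM build that supports the -csv option; the position of the RMB columns varies between PCM versions, so locate them in the header first (the column number 42 below is just a placeholder):

    # Log one sample per second to a file
    sudo ./pcm 1 -csv=pcm.csv

    # Find which columns hold RMB (metric names are on the second header row)
    head -2 pcm.csv | tail -1 | tr ',' '\n' | grep -n 'RMB'

    # Print one of those columns for every sample (adjust the index)
    awk -F, 'NR > 2 { print $42 }' pcm.csv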

    Method to estimate NUMA throughput generated between threads

    1. You need to ensure that the threads generating the NUMA traffic are bound to dedicated cores. That can be done programmatically, or you can rebind the threads using tools like hwloc-bind (see the sketch after this list).

    2. Ensure other processes are bound to different CPU cores (scripts like cpusanitizer might be useful to periodically scan all processes and modify their CPU core affinity). Note: pay attention to hyperthreads; you don't want the threads you monitor to share physical cores with other processes.

    3. Check the remote traffic (PCM RMB column) generated on the cores to which you pinned the threads you want to monitor.
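
    As an illustration of steps 1 and 2, a minimal sketch using standard tools; the core numbers, $PID, $TID, $OTHER_PID and ./myapp are placeholders to adapt to your machine:

    # Step 1: launch the process with its threads restricted to core 4
    hwloc-bind core:4 -- ./myapp

    # Or rebind threads of an already-running process: list its thread
    # IDs (TID) and the CPU each one last ran on (PSR) ...
    ps -L -o tid,psr,comm -p "$PID"
    # ... then pin a given thread to core 4
    taskset -cp 4 "$TID"

    # Step 2: keep other workloads away from the monitored cores,
    # e.g. confine another process to cores 8-15
    taskset -cp 8-15 "$OTHER_PID"

    # Hyperthread check: see which logical CPU shares the physical
    # core with CPU 4, and keep that sibling idle as well
    cat /sys/devices/system/cpu/cpu4/topology/thread_siblings_list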