How can I measure how much QPI/UPI bandwidth a process is using between two NUMA nodes in Linux?
Let's say my process has a thread on NUMA node 0 and another thread on NUMA node 1, and each thread accesses its data on the other NUMA node, stressing the QPI/UPI link. How can I measure this bandwidth usage?
I have a machine with 2x Intel Skylake processors, which use UPI technology, but I think the solution would be the same for QPI as well (not sure!).
You want to measure the traffic (bandwidth) generated by memory accesses between two Non-Uniform Memory Access (NUMA) nodes (aka 'remote memory accesses' or 'NUMA accesses'). When a processor needs to access data stored in memory attached to a different processor, a point-to-point processor interconnect such as the Intel Ultra Path Interconnect (UPI) is used.
Collecting the UPI (or QPI) bandwidth for a specific process/thread can get tricky.
Processor Counter Monitor (PCM) provides a number of command-line utilities for real-time monitoring. For instance, the pcm binary displays a per-socket UPI traffic estimation. Depending on the required precision (and on the NUMA traffic generated by other processes), it might be enough to see whether the UPI links are saturated.
Intel Memory Latency Checker (MLC) can be used as a workload to check how PCM behaves when maximum traffic is generated between two NUMA nodes.
For instance, using the workload generated by ./mlc --bandwidth_matrix -t15
(during a remote-access phase), PCM displays the following on my 2-socket (Intel Cascade Lake) server node:
Intel(r) UPI data traffic estimation in bytes (data traffic coming to CPU/socket through UPI links):

               UPI0     UPI1     UPI2   |   UPI0   UPI1   UPI2
---------------------------------------------------------------------------------------------------------------
 SKT    0     17 G     17 G        0    |    73%    73%     0%
 SKT    1   6978 K   7184 K        0    |     0%     0%     0%
---------------------------------------------------------------------------------------------------------------
Total UPI incoming data traffic: 34 G    UPI data traffic/Memory controller traffic: 0.96

Intel(r) UPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through UPI links):

               UPI0     UPI1     UPI2   |   UPI0   UPI1   UPI2
---------------------------------------------------------------------------------------------------------------
 SKT    0   8475 M   8471 M        0    |    35%    35%     0%
 SKT    1     21 G     21 G        0    |    91%    91%     0%
---------------------------------------------------------------------------------------------------------------
Total UPI outgoing data and non-data traffic: 59 G

MEM (GB)->|  READ |  WRITE | LOCAL | PMM RD | PMM WR | CPU energy | DIMM energy | LLCRDMISSLAT (ns) | UncFREQ (Ghz)
---------------------------------------------------------------------------------------------------------------
 SKT  0      0.19     0.05    92 %    0.00     0.00       87.58         13.28              582.98           2.38
 SKT  1     36.16     0.01     0 %    0.00     0.00       66.82         21.86             9698.13           2.40
---------------------------------------------------------------------------------------------------------------
   *        36.35     0.06     0 %    0.00     0.00      154.40         35.14              585.67           2.39
PCM also displays per-core remote traffic in MB/s (i.e. NUMA traffic); see the RMB column:
RMB : L3 cache external bandwidth satisfied by remote memory (in MBytes)
 Core (SKT) | EXEC | IPC  | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | L3OCC |  LMB |   RMB | TEMP
   0    0     0.04   0.04   1.00    1.00   1720 K   1787 K    0.04    0.55  0.0167  0.0173     800      1     777    49
   1    0     0.04   0.04   1.00    1.00   1750 K   1816 K    0.04    0.55  0.0171  0.0177     640      5     776    50
   2    0     0.04   0.04   1.00    1.00   1739 K   1828 K    0.05    0.55  0.0169  0.0178     720      0     777    50
   3    0     0.04   0.04   1.00    1.00   1721 K   1800 K    0.04    0.55  0.0168  0.0175     240      0     784    51
<snip>
---------------------------------------------------------------------------------------------------------------
 SKT  0       0.04   0.04   1.00    1.00     68 M     71 M    0.04    0.55  0.0168  0.0175   26800      8   31632    48
 SKT  1       0.02   0.88   0.03    1.00     66 K   1106 K    0.94    0.13  0.0000  0.0005   25920      4      15    52
---------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.03   0.06   0.51    1.00     68 M     72 M    0.05    0.54  0.0107  0.0113     N/A    N/A     N/A   N/A
The per-core remote traffic can be used to gather thread-level NUMA traffic.
You need to ensure that the threads generating the NUMA traffic are bound to dedicated cores. That can be done programmatically, or you can rebind the threads with tools like hwloc-bind.
Ensure other processes are bound to different CPU cores (scripts like cpusanitizer might be useful to periodically scan all processes and modify their CPU core affinity). Note: pay attention to hyperthreads; you don't want the threads you monitor to share their CPU cores with other processes.
Check the remote traffic (PCM RMB column) generated on the cores to which you attached the threads you want to monitor.