numa, memory-bandwidth

Does NUMA impact memory bandwidth, or just latency?


I have a problem that is memory bandwidth limited -- I need to read a lot (many GB) of data sequentially from RAM, do some quick processing and write it sequentially to a different location in RAM. Memory latency is not a concern.

Is there any benefit from dividing the work between two or more cores in different NUMA zones? Equivalently, does working across zones reduce the available bandwidth?


Solution

  • For bandwidth-limited, multi-threaded code, the behavior in a NUMA system will primarily depend on how "local" each thread's data accesses are, and secondarily on the details of the remote accesses.

    In a typical 2-socket server system, the local memory bandwidth available to two NUMA nodes is twice that available to a single node. (But remember that it may take many threads running on many cores to reach asymptotic bandwidth for each socket.)

    The STREAM Benchmark, for example, is typically run in a configuration that allows almost all accesses from every thread to be "local". This is implemented by assuming "first touch" NUMA placement -- when allocated memory is first written, the OS has to create mappings from the process virtual address space to physical addresses, and (by default) the OS chooses physical addresses that are in the same NUMA node as the core that executed the store instruction.
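    As an illustration, here is a minimal C/OpenMP sketch of relying on first-touch placement (the array size and the pinning settings are assumptions, not part of the original answer). The initialization loop and the work loop use the same static schedule, so with threads pinned (e.g. OMP_PROC_BIND=close, OMP_PLACES=cores) each thread mostly touches pages that the OS placed on its own NUMA node; compile with something like `gcc -O2 -fopenmp`.

    ```c
    #include <stdlib.h>

    #define N (1L << 28)             /* 2^28 doubles, ~2 GiB per array */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);

        /* First touch: each thread writes the pages it will use later, so the
         * OS (by default) backs them with memory on that thread's NUMA node. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) {
            a[i] = 1.0;
            b[i] = 0.0;
        }

        /* The same static schedule gives each thread the same index range, so
         * this bandwidth-limited pass is almost entirely node-local. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            b[i] = 2.0 * a[i];

        free(a);
        free(b);
        return 0;
    }
    ```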

    "Local" bandwidth (to DRAM) in most systems is approximately symmetric (for reads and writes) and relatively easy to understand. "Remote" bandwidth is much more asymmetric for reads and writes, and there is usually significant contention between the read/write commands going between the chips and the data moving between the chips. The overall ratio of local to remote bandwidth also varies significantly across processor generations. For some processors (e.g., Xeon E5 v3 and probably v4), the interconnect is relatively fast, so jobs with poor locality can often be run with all of the memory interleaved between the two sockets. Local bandwidths have increased significantly since then, with more recent processors generally strongly favoring local access.

    As an example, on an Intel Xeon Platinum 8160 system (2 UPI links between the chips), the local read bandwidth of each socket is several times the remote read bandwidth available over the UPI links in one direction.
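    One rough way to observe the local/remote gap on such a machine is to compare a simple copy from a node-local buffer against a copy from a buffer placed on the other node via libnuma (link with -lnuma). This is only a sketch under the assumption of a 2-node system, not a calibrated benchmark; the buffer size and node numbers are arbitrary.

    ```c
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <numa.h>

    #define BYTES (1UL << 30)                      /* 1 GiB per buffer */

    static double copy_gbs(void *dst, const void *src)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, src, BYTES);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        return 2.0 * BYTES / s / 1e9;              /* count read + write bytes */
    }

    int main(void)
    {
        if (numa_available() < 0 || numa_num_configured_nodes() < 2) {
            fprintf(stderr, "needs a NUMA system with at least two nodes\n");
            return 1;
        }
        numa_run_on_node(0);                       /* run this thread on node 0 */

        void *dst        = numa_alloc_onnode(BYTES, 0);
        void *src_local  = numa_alloc_onnode(BYTES, 0);
        void *src_remote = numa_alloc_onnode(BYTES, 1);
        memset(dst, 0, BYTES);                     /* touch pages so they are  */
        memset(src_local, 1, BYTES);               /* actually backed by memory */
        memset(src_remote, 1, BYTES);              /* on the requested nodes   */

        printf("local  source: %.1f GB/s\n", copy_gbs(dst, src_local));
        printf("remote source: %.1f GB/s\n", copy_gbs(dst, src_remote));

        numa_free(dst, BYTES);
        numa_free(src_local, BYTES);
        numa_free(src_remote, BYTES);
        return 0;
    }
    ```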

    It gets more complicated with combined read and write traffic between sockets, because the read traffic from node 0 to node 1 competes with the write traffic from node 1 to node 0 (the data returning for node 0's reads and the data for node 1's writes travel in the same direction over the links), etc.

    Of course different ratios of local to remote accesses will change the scaling. Timing can also be an issue -- if all of the threads do their local accesses in one phase and their remote accesses in another phase, there will be more contention during the remote-access phase than if the threads' remote accesses were spread out over time. The sketch below shows one way to compare the two cases.
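    A sketch of that timing effect, again assuming a 2-node system and libnuma (compile with something like `gcc -O2 -pthread ... -lnuma`): several threads pinned to node 0 each copy a buffer whose source pages live on node 1, first one at a time and then all at once. The thread count, buffer size, and node numbers are arbitrary assumptions, and how pronounced the difference is depends on how many concurrent threads it takes to saturate the inter-socket links.

    ```c
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <pthread.h>
    #include <numa.h>

    #define NTHREADS 8               /* enough threads to stress the links   */
    #define BYTES    (1UL << 28)     /* 256 MiB per buffer                   */

    struct job { void *dst, *src; double gbs; };

    static void *remote_copy(void *arg)
    {
        struct job *j = arg;
        struct timespec t0, t1;
        numa_run_on_node(0);                       /* all threads on node 0  */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(j->dst, j->src, BYTES);             /* source lives on node 1 */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        j->gbs = BYTES / s / 1e9;                  /* remote-read GB/s       */
        return NULL;
    }

    static void run(struct job *jobs, int overlapped)
    {
        pthread_t tid[NTHREADS];
        double total = 0.0;
        for (int i = 0; i < NTHREADS; i++) {
            pthread_create(&tid[i], NULL, remote_copy, &jobs[i]);
            if (!overlapped) pthread_join(tid[i], NULL);   /* one at a time  */
        }
        if (overlapped)
            for (int i = 0; i < NTHREADS; i++) pthread_join(tid[i], NULL);
        for (int i = 0; i < NTHREADS; i++) total += jobs[i].gbs;
        printf("%s: %.1f GB/s average per thread\n",
               overlapped ? "overlapped" : "staggered ", total / NTHREADS);
    }

    int main(void)
    {
        if (numa_available() < 0 || numa_num_configured_nodes() < 2) return 1;

        struct job jobs[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            jobs[i].dst = numa_alloc_onnode(BYTES, 0);   /* local target     */
            jobs[i].src = numa_alloc_onnode(BYTES, 1);   /* remote source    */
            memset(jobs[i].dst, 0, BYTES);               /* back the pages   */
            memset(jobs[i].src, 1, BYTES);
        }
        run(jobs, 0);   /* remote accesses spread out in time                */
        run(jobs, 1);   /* remote accesses all at once: more link contention */
        return 0;
    }
    ```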