c++, linux, performance-testing, microbenchmark, memory-bandwidth

Simple streaming loop shows higher effective B/W than DRAM B/W for small enough problems


I'm doing some "cold" micro-benchmarking, i.e. a function is called, say, 50 times, its data is newly allocated in each run, and each worker thread zeroes the first datum in each memory page of the data assigned to it, thereby ruling out the possibility of measuring page faults.

Here's the minimum reproducer (for Linux systems):

/// \file This file contains a micro-benchmark program for the SAXPY loop.

#include <string> // std::stoul
#include <vector> // std::vector
#include <iostream> // std::cerr
#include <cstdint> // std::uint64_t
#include <cstdlib> // std::abort, std::free
#include <algorithm> // std::min, std::max
#include <limits> // std::numeric_limits

#include <omp.h>

#include <stdlib.h> // posix_memalign
#include <unistd.h> // sysconf

void saxpy(const float& a, const float* vec_x, float* vec_y, const std::size_t& n)
{
    #pragma omp parallel for schedule(static)
    for (auto ii = std::size_t{}; ii < n; ++ii)
    {
        vec_y[ii] += vec_x[ii] * a; // fp_count: 2, traffic: 2+1
    }
}

int main(int argc, char** argv)
{
    // extract the problem size
    if (argc < 2)
    {
        std::cerr << "Please provide the problem size as command line argument." << std::endl;
        return 1;
    }
    const auto n = static_cast<std::size_t>(std::stoul(argv[1]));
    if (n < 1)
    {
        std::cerr << "Zero valued problem size provided. The program will now be aborted." << std::endl;
        return 1;
    }
    if (n * sizeof(float) / (1024 * 1024 * 1024) > 40) // let's assume there's only 64 GiB of RAM
    {
        std::cerr << "Problem size is too large. The program will now be aborted." << std::endl;
        return 1;
    }
    
    // report
    std::cout << "Starting runs with problem size n=" << n << ".\nThread count: " << omp_get_max_threads() << "."
        << std::endl;
    
    // details
    const auto page_size = sysconf(_SC_PAGESIZE);
    const auto page_size_float = page_size / sizeof(float);
    
    // experiment loop
    const auto experiment_count = 50;
    const auto warm_up_count = 10;
    const auto run_count = experiment_count + warm_up_count;
    auto durations = std::vector(experiment_count, std::numeric_limits<std::uint64_t>::min());
    const auto a = 10.f;
    float* vec_x = nullptr;
    float* vec_y = nullptr;
    for (auto run_index = std::size_t{}; run_index < run_count; ++run_index)
    {
        // allocate
        const auto alloc_status0 = posix_memalign(reinterpret_cast<void**>(&vec_x), page_size, n * sizeof(float));
        const auto alloc_status1 = posix_memalign(reinterpret_cast<void**>(&vec_y), page_size, n * sizeof(float));
        if (alloc_status0 != 0 || alloc_status1 != 0 || vec_x == nullptr || vec_y == nullptr)
        {
            std::cerr << "Fatal error, failed to allocate memory." << std::endl;
            std::abort();
        }
        
        // "first touch"
        #pragma omp parallel for schedule(static)
        for (auto ii = std::size_t{}; ii < n; ii += page_size_float)
        {
            vec_x[ii] = 0.f;
            vec_y[ii] = 0.f;
        }
        
        // run experiment
        const auto t1 = omp_get_wtime();
        saxpy(a, vec_x, vec_y, n);
        const auto t2 = omp_get_wtime();
        const auto duration_in_us = static_cast<std::int64_t>((t2 - t1) * 1E+6);
        if (duration_in_us <= 0)
        {
            std::cerr << "Fatal error, no time elapsed in the test function." << std::endl;
            std::abort();
        }
        if (run_index + 1 > warm_up_count)
        {
            durations[run_index - warm_up_count] = static_cast<std::uint64_t>(duration_in_us);
        }
        
        // deallocate
        std::free(vec_x);
        std::free(vec_y);
        vec_x = nullptr;
        vec_y = nullptr;
    }
    
    // statistics
    auto min = std::numeric_limits<std::uint64_t>::max();
    auto max = std::uint64_t{};
    auto mean = std::uint64_t{};
    for (const auto& duration : durations)
    {
        min = std::min(min, duration);
        max = std::max(max, duration);
        mean += duration;
    }
    mean /= experiment_count;
    
    // report duration
    std::cout << "Mean duration:      " << mean << " us\n"
        << "Min. duration:      " << min << " us\n"
        << "Max. duration:      " << max << " us.\n";
    
    // compute effective B/W
    const auto traffic = 3 * n * sizeof(float);
    constexpr auto inv_gibi = 1.0 / static_cast<double>(1024 * 1024 * 1024);
    const auto traffic_in_gib = static_cast<double>(traffic) * inv_gibi;
    std::cout << "Traffic per run:    " << traffic << " B (" << traffic_in_gib << " GiB)\n"
        << "Mean effective B/W: " << traffic_in_gib / (static_cast<double>(mean) * 1E-6) << " GiB/s\n"
        << "Min. effective B/W: " << traffic_in_gib / (static_cast<double>(max) * 1E-6) << " GiB/s\n"
        << "Max. effective B/W: " << traffic_in_gib / (static_cast<double>(min) * 1E-6) << " GiB/s\n"
        << std::endl;
    
    return 0;
}

Now I'm interested in seeing how the effective bandwidth behaves w.r.t. the problem size. So I did some runs with the following output:

Starting run for n=1000000.
Starting runs with problem size n=1000000.
Thread count: 6.
Mean duration:      148 us
Min. duration:      117 us
Max. duration:      417 us.
Traffic per run:    12000000 B (0.0111759 GiB)
Mean effective B/W: 75.5126 GiB/s
Min. effective B/W: 26.8006 GiB/s
Max. effective B/W: 95.5203 GiB/s

----------------------

Starting run for n=10000000.
Starting runs with problem size n=10000000.
Thread count: 6.
Mean duration:      3311 us
Min. duration:      3262 us
Max. duration:      3382 us.
Traffic per run:    120000000 B (0.111759 GiB)
Mean effective B/W: 33.7538 GiB/s
Min. effective B/W: 33.0452 GiB/s
Max. effective B/W: 34.2608 GiB/s

----------------------

Starting run for n=100000000.
Starting runs with problem size n=100000000.
Thread count: 6.
Mean duration:      32481 us
Min. duration:      32137 us
Max. duration:      36431 us.
Traffic per run:    1200000000 B (1.11759 GiB)
Mean effective B/W: 34.4074 GiB/s
Min. effective B/W: 30.6768 GiB/s
Max. effective B/W: 34.7757 GiB/s

----------------------

As a reference, I used Intel® MLC which produced the following output:

Intel(R) Memory Latency Checker - v3.9a
*** Unable to modify prefetchers (try executing 'modprobe msr')
*** So, enabling random access for latency measurements
Measuring idle latencies (in ns)...
        Numa node
Numa node        0  
       0      57.1  

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :  41302.4 
3:1 Reads-Writes :  37185.7 
2:1 Reads-Writes :  36779.7 
1:1 Reads-Writes :  36790.3 
Stream-triad like:  37241.2

Question: How is it that for a problem size of n=1000000, we have an effective bandwidth (~75 GiB/s) higher than the maximum read-write bandwidth (~37 GiB/s) of my system? If you need additional information, please ask and I will gladly provide it. Thanks in advance.

Additional info: This was run on the Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz with 6 cores (no SMT) with cache sizes 32 KiB, 1.5 MiB and 9 MiB and highest vector flag AVX2.


Solution

  • The page-fault handler zeroes each page the first time it is written, so your init loop actually ends up touching all cache lines in all pages (via the kernel's zeroing), making some cache hits possible.


    The 12 MB (11.44 MiB) of traffic per run is only a bit larger than the 9 MiB L3 cache of your hexa-core Coffee Lake i5-8400. You can still be getting significant L3 hits on "lucky" runs, and it's normal for multi-threaded L3 bandwidth to be much higher than DRAM bandwidth.

    Most plots of bandwidth vs. working-set size fit a curve and don't sample many points very close to the cache size, but I think even if you did, it's normal that the cliff isn't very steep, especially around the L3 size (part of the reason for the curve being system-wide contention for L3 from other cores). And getting some hits from an array slightly larger than L3 is more likely than with L2: different eviction algorithms, including Intel's adaptive replacement for L3 since Ivy Bridge, can decide to keep some lines in a set around even while many others are getting evicted. And yes, it makes sense that some runs get a lot luckier than others.
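    For concreteness, here is where the numbers for n=1000000 come from (traffic as counted in the code, mean duration from the first output block above):

    \[
    \text{traffic} = 3\,n \cdot \texttt{sizeof(float)} = 3 \times 10^{6} \times 4\ \text{B} = 12\ \text{MB} \approx 11.44\ \text{MiB} \approx 0.01118\ \text{GiB},
    \qquad
    \text{B/W}_{\text{mean}} = \frac{0.01118\ \text{GiB}}{148 \times 10^{-6}\ \text{s}} \approx 75.5\ \text{GiB/s}.
    \]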

    How are we getting cache hits when we only wrote one line per page?

    These are freshly-allocated pages from the kernel, because glibc's allocator uses mmap(MAP_ANONYMOUS) for large allocations (and gives such allocations back to the OS on free, rather than keeping them in a user-space free list, when they're above that size threshold). Linux mmap is "lazy"¹: there is no physical page backing the virtual allocation until you touch it.
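    Here is a minimal, self-contained sketch (an illustration, not part of the reproducer) that makes this laziness visible by reading the resident-set size from /proc/self/statm: the virtual mapping exists right after mmap, but resident memory only grows once pages are actually touched.

    #include <sys/mman.h> // mmap, munmap, MAP_ANONYMOUS
    #include <unistd.h>   // sysconf
    #include <fstream>    // std::ifstream
    #include <iostream>   // std::cout
    #include <cstddef>    // std::size_t

    long resident_pages()
    {
        // The second field of /proc/self/statm is the resident set size in pages.
        std::ifstream statm("/proc/self/statm");
        long size = 0;
        long resident = 0;
        statm >> size >> resident;
        return resident;
    }

    int main()
    {
        const auto n = std::size_t{100000000}; // 100M floats = 400 MB
        const auto bytes = n * sizeof(float);
        auto* vec = static_cast<float*>(mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
        if (vec == MAP_FAILED) { return 1; }
        std::cout << "after mmap:  " << resident_pages() << " resident pages\n";

        // First touch: the kernel faults in and zeroes each page we store to.
        const auto page_floats = static_cast<std::size_t>(sysconf(_SC_PAGESIZE)) / sizeof(float);
        for (auto ii = std::size_t{}; ii < n; ii += page_floats) { vec[ii] = 0.f; }
        std::cout << "after touch: " << resident_pages() << " resident pages\n";

        munmap(vec, bytes);
        return 0;
    }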

    If the first access were a read, the page would be copy-on-write mapped to a shared physical page of zeros (so you could get L1d hits plus TLB misses, or L3 hits if the kernel uses transparent hugepages). See Idiomatic way of performance evaluation? for some more links about this.

    On a write (whether it's the first access, or comes after a read), the page-fault handler finds a free physical page, zeroes it, wires up the page tables to point to it, and returns to user-space so the store instruction can re-run. Zeroing the page leaves it hot in all levels of cache (or mostly just in L3 if it's a 2M hugepage).

    Fault-around will do the same for surrounding pages, perhaps reducing the number of page faults to something like 1/8th of the number of pages touched with sequential access. But all of those pages will still be written by the kernel as a side effect of the init loop.
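    A sketch (an illustration, not something the original program does) that measures this directly: count minor page faults with getrusage() around a first-touch loop and compare against the number of pages touched. The gap between the two tells you how much batching (fault-around and/or transparent hugepages) you are actually getting.

    #include <sys/mman.h>     // mmap, munmap
    #include <sys/resource.h> // getrusage, RUSAGE_SELF
    #include <unistd.h>       // sysconf
    #include <iostream>
    #include <cstddef>

    long minor_faults()
    {
        rusage usage{};
        getrusage(RUSAGE_SELF, &usage);
        return usage.ru_minflt;
    }

    int main()
    {
        const auto n = std::size_t{10000000};
        const auto bytes = n * sizeof(float);
        const auto page_floats = static_cast<std::size_t>(sysconf(_SC_PAGESIZE)) / sizeof(float);
        const auto pages_touched = (n + page_floats - 1) / page_floats;

        auto* vec = static_cast<float*>(mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
        if (vec == MAP_FAILED) { return 1; }

        const auto faults_before = minor_faults();
        for (auto ii = std::size_t{}; ii < n; ii += page_floats) { vec[ii] = 0.f; } // sequential first touch
        const auto faults_after = minor_faults();

        std::cout << "pages touched: " << pages_touched << "\n"
                  << "minor faults:  " << (faults_after - faults_before) << "\n";

        munmap(vec, bytes);
        return 0;
    }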

    There's no way to get non-zeroed pages from the kernel, for security reasons that should be obvious if you think about the consequences of leaking contents of kernel and other-user memory pages to some user-space process. (For some embedded system use-cases, Linux has a MAP_UNINITIALIZED flag, but kernel support for it is disabled in normal builds.)

    How to avoid cached memory

    You could use _mm_stream_ps NT stores in your init loop. That bypasses cache and evicts if data was previously hot. (Eviction isn't guaranteed on Pentium-M or earlier IIRC, but it is on all x86-64.)
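    For example, a sketch of a zeroing/first-touch loop that uses NT stores instead of plain stores (assumptions: x86-64 with SSE, n a multiple of 4, and 16-byte-aligned pointers, which your page-aligned posix_memalign already guarantees). It zeroes the whole array rather than one float per page, so every line you later time is covered:

    #include <xmmintrin.h> // _mm_stream_ps, _mm_setzero_ps, _mm_sfence
    #include <cstddef>

    void zero_without_caching(float* vec, const std::size_t n)
    {
        const auto zero = _mm_setzero_ps();
        #pragma omp parallel for schedule(static)
        for (auto ii = std::size_t{}; ii < n; ii += 4)
        {
            _mm_stream_ps(vec + ii, zero); // NT store: writes around the cache and evicts the line if the kernel's zeroing left it hot
        }
        _mm_sfence(); // make the NT stores globally visible before the timed region
    }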

    Or use _mm_clflushopt after zeroing.
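    Or, keeping the init loop as it is, something along these lines (assumes CLFLUSHOPT support, which your Coffee Lake CPU should have, and compiling with -mclflushopt) to flush both arrays out of all cache levels just before the timed region:

    #include <immintrin.h> // _mm_clflushopt, _mm_sfence
    #include <cstddef>

    void flush_from_caches(float* vec, const std::size_t n)
    {
        constexpr auto line_bytes = std::size_t{64}; // cache-line size
        auto* raw = reinterpret_cast<char*>(vec);
        for (auto ii = std::size_t{}; ii < n * sizeof(float); ii += line_bytes)
        {
            _mm_clflushopt(raw + ii); // evict this line from every level of the hierarchy
        }
        _mm_sfence(); // order the flushes before the subsequent timed loads/stores
    }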


    With perf stat or time you'll probably see that your program spends a bit more than half its time in the kernel, like in your other recent question, where on my system it also spent more than half its time in sys rather than usr. (That's for the whole program, not just the timed regions, so it's not a bug per se, but it is a way to see some of what's going on.)
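    If you want to see that split from inside the program instead, a small sketch (an addition, not something the original code does) using getrusage at the end of main():

    #include <sys/resource.h> // getrusage, RUSAGE_SELF
    #include <iostream>

    void report_cpu_time()
    {
        rusage usage{};
        getrusage(RUSAGE_SELF, &usage); // covers all threads of the process
        const auto usr = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec * 1e-6;
        const auto sys = usage.ru_stime.tv_sec + usage.ru_stime.tv_usec * 1e-6;
        std::cout << "user: " << usr << " s, system: " << sys << " s\n";
    }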

    Footnote 1: Unless you used MAP_POPULATE, but glibc malloc doesn't because it's convenient to be able to make huge allocations and only touch the part you need, among other reasons. Most programs that use mmap manually make the same choice.
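    For completeness, a sketch of what that would look like if you mapped the arrays yourself (hypothetical helper, Linux-specific): with MAP_POPULATE the pages are pre-faulted, and pre-zeroed, at mmap time rather than inside your first-touch loop or timed region.

    #include <sys/mman.h> // mmap, munmap, MAP_POPULATE
    #include <cstddef>

    float* allocate_prefaulted(const std::size_t n)
    {
        void* p = mmap(nullptr, n * sizeof(float), PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        return p == MAP_FAILED ? nullptr : static_cast<float*>(p);
    }
    // Release with munmap(ptr, n * sizeof(float)) instead of std::free.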