c++linuxgccraspberry-pi4high-resolution-clock

Raspberry PI benchmark timings oddities with std::chrono::steady_clock


I'm trying to benchmark a piece of DSP code on a Raspberry Pi 4 using the std::chrono::steady_clock, but the results I'm getting are peculiar. Because GNU profiling tools don't work on Raspberry Pi, I'm stuck with benchmarking to evaluate code optimizations, so this is rather a big deal.

What would cause performance to vary by 10% between executions of the benchmark program, while remaining consistent +/- 1% when the same test is run multiple times in the same execution of the program?

Results for a ~6-second benchmark vary by ~10%. But the peculiar thing is that the variance seems to be sticky for a particular execution of the benchmark. I run the benchmark three times in a row each time the program is run, and get roughly the same results +/- 1%. But when I re-run the program the results of the three benchmarks vary by +/- 10% from the previous run, but with each of the three results in the new run being +/- 1%.

e.g:

Run 1:
9:21:37. Performance: 0.0912333 x realtime
9:21:42. Performance: 0.0910667 x realtime
9:21:47. Performance: 0.0910667 x realtime

Run 2:
9:20:15. Performance: 0.106667 x realtime
9:20:21. Performance: 0.1062 x realtime
9:20:28. Performance: 0.106117 x realtime

Results vary randomly roughly between those two extremes for each run. But the peculiar thing here is that the results are consistent to +/- 1% between the three tests performed each time the program is run.

I'm an experienced programmer, so I get that benchmarks will vary somewhat. But the ~10% variance is unworkable for what I'm trying to do. And I am unable to come up with a reasonable theory as to why the variances changes from invocation to invocation.

The code under test is a machine-learning algorithm (LSTM->Dense), using hand-optimized neon intrinsics used to generate real-time audio. The bulk of the execution (~90%) is matrix and vector arithmetic using hand-optimized neon intrinsics. Data footprint is about 13kb (fits comfortably in L1 d-cache). Code footprint is unknown, but may not fit in L1 i-cache. Most of the code pipelines beautifully, so code may be running close L1-cache bandwidth limits. So far, optimizaton has resulted in an improvement from ~0.18 x realtime, to 0.093 x realtime. I think there's probably another ~15% improvement available, but the timing inaccuracies are getting in the way at this point. The code under test gets executed three times, taking ~0.3 x realtime, so further optimizations are in fact critical.

Things that have been checked:

Verified cpu frequency:

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 
performance
performance
performance
performance

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq 
1800000
1800000
1800000
1800000

Things that I don't know that you may be able to help with:

Technical Details:


Solution

  • Finally identified the source of the problem. The issue seems to be very mild competition for L1 cache contents, presumably from some background system process.

    Performance counters exhibit the same odd behavior as the benchmarks do: 3 consecutive runs each time the test program is launched show ~1% variance in benchmark results; but results vary by ~10% for launch to launch.

    It is peculiar that the differences in performance between runs of the test are consistent and persist over several seconds. But given how minor the disturbance of the L1 cache is, it's hard to guess what of a hundred-odd running system processes are interfering with the benchmark, and why that rather unfortunate pattern manifests, especially since they could be running at any scheduler priority.

    Results of performance counters measurements illustrate the problem: an average of ~30 extra L1 data cache misses per iteration of a function with 2,995 instructions accounts for a 10% variance in benchmark results. Surprisingly.

    I am unable to guess what sort of system process would pollute the L1 data cache at a rate that remains consistent over an 18 second period, but varies on larger timescales.

    The good news: the code under tests is remarkably close to optimal. (An LSTM unit with two substantial multiplies, and a significant number of vectorized ArcTan and Sigmoid function calls), which manages to use over 75% of available cache memory bandwidth, and issue almost two instructions per clock cycle. Woohoo!

    Test Data

    Performance counter measurements averaged per iteration of the test code. Each launch of the program runs the ~six-second benchmark three times.

    A good run of the test program:

    CpuClk      :           1,694
    L1D Access  :           1,244
    L1D Miss    :               6
    L1I Miss    :               0
    Instructions:           2,995
    L2 Access   :              12
    L2 Miss     :               0
    
    ---
    CpuClk      :           1,694
    L1D Access  :           1,244
    L1D Miss    :               6
    L1I Miss    :               0
    Instructions:           2,995
    L2 Access   :              12
    L2 Miss     :               0
    
    ---
    CpuClk      :           1,693
    L1D Access  :           1,244
    L1D Miss    :               6
    L1I Miss    :               0
    Instructions:           2,995
    L2 Access   :              12
    L2 Miss     :               0
    

    A bad run:

    CpuClk      :           1,797
    L1D Access  :           1,244
    L1D Miss    :              37
    L1I Miss    :               0
    Instructions:           2,995
    L2 Access   :              78
    L2 Miss     :               0
    
    ---
    CpuClk      :           1,794
    L1D Access  :           1,244
    L1D Miss    :              37
    L1I Miss    :               0
    Instructions:           2,995
    L2 Access   :              78
    L2 Miss     :               0
    
    ---
    CpuClk      :           1,797
    L1D Access  :           1,244
    L1D Miss    :              37
    L1I Miss    :               0
    Instructions:           2,995
    L2 Access   :              78
    L2 Miss     :               0