Tags: c++, linux, cpu, intel, cpu-architecture

Slowing down CPU Frequency by imposing memory stress


I stressed my system with stress-ng to see how that affects a program I wrote.

The program itself is a neural network written in C++, mainly composed of nested loops doing multiplications; it uses about 1 GB of RAM overall.

I imposed some memory stress on the system using:

stress-ng --vm 4 --vm-bytes 2G -t 100s

which creates 4 workers spinning on mmap, allocating 2 GB of RAM each. This slows down the execution of my program significantly (from about 150 ms to 250 ms). But the reason the program slows down does not seem to be lack of memory or memory bandwidth. Instead, the CPU frequency decreases from 3.4 GHz (without stress-ng) to 2.8 GHz (with stress-ng). CPU utilization stays about the same (99%), as expected.

I measured the CPU frequency using

sudo perf stat -B ./my_program
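
(The GHz figure perf prints next to the cycles count is just cycles divided by task-clock, i.e. the average clock speed while the program was actually running on a CPU.)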

Does anybody know why memory stress slows down the CPU?

My CPU is an Intel(R) Core(TM) i5-8250U and my OS is Ubuntu 18.04.

Kind regards, lpolari


Solution

  • Skylake-derived CPUs do lower their core clock speed when bottlenecked on loads / stores, at energy vs. performance settings that favour more powersaving. Surprisingly, you can construct artificial cases where this downclocking happens even with stores that all hit in L1d cache, or loads from uninitialized memory (still CoW-mapped to the same zero pages).

    Skylake introduced full hardware control of CPU frequency (hardware P-state = HWP): https://unix.stackexchange.com/questions/439340/what-are-the-implications-of-setting-the-cpu-governor-to-performance The frequency decision can take into account internal performance monitoring, which can notice things like spending most cycles stalled, or what it's stalled on. I don't know exactly what heuristic Skylake uses.
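
    (On Linux with the intel_pstate driver you can check that HWP is actually in use: /sys/devices/system/cpu/intel_pstate/status should say "active", and /proc/cpuinfo lists flags like hwp and hwp_epp on CPUs that support it.)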

    You can reproduce this (see footnote 1 below) by looping over a large array without making any system calls. If the array is large (or you stride through cache lines in an artificial test), perf stat ./a.out will show that the average clock speed is lower than for normal CPU-bound loops.
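
    For example, here is a minimal C++ sketch of that kind of test (my own illustration, not the code used in footnote 1; the file name, buffer size and repeat count are arbitrary choices). It reads one byte per 64-byte cache line of a buffer much bigger than L3, makes no system calls inside the loop, and can be run under perf stat; whether and how far it downclocks depends on the CPU and the EPP setting discussed below.

    // stride.cpp -- memory-bound loop with no system calls in the hot part.
    // Build and run:  g++ -O2 stride.cpp -o stride && perf stat ./stride
    #include <cstddef>
    #include <vector>

    int main() {
        constexpr std::size_t kBytes = 512u * 1024 * 1024;   // much larger than L3
        constexpr std::size_t kLine  = 64;                    // stride one cache line
        std::vector<char> buf(kBytes, 1);                     // touch/initialize the pages

        volatile char sink = 0;        // volatile store keeps the loads from optimizing away
        for (int rep = 0; rep < 20; ++rep)
            for (std::size_t i = 0; i < kBytes; i += kLine)
                sink = buf[i];         // one load per cache line
        return 0;
    }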


    In theory, if memory is totally not keeping up with the CPU, lowering the core clock speed (and holding memory controller constant) shouldn't hurt performance much. In practice, lowering the clock speed also lowers the uncore clock speed (ring bus + L3 cache), somewhat worsening memory latency and bandwidth as well.

    Part of the latency of a cache miss is getting the request from the CPU core to the memory controller, and single-core bandwidth is limited by max concurrency (the number of outstanding requests one core can track) / latency. See also: Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?
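
    As a rough back-of-the-envelope illustration (ballpark numbers, not measurements from this system): with N line-fill buffers per core and a load-miss latency of L, one core can only have N cache lines in flight, so its DRAM bandwidth tops out around N * 64 bytes / L. With something like 12 fill buffers and ~70 ns latency, that's 12 * 64 B / 70 ns ≈ 11 GB/s, and since downclocking the uncore increases L, it directly lowers that ceiling.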

    e.g. my i7-6700k drops from 3.9GHz to 2.7GHz when running a microbenchmark that only bottlenecks on DRAM at default bootup settings. (Also it only goes up to 3.9GHz instead of 4.0 all-core or 4.2GHz with 1 or 2 cores active as configured in the BIOS, with the default balance_power EPP settings on boot or with balance_performance.)

    This default doesn't seem very good: it's too conservative for "client" chips, where a single core can nearly saturate DRAM bandwidth, but only at full clock speed. Or, looked at from the other POV, it's too aggressive about powersaving, especially for chips like my desktop with a high TDP (95 W) that can sustain full clock speed indefinitely, even when running power-hungry stuff like x265 video encoding that makes heavy use of AVX2.

    It might make more sense with a ULV 15W chip like your i5-8250U to try to leave more thermal / power headroom for when the CPU is doing something more interesting.


    This downclocking is governed by the Energy / Performance Preference (EPP) setting. It happens fairly strongly at the default balance_power setting. It doesn't happen at all at full performance, and some quick benchmarks indicate that balance_performance also avoids this powersaving slowdown. I use balance_performance on my desktop.

    "Client" (non-Xeon) chips before Ice Lake have all cores locked together so they run at the same clock speed (and will all run higher if even one of them is running something not memory bound, like a while(1) { _mm_pause(); } loop). But there's still an EPP setting for every logical core. I've always just changed the settings for all cores to keep them the same:

    On Linux, reading the settings:

    $ grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference
    /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference:balance_performance
    /sys/devices/system/cpu/cpufreq/policy1/energy_performance_preference:balance_performance
    ...
    /sys/devices/system/cpu/cpufreq/policy7/energy_performance_preference:balance_performance
    

    Writing the settings:

    echo balance_performance | sudo tee /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference
    

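    The accepted strings should be listed in energy_performance_available_preferences in the same policy directories (typically default, performance, balance_performance, balance_power, power); note that writes through sysfs only last until the next reboot.

    As mentioned above, on these client chips even a trivial pause spin loop on one core keeps the whole package at a higher clock. A minimal C++ sketch of such a spinner (my own illustration; which spare core to pin it to, e.g. with taskset, is up to you):

    // spin.cpp -- busy loop that is not memory-bound, to hold the shared core clock up.
    // Build:  g++ -O2 spin.cpp -o spin        Run pinned:  taskset -c 1 ./spin
    // Runs until killed (Ctrl-C).
    #include <immintrin.h>   // _mm_pause

    int main() {
        for (;;)
            _mm_pause();     // emits the x86 "pause" instruction; the core stays busy/active
    }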


    Footnote 1: experimental example:

    Store 1 dword per cache line, advancing through contiguous cache lines until end of buffer, then wrapping the pointer back to the start. Repeat for a fixed number of stores, regardless of buffer size.

    ;; t=testloop; nasm -felf64 "$t.asm" && ld "$t.o" -o "$t" && taskset -c 3 perf stat -d -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread ./"$t"
    
    ;; nasm -felf64 testloop.asm
    ;; ld -o testloop testloop.o
    ;; taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread -r1 ./testloop
    
    ; or idq.mite_uops 
    
    default rel
    %ifdef __YASM_VER__
    ;    CPU intelnop
    ;    CPU Conroe AMD
        CPU Skylake AMD
    %else
    %use smartalign
    alignmode p6, 64
    %endif
    
    global _start
    _start:
    
        lea        rdi, [buf]
        lea        rsi, [endbuf]
    ;    mov        rsi, qword endbuf           ; large buffer.  NASM / YASM can't actually handle a huge BSS and hit a failed assert (NASM) or make a binary that doesn't reserve enough BSS space.
    
        mov     ebp, 1000000000
    
    align 64
    .loop:
    %if 0
          mov  eax, [rdi]              ; LOAD
          mov  eax, [rdi+64]
    %else
          mov  [rdi], eax              ; STORE
          mov  [rdi+64], eax
    %endif
        add  rdi, 128
        cmp  rdi, rsi
        jae  .wrap_ptr        ; normally falls through, total loop = 4 fused-domain uops
     .back:
    
        dec ebp
        jnz .loop
    .end:
    
        xor edi,edi
        mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
        syscall       ; sys_exit_group(0)
    
    .wrap_ptr:
       lea  rdi, [buf]
       jmp  .back
    
    
    section .bss
    align 4096
    ;buf:    resb 2048*1024*1024 - 1024*1024     ; just under 2GiB so RIP-rel still works
    buf:    resb 1024*1024 / 64     ; 16kiB = half of L1d
    
    endbuf:
      resb 4096        ; spare space to allow overshoot
    

    Test system: Arch GNU/Linux, kernel 5.7.6-arch1-1. (And NASM 2.14.02, ld from GNU Binutils 2.34.0).

    Hyperthreading is enabled, but the system is idle and the kernel won't schedule anything on the other logical core (the sibling of the one I pinned it to), so it has a physical core to itself.

    However, this means perf is unwilling to use more programmable perf counters for one thread, so perf stat -d (to monitor L1d loads and replacement, and L3 hits / misses) would mean less accurate measurement for cycles and so on. It's negligible anyway: e.g. 424k L1-dcache-loads (probably in kernel page-fault handlers, interrupt handlers, and other overhead, because the loop has no loads). L1-dcache-load-misses is actually L1D.REPLACEMENT and is even lower, around 48k.

    I used a few perf events, including exe_activity.bound_on_stores ("Cycles where the Store Buffer was full and no outstanding load"). (See perf list for descriptions, and/or Intel's manuals for more.)

    EPP: balance_power: 2.7GHz downclock out of 3.9GHz

    EPP setting: balance_power with sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;do echo balance_power > "$i";done'

    There is throttling based on what the code is doing; with a pause loop on another core keeping clocks high (like the _mm_pause spinner sketched above), this code would run faster. The same goes for different instructions in the loop.

    # sudo ... balance_power
    $ taskset -c 3 perf stat -etask-clock:u,task-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,exe_activity.bound_on_stores -r1 ./"$t" 
    
     Performance counter stats for './testloop':
    
                779.56 msec task-clock:u              #    1.000 CPUs utilized          
                779.56 msec task-clock                #    1.000 CPUs utilized          
                     3      context-switches          #    0.004 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                     6      page-faults               #    0.008 K/sec                  
         2,104,778,670      cycles                    #    2.700 GHz                    
         2,008,110,142      branches                  # 2575.962 M/sec                  
         7,017,137,958      instructions              #    3.33  insn per cycle         
         5,217,161,206      uops_issued.any           # 6692.465 M/sec                  
         7,191,265,987      uops_executed.thread      # 9224.805 M/sec                  
           613,076,394      exe_activity.bound_on_stores #  786.442 M/sec                  
    
           0.779907034 seconds time elapsed
    
           0.779451000 seconds user
           0.000000000 seconds sys
    

    By chance, this happened to get exactly 2.7 GHz. Usually there's some noise or startup overhead and it's a little lower. Note that 5,217,161,206 front-end uops / 2,104,778,670 cycles = ~2.48 average uops issued per cycle, out of a pipeline width of 4, so this is not low-throughput code. The instruction count is higher because of macro-fused compare-and-branch. (I could have unrolled more so that even more of the instructions were stores and fewer were add and branch, but I didn't.)

    (I re-ran the perf stat command a couple times so the CPU wasn't just waking from low-power sleep at the start of the timed interval. There are still page faults in the interval, but 6 page faults are negligible over a 3/4 second benchmark.)

    balance_performance: full 3.9GHz, top speed for this EPP

    No throttling based on what the code is doing.

    # sudo ... balance_performance
    $ taskset -c 3 perf stat -etask-clock:u,task-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,exe_activity.bound_on_stores -r1 ./"$t" 
    
     Performance counter stats for './testloop':
    
                539.83 msec task-clock:u              #    0.999 CPUs utilized          
                539.83 msec task-clock                #    0.999 CPUs utilized          
                     3      context-switches          #    0.006 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                     6      page-faults               #    0.011 K/sec                  
         2,105,328,671      cycles                    #    3.900 GHz                    
         2,008,030,096      branches                  # 3719.713 M/sec                  
         7,016,729,050      instructions              #    3.33  insn per cycle         
         5,217,686,004      uops_issued.any           # 9665.340 M/sec                  
         7,192,389,444      uops_executed.thread      # 13323.318 M/sec                 
           626,115,041      exe_activity.bound_on_stores # 1159.827 M/sec                  
    
           0.540108507 seconds time elapsed
    
           0.539877000 seconds user
           0.000000000 seconds sys
    

    About the same on a clock-for-clock basis, although slightly more total cycles where the store buffer was full. (That's between the core and L1d cache, not off-core, so we'd expect about the same for the loop itself. Using -r10 to repeat 10 times, that number is stable to within ±0.01% across runs.)

    performance: 4.2GHz, full turbo to the highest configured freq

    No throttling based on what the code is doing.

    # sudo ... performance
    taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread -r1 ./testloop
    
     Performance counter stats for './testloop':
    
                500.95 msec task-clock:u              #    1.000 CPUs utilized          
                500.95 msec task-clock                #    1.000 CPUs utilized          
                     0      context-switches          #    0.000 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                     7      page-faults               #    0.014 K/sec                  
         2,098,112,999      cycles                    #    4.188 GHz                    
         2,007,994,492      branches                  # 4008.380 M/sec                  
         7,016,551,461      instructions              #    3.34  insn per cycle         
         5,217,839,192      uops_issued.any           # 10415.906 M/sec                 
         7,192,116,174      uops_executed.thread      # 14356.978 M/sec                 
           624,662,664      exe_activity.bound_on_stores # 1246.958 M/sec                  
    
           0.501151045 seconds time elapsed
    
           0.501042000 seconds user
           0.000000000 seconds sys
    

    Overall performance scales linearly with clock speed, so this is a ~1.5x speedup vs. balance_power. (1.44 for balance_performance which has the same 3.9GHz full clock speed.)
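
    (Sanity check from the numbers above: 779.56 ms / 500.95 ms ≈ 1.56, close to the 4.188 GHz / 2.700 GHz ≈ 1.55 clock ratio, and 779.56 ms / 539.83 ms ≈ 1.44 matches 3.9 GHz / 2.7 GHz.)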

    With buffers large enough to cause L1d or L2 cache misses, there's still a difference in core clock cycles.
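
    (In the asm above, that just means bumping the resb size at buf:, e.g. to the commented-out just-under-2-GiB value.)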