I used stress-ng to stress my system and see how that affects a program I wrote.
The program itself is a neural network written in C++, mainly composed of nested loops doing multiplications, using about 1 GB of RAM overall.
I imposed some memory stress on the system using:
stress-ng --vm 4 --vm-bytes 2G -t 100s
which creates 4 workers spinning on mmap, allocating 2 GB of RAM each. This slows down the execution of my program significantly (from about 150 ms to 250 ms). But the reason for the slowdown is not lack of memory or memory bandwidth or anything like that. Instead, the CPU clock frequency drops from 3.4 GHz (without stress-ng) to 2.8 GHz (with stress-ng). The CPU utilization stays about the same (99%), as expected.
I measured the CPU frequency using
sudo perf stat -B ./my_program
Does anybody know why memory stress slows down the CPU?
My CPU is an Intel(R) Core(TM) i5-8250U and my OS is Ubuntu 18.04.
Kind regards, lpolari
Skylake-derived CPUs do lower their core clock speed when bottlenecked on loads / stores, at energy-vs.-performance settings that favour more powersaving. Surprisingly, you can construct artificial cases where this downclocking happens even with stores that all hit in L1d cache, or loads from uninitialized memory (still CoW-mapped to the same zero pages).
Skylake introduced full hardware control of CPU frequency (hardware P-states, HWP): https://unix.stackexchange.com/questions/439340/what-are-the-implications-of-setting-the-cpu-governor-to-performance The frequency decision can take into account internal performance monitoring, which can notice things like spending most cycles stalled, or what it's stalled on. I don't know exactly which heuristic Skylake uses.
You can repro this¹ by looping over a large array without making any system calls. If the array is large (or you stride through whole cache lines in an artificial test), perf stat ./a.out will show that the average clock speed is lower than for normal CPU-bound loops.
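If you'd rather reproduce it from C++ (like the program in the question) than from assembly, a rough sketch of the same idea is below. The buffer size, stride, and store count are arbitrary illustrative choices, not tuned values; build it with optimization and run it under perf stat the same way.

// downclock_repro.cpp: store to one byte per cache line, wrapping over a large
// buffer, with no system calls inside the loop. Compile with e.g.
//   g++ -O2 downclock_repro.cpp -o downclock_repro
// and run it under perf stat to see the average clock speed.
#include <cstddef>
#include <vector>

int main() {
    constexpr std::size_t kLineSize = 64;                 // stride of one cache line
    constexpr std::size_t kBufBytes = 512 * 1024 * 1024;  // large enough to miss in all cache levels
    constexpr std::size_t kStores   = 100'000'000;        // fixed amount of work, regardless of buffer size

    std::vector<char> buf(kBufBytes);      // zero-init touches every page up front
    volatile char* p = buf.data();         // volatile so the stores aren't optimized away

    std::size_t off = 0;
    for (std::size_t i = 0; i < kStores; ++i) {
        p[off] = 1;                        // STORE
        off += kLineSize;
        if (off >= kBufBytes) off = 0;     // wrap the pointer back to the start
    }
    return 0;
}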
In theory, if memory is totally not keeping up with the CPU, lowering the core clock speed (and holding memory controller constant) shouldn't hurt performance much. In practice, lowering the clock speed also lowers the uncore clock speed (ring bus + L3 cache), somewhat worsening memory latency and bandwidth as well.
Part of the latency of a cache miss is getting the request from the CPU core to the memory controller, and single-core bandwidth is limited by max concurrency (outstanding requests one core can track) / latency. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?
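As a back-of-the-envelope illustration of that concurrency limit, you can model single-core bandwidth as roughly bytes-in-flight divided by latency. The fill-buffer count and latency below are assumed round numbers for illustration, not measurements from this system:

// Rough single-core bandwidth model: concurrency * line size / latency.
// 12 line-fill buffers and 80 ns DRAM latency are assumptions for illustration.
#include <cstdio>

int main() {
    const double line_bytes      = 64.0;   // one cache line per outstanding request
    const double max_outstanding = 12.0;   // assumed per-core fill buffers
    const double latency_ns      = 80.0;   // assumed load latency to DRAM

    const double bytes_per_ns = max_outstanding * line_bytes / latency_ns;
    std::printf("~%.1f GB/s per core\n", bytes_per_ns);  // bytes/ns == GB/s; ~9.6 here
    return 0;
}

Worse latency (e.g. from a slower uncore clock) directly lowers that ceiling, which is part of why downclocking hurts even a purely DRAM-bound loop somewhat.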
e.g. my i7-6700k drops from 3.9GHz to 2.7GHz when running a microbenchmark that only bottlenecks on DRAM, at default bootup settings. (Also, it only goes up to 3.9GHz instead of 4.0GHz all-core or 4.2GHz with 1 or 2 cores active, as configured in the BIOS, with the default balance_power EPP setting on boot or with balance_performance.)
This default doesn't seem very good: too conservative for "client" chips, where a single core can nearly saturate DRAM bandwidth, but only at full clock speed. Or too aggressive about powersaving, if you look at it from the other POV, especially for chips like my desktop with a high TDP (95W) that can sustain full clock speed indefinitely, even when running power-hungry stuff like x265 video encoding that makes heavy use of AVX2.
It might make more sense with a ULV 15W chip like your i5-8250U to try to leave more thermal / power headroom for when the CPU is doing something more interesting.
This is governed by their Energy / Performance Preference (EPP) setting. It happens fairly strongly at the default balance_power setting. It doesn't happen at all at full performance, and some quick benchmarks indicate that balance_performance also avoids this powersaving slowdown. I use balance_performance on my desktop.

"Client" (non-Xeon) chips before Ice Lake have all cores locked together so they run at the same clock speed (and will all run higher if even one of them is running something not memory-bound, like a while(1) { _mm_pause(); } loop). But there's still an EPP setting for every logical core. I've always just changed the settings for all cores to keep them the same:
On Linux, reading the settings:
$ grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference:balance_performance
/sys/devices/system/cpu/cpufreq/policy1/energy_performance_preference:balance_performance
...
/sys/devices/system/cpu/cpufreq/policy7/energy_performance_preference:balance_performance
Writing the settings:
echo balance_performance | sudo tee /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference
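If you'd rather check this from inside a program than from the shell, here's a minimal C++ sketch that reads the same sysfs files as the grep command above (it assumes the intel_pstate driver in active/HWP mode, same as the shell version, and omits error handling):

// Print the EPP setting of every cpufreq policy, reading the same sysfs
// files as the grep command above. Requires C++17 for <filesystem>.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    namespace fs = std::filesystem;
    for (const auto& entry : fs::directory_iterator("/sys/devices/system/cpu/cpufreq")) {
        const std::string name = entry.path().filename().string();
        if (name.rfind("policy", 0) != 0)
            continue;                      // skip entries that aren't policy directories
        std::ifstream f(entry.path() / "energy_performance_preference");
        std::string epp;
        std::getline(f, epp);
        std::cout << name << ": " << epp << '\n';
    }
    return 0;
}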
See also the x86_energy_perf_policy(8) man page.

Footnote 1: the test loop stores 1 dword per cache line, advancing through contiguous cache lines until the end of the buffer, then wrapping the pointer back to the start. It repeats this for a fixed number of stores, regardless of buffer size.
;; t=testloop; nasm -felf64 "$t.asm" && ld "$t.o" -o "$t" && taskset -c 3 perf stat -d -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread ./"$t"
;; nasm -felf64 testloop.asm
;; ld -o testloop testloop.o
;; taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread -r1 ./testloop
; or idq.mite_uops
default rel
%ifdef __YASM_VER__
; CPU intelnop
; CPU Conroe AMD
CPU Skylake AMD
%else
%use smartalign
alignmode p6, 64
%endif
global _start
_start:
lea rdi, [buf]
lea rsi, [endbuf]
; mov rsi, qword endbuf ; large buffer. NASM / YASM can't actually handle a huge BSS and hit a failed assert (NASM) or make a binary that doesn't reserve enough BSS space.
mov ebp, 1000000000
align 64
.loop:
%if 0
mov eax, [rdi] ; LOAD
mov eax, [rdi+64]
%else
mov [rdi], eax ; STORE
mov [rdi+64], eax
%endif
add rdi, 128
cmp rdi, rsi
jae .wrap_ptr ; normally falls through, total loop = 4 fused-domain uops
.back:
dec ebp
jnz .loop
.end:
xor edi,edi
mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)
.wrap_ptr:
lea rdi, [buf]
jmp .back
section .bss
align 4096
;buf: resb 2048*1024*1024 - 1024*1024 ; just under 2GiB so RIP-rel still works
buf: resb 1024*1024 / 64 ; 16kiB = half of L1d
endbuf:
resb 4096 ; spare space to allow overshoot
Test system: Arch GNU/Linux, kernel 5.7.6-arch1-1. (And NASM 2.14.02, ld from GNU Binutils 2.34.0.)
The default EPP setting on boot is balance_power, which only ever goes up to 3.9GHz. My boot script changes it to balance_performance, which still only goes to 3.9GHz so the fans stay quiet, but is less conservative.

Hyperthreading is enabled, but the system is idle and the kernel won't schedule anything on the other logical core (the sibling of the one I pinned the test to), so it has a physical core to itself.
However, Hyperthreading being enabled means perf is unwilling to use more programmable perf counters for one thread, so using perf stat -d to monitor L1d loads and replacements, and L3 hits / misses, would mean less accurate measurement for cycles and so on. It's negligible anyway, like 424k L1-dcache-loads (probably in kernel page-fault handlers, interrupt handlers, and other overhead, because the loop has no loads). L1-dcache-load-misses is actually L1D.REPLACEMENT and is even lower, like 48k.
I used a few perf events, including exe_activity.bound_on_stores [Cycles where the Store Buffer was full and no outstanding load]. (See perf list for descriptions, and/or Intel's manuals for more.)
balance_power: 2.7GHz downclock (out of a max of 3.9GHz)

EPP setting: balance_power, set with sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference; do echo balance_power > "$i"; done'
There is throttling based on what the code is doing; with a pause loop on another core keeping clocks high, this code would run faster. Or with different instructions in the loop.
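For what it's worth, the "pause loop on another core" mentioned here (and in the while(1) { _mm_pause(); } remark earlier) is just something like the sketch below, pinned to a spare core; the core number is an arbitrary choice for illustration.

// Spin a pause loop pinned to one logical core. On client chips where all
// cores share one clock, this keeps the package clock from dropping to the
// memory-bound heuristic while a benchmark runs on another core.
// Compile with e.g.: g++ -O2 -pthread pause_spin.cpp
#include <immintrin.h>   // _mm_pause
#include <pthread.h>
#include <sched.h>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);    // logical core 1: arbitrary pick, just not the benchmark's core
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (;;)
        _mm_pause();     // not memory-bound, so the EPP heuristic keeps clocks up
}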
# sudo ... balance_power
$ taskset -c 3 perf stat -etask-clock:u,task-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,exe_activity.bound_on_stores -r1 ./"$t"
Performance counter stats for './testloop':
779.56 msec task-clock:u # 1.000 CPUs utilized
779.56 msec task-clock # 1.000 CPUs utilized
3 context-switches # 0.004 K/sec
0 cpu-migrations # 0.000 K/sec
6 page-faults # 0.008 K/sec
2,104,778,670 cycles # 2.700 GHz
2,008,110,142 branches # 2575.962 M/sec
7,017,137,958 instructions # 3.33 insn per cycle
5,217,161,206 uops_issued.any # 6692.465 M/sec
7,191,265,987 uops_executed.thread # 9224.805 M/sec
613,076,394 exe_activity.bound_on_stores # 786.442 M/sec
0.779907034 seconds time elapsed
0.779451000 seconds user
0.000000000 seconds sys
By chance, this happened to get exactly 2.7GHz. Usually there's some noise or startup overhead and it's a little lower. Note that 5217951928 front-end uops / 2106180524 cycles = ~2.48 average uops issued per cycle, out of a pipeline width of 4, so this is not low-throughput code. The instruction count is higher because of macro-fused compare/branch. (I could have unrolled more so even more of the instructions were stores, less add and branch, but I didn't.)
(I re-ran the perf stat command a couple of times so the CPU wasn't just waking from low-power sleep at the start of the timed interval. There are still page faults in the interval, but 6 page faults are negligible over a 3/4-second benchmark.)
balance_performance: full 3.9GHz, the top speed for this EPP

No throttling based on what the code is doing.
# sudo ... balance_performance
$ taskset -c 3 perf stat -etask-clock:u,task-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,exe_activity.bound_on_stores -r1 ./"$t"
Performance counter stats for './testloop':
539.83 msec task-clock:u # 0.999 CPUs utilized
539.83 msec task-clock # 0.999 CPUs utilized
3 context-switches # 0.006 K/sec
0 cpu-migrations # 0.000 K/sec
6 page-faults # 0.011 K/sec
2,105,328,671 cycles # 3.900 GHz
2,008,030,096 branches # 3719.713 M/sec
7,016,729,050 instructions # 3.33 insn per cycle
5,217,686,004 uops_issued.any # 9665.340 M/sec
7,192,389,444 uops_executed.thread # 13323.318 M/sec
626,115,041 exe_activity.bound_on_stores # 1159.827 M/sec
0.540108507 seconds time elapsed
0.539877000 seconds user
0.000000000 seconds sys
About the same on a clock-for-clock basis, although slightly more total cycles where the store buffer was full. (That's between the core and L1d cache, not off-core, so we'd expect about the same for the loop itself. Using -r10 to repeat 10 times, that number is stable to ±0.01% across runs.)
performance: 4.2GHz, full turbo to the highest configured frequency

No throttling based on what the code is doing.
# sudo ... performance
taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread -r1 ./testloop
Performance counter stats for './testloop':
500.95 msec task-clock:u # 1.000 CPUs utilized
500.95 msec task-clock # 1.000 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
7 page-faults # 0.014 K/sec
2,098,112,999 cycles # 4.188 GHz
2,007,994,492 branches # 4008.380 M/sec
7,016,551,461 instructions # 3.34 insn per cycle
5,217,839,192 uops_issued.any # 10415.906 M/sec
7,192,116,174 uops_executed.thread # 14356.978 M/sec
624,662,664 exe_activity.bound_on_stores # 1246.958 M/sec
0.501151045 seconds time elapsed
0.501042000 seconds user
0.000000000 seconds sys
Overall performance scales linearly with clock speed, so this is a ~1.5x speedup vs. balance_power. (1.44x for balance_performance, which has the same 3.9GHz maximum clock speed as balance_power but doesn't downclock on this code.)
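Plugging in the task-clock and average-frequency numbers from the three runs above confirms the near-linear scaling:

// Speedup vs. balance_power, compared to the clock-frequency ratio,
// using the task-clock (ms) and GHz numbers from the perf output above.
#include <cstdio>

int main() {
    const double t_bal_power = 779.56, f_bal_power = 2.700;
    const double t_bal_perf  = 539.83, f_bal_perf  = 3.900;
    const double t_perf      = 500.95, f_perf      = 4.188;

    std::printf("performance        : %.2fx faster, %.2fx clock\n",
                t_bal_power / t_perf, f_perf / f_bal_power);          // ~1.56x vs ~1.55x
    std::printf("balance_performance: %.2fx faster, %.2fx clock\n",
                t_bal_power / t_bal_perf, f_bal_perf / f_bal_power);  // ~1.44x vs ~1.44x
    return 0;
}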
With buffers large enough to cause L1d or L2 cache misses, there's still a difference in core clock cycles.