I'm investigating how caching affects performance on x86-64 CPUs. I've been using Linux's perf to monitor cache hit/miss rates, particularly these counters:
mem_inst_retired.all_loads
mem_load_retired.l1_hit
mem_load_retired.l1_miss
I expect that all_loads ~= l1_hit + l1_miss, because a load instruction can either hit or miss the L1 cache - no other option, because all regular loads go through the cache. I have read the Intel documentation here, which doesn't say much and nothing that would disprove my thinking, as far as I can tell.
However, when running certain code, I notice that the sum is far below all_loads.
For example, the following assembly, which sums 1 billion memory locations with stride 32B:
.intel_syntax noprefix
.globl main
main:
sub rsp, 8 # Stack alignment for call
push r12
push r13
mov r12, 0xfffffff # Array size
mov rdi, r12
call malloc # Allocate array
mov r13, rax # Save array pointer
mov rdi, r13 # Write to array to force page-in
mov rsi, 42
mov rdx, r12
call memset
mov rcx, 1000000000 # Loop counter
mov rdx, 0 # Index sequence start
mov rax, 0 # Result accumulator
.p2align 4 # Skylake JCC alignment issue
loop:
mov rdi, rdx
and rdi, r12 # Mask index to array size
movzx rsi, BYTE PTR [r13+rdi] # Read from array index
add rax, rsi
lea rdx, [rdx+32] # Generate next array index
dec rcx # Loop counter & condition
jnz loop
pop r13
pop r12
add rsp, 8
ret
It yields the following perf results:
~$ perf stat -e instructions,cycles,mem_inst_retired.all_loads,mem_load_retired.l1_hit,mem_load_retired.l1_miss ./test
Performance counter stats for './test':
7,000,215,742 instructions:u # 1.25 insn per cycle
5,589,048,737 cycles:u
998,622,922 mem_inst_retired.all_loads:u
17,215,080 mem_load_retired.l1_hit:u
424,118,595 mem_load_retired.l1_miss:u
1.939187022 seconds time elapsed
1.826889000 seconds user
0.112177000 seconds sys
all_loads is approximately 1 billion as expected (albeit a little lower - maybe some sampling artefact, though). However, l1_hit + l1_miss is about 450 million - it seems like ~50% of the loads have gone unaccounted for.
What is causing l1_hit and l1_miss to not sum to all_loads?
Interestingly, if the memory load stride is varied such that almost all loads are hits or misses, the results tend towards all_loads ~= l1_hit + l1_miss. It is only in the middle ground where the equality breaks down.
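For readers who prefer C, the loop is roughly equivalent to the following sketch (the measurements are from the assembly above, since a compiler is free to vectorize or otherwise transform these loads):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    const uint64_t mask = 0xfffffff;           // array size / index mask, same as r12
    unsigned char *buf = malloc(mask);         // (error checking omitted)
    memset(buf, 42, mask);                     // write to the array to force page-in

    uint64_t sum = 0, idx = 0;
    for (uint64_t i = 0; i < 1000000000ULL; i++) {
        sum += buf[idx & mask];                // byte load from the masked index
        idx += 32;                             // 32-byte stride: two loads per 64-byte line
    }
    return (int)sum;                           // keep the sum live so it isn't optimized away
}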
EDIT: I tested this on two CPUs: a Kaby Lake, and an Ice Lake. Both showed the same results.
As Margaret Bloom pointed out in comments, a load to the same cache line as an already-outstanding miss can "hit" in that line-fill buffer (LFB) instead of allocating a new one. It turns out that counts as neither an l1_hit nor an l1_miss. And there's a separate event for it, mem_load_retired.fb_hit. (It's probably good that l1_miss only counts instructions that result in a new request to L2, rather than also counting LFB hits as misses. Also note that LFBs can be occupied by outgoing stores, including NT stores, so an LFB hit due to that is also possible; it's not always just due to multiple loads.)
Your code strides by 32 bytes, so it does 2 loads per 64-byte line; the second one will usually be an LFB hit. (The first one might also be an LFB hit if hardware prefetching has already requested the line; this probably explains there being more LFB hits than misses.)
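A rough sanity check on the numbers below: with a 32-byte stride, each 64-byte line gets two back-to-back loads, so of the 1 billion loads you'd expect ~500M "first" loads and ~500M "second" loads; the ~256 MiB buffer is far bigger than L3, so the first load of each line should miss (or LFB-hit if prefetch got there first) and the second should LFB-hit. The measured ~450M l1_miss + ~534M fb_hit + ~10M l1_hit fit that picture, with prefetch converting roughly 50M of the would-be misses into LFB or L1 hits.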
On my Skylake i7-6700k with this test program, mem_inst_retired.all_loads is only greater than mem_load_retired.fb_hit + mem_load_retired.l1_hit + mem_load_retired.l1_miss by about 0.6%.
So there's still a bit of a mystery about what the difference is - what's counted by mem_inst_retired.all_loads but not by any of the three more specific counters. I'd have expected them to be closer to exactly equal, especially with --all-user or the :u events, so there isn't noise1 while counters are being programmed or collected.
With perf stat --no-big-num --all-user -e ... for easy copy/paste of the numbers into calc, I got hit+miss+LFB = 994.188M vs. all-loads = 999.958M counts in one run. So the sum is low by 0.58%. On repeated runs this is pretty typical, with the sum of the miss/hit/LFB counters being a fraction of a percent lower than mem_inst_retired.all_loads.
A couple more runs:
$ perf stat --all-user --no-big-num -e task-clock,page-faults,instructions,cycles,mem_inst_retired.all_loads,mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired.fb_hit ./a.out
Performance counter stats for './a.out':
1673.21 msec task-clock # 0.997 CPUs utilized
183 page-faults # 109.371 /sec
7000141672 instructions # 1.56 insn per cycle
4475942186 cycles # 2.675 GHz
999892622 mem_inst_retired.all_loads # 597.590 M/sec
10563966 mem_load_retired.l1_hit # 6.314 M/sec
449822478 mem_load_retired.l1_miss # 268.838 M/sec
533816318 mem_load_retired.fb_hit # 319.038 M/sec
1.677680356 seconds time elapsed
1.647640000 seconds user
0.023197000 seconds sys
$ perf stat --all-user --no-big-num -e task-clock,page-faults,instructions,cycles,mem_inst_retired.all_loads,mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired.fb_hit ./a.out
Performance counter stats for './a.out':
1649.17 msec task-clock # 1.000 CPUs utilized
182 page-faults # 110.359 /sec
7000141486 instructions # 1.58 insn per cycle
4419739785 cycles # 2.680 GHz
999850616 mem_inst_retired.all_loads # 606.275 M/sec
9372903 mem_load_retired.l1_hit # 5.683 M/sec
450146459 mem_load_retired.l1_miss # 272.953 M/sec
534244404 mem_load_retired.fb_hit # 323.947 M/sec
1.649504255 seconds time elapsed
1.634275000 seconds user
0.013270000 seconds sys
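Doing the arithmetic on the first run: 10,563,966 + 449,822,478 + 533,816,318 = 994,202,762 vs. 999,892,622 all_loads, i.e. the sum is again about 0.57% low.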
(I normally run single-threaded tests under taskset -c 1 to ensure no cpu-migration events, but that typically doesn't happen anyway for short runs on an idle system.)
My EPP setting (/sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference) is balance-performance (not full performance), so hardware P-state management clocked down to 2.7GHz on this memory-bound workload. The calculated 2.68GHz only counts user-space cycles because of --all-user, but task-clock counts kernel time on-CPU as well. (This somewhat reduces per-core memory bandwidth since the uncore slows down, making latency x max_in-flight_lines a limiting factor in single-core memory bandwidth. That isn't a problem for this experiment, but it's something else non-obvious that's visible in this perf data. My i7-6700k has dual-channel DDR4-2666, running Arch Linux, kernel 6.4.9.)
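(Rough model for that last point, not something in the perf output itself: a single core can only keep a limited number of cache lines in flight, so its bandwidth ceiling is roughly max_in-flight_lines * 64 bytes / load latency; a slower uncore raises that latency and therefore lowers the ceiling, independent of DRAM bandwidth.)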
Footnote 1: even --all-user isn't perfect. @John McCalpin commented:
The details vary by implementation, but there are lots and lots of little gotchas when trying to make performance counts accurate in the presence of unnecessary user/kernel crossings. Ice Lake Xeon will undercount if the counters are configured for user-only or kernel-only (errata ICX14). Going through the kernel is OK for coarse measurements (~10%), but for detailed studies of the consistency of different events it is best to avoid leaving user-space.
You'd do this by collecting counts in user-space via rdpmc, after getting the kernel to program the counters (perhaps via perf_event_open).
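A minimal sketch of that pattern (my illustration, not code from the question or comments; it assumes x86-64 Linux with a permissive perf_event_paranoid, counts a generic hardware event rather than the raw MEM_LOAD_RETIRED.* encodings, and skips the seqlock/offset handling a robust reader loop would need):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static inline uint64_t rdpmc(unsigned counter) {
    unsigned lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;            // generic event for the sketch; the
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  // MEM_LOAD_RETIRED.* events would use PERF_TYPE_RAW
    attr.exclude_kernel = 1;                   // user-only, like :u / --all-user
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    // The mmap'ed metadata page (struct perf_event_mmap_page) tells us which
    // hardware counter the kernel assigned, so we can rdpmc it directly.
    struct perf_event_mmap_page *pc =
        mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
    if (pc == MAP_FAILED || !pc->cap_user_rdpmc || pc->index == 0) {
        fprintf(stderr, "user-space rdpmc not available for this event\n");
        return 1;
    }

    uint64_t begin = rdpmc(pc->index - 1);     // index is the counter number + 1

    // ... the region under test (e.g. the strided-load loop) goes here ...

    uint64_t end = rdpmc(pc->index - 1);
    printf("counter delta: %llu\n", (unsigned long long)(end - begin));
    return 0;
}

Keeping the fd and the mapping open keeps the counter programmed for this task, so the two rdpmc reads bracket the measured region without any kernel entry in between.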
Avoiding interrupts on that core could be done via short test intervals; otherwise you'd want to look into the isolcpus= Linux kernel boot option so you can test for longer than a timeslice while still avoiding any user/kernel transitions during the timed / profiled region.