Tags: caching, x86, x86-64, cpu-architecture, perf

Why do mem_load_retired.l1_hit and mem_load_retired.l1_miss not add to the total number of loads?


I'm investigating the effects of caching on performance on x86-64 CPUs. I've been using Linux's perf to monitor cache hit/miss rates, particularly these counters:

  • mem_inst_retired.all_loads
  • mem_load_retired.l1_hit
  • mem_load_retired.l1_miss

I expect that all_loads ~= l1_hit + l1_miss, because a load instruction can either hit or miss the L1 cache; there is no other option, since all regular loads go through the cache. I have read the Intel documentation for these events, which doesn't say much, and nothing in it disproves my thinking as far as I can tell.

However, when running certain code, I notice that the sum is far below all_loads.
For example, the following assembly, which sums 1 billion memory locations with stride 32B:

.intel_syntax noprefix
.globl main
main:
    sub     rsp, 8     # Stack alignment for call
    push    r12
    push    r13
    mov     r12, 0xfffffff     # Array size & index mask (2^28 - 1)

    mov     rdi, r12
    call    malloc              # Allocate array
    mov     r13, rax          # Save array pointer

    mov     rdi, r13          # Write to array to force page-in
    mov     rsi, 42
    mov     rdx, r12
    call    memset

    mov     rcx, 1000000000    # Loop counter
    mov     rdx, 0             # Index sequence start
    mov     rax, 0             # Result accumulator

.p2align 4      # Skylake JCC alignment issue
loop:
    mov     rdi, rdx
    and     rdi, r12      # Mask index to array size
    movzx   rsi, BYTE PTR [r13+rdi]   # Read from array index
    add     rax, rsi

    lea     rdx, [rdx+32] # Generate next array index
    dec     rcx        # Loop counter & condition
    jnz     loop

    pop     r13
    pop     r12
    add     rsp, 8
    ret
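
For reference, here is a rough C equivalent of the test (my own sketch, not the program that was measured; note that a compiler may vectorize or otherwise transform this loop, which the hand-written asm avoids):

#include <stdlib.h>
#include <string.h>

int main(void) {
    const unsigned long mask = 0xfffffff;   /* array size & index mask (2^28 - 1) */
    unsigned char *buf = malloc(mask + 1);  /* the asm allocates only `mask` bytes, so its
                                               highest masked index overreads by one byte */
    memset(buf, 42, mask);                  /* write to force page-in */

    unsigned long sum = 0, idx = 0;
    for (long i = 0; i < 1000000000L; i++) {
        sum += buf[idx & mask];             /* one load per iteration */
        idx += 32;                          /* 32-byte stride: 2 loads per 64-byte line */
    }
    return sum & 0xff;                      /* keep sum live so it isn't optimized away */
}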

It yields the following perf results:

~$ perf stat -e instructions,cycles,mem_inst_retired.all_loads,mem_load_retired.l1_hit,mem_load_retired.l1_miss ./test

 Performance counter stats for './test':

     7,000,215,742      instructions:u                   #    1.25  insn per cycle            
     5,589,048,737      cycles:u                                                              
       998,622,922      mem_inst_retired.all_loads:u                                          
        17,215,080      mem_load_retired.l1_hit:u                                             
       424,118,595      mem_load_retired.l1_miss:u                                            

       1.939187022 seconds time elapsed

       1.826889000 seconds user
       0.112177000 seconds sys

all_loads is approximately 1 billion, as expected (albeit a little lower; maybe some counting artefact). However, l1_hit + l1_miss is only about 441 million, so more than half of the loads (~56%) have gone unaccounted for.

What is causing l1_hit and l1_miss to not sum to all_loads?

Interestingly, if the load stride is varied such that almost all loads are either hits or misses, the results tend towards all_loads ~= l1_hit + l1_miss. It is only in the middle ground that the equality breaks down.

EDIT: I tested this on two CPUs, a Kaby Lake and an Ice Lake. Both showed the same results.


Solution

  • As Margaret Bloom pointed out in comments, a load to the same cache line as an already-outstanding miss can "hit" in the line-fill buffer (LFB) allocated for that miss, instead of allocating a new one. It turns out that counts as neither an l1_hit nor an l1_miss, and there's a separate event for it: mem_load_retired.fb_hit. (It's probably good that l1_miss only counts instructions that result in a new request to L2, rather than also counting LFB hits as misses. Also note that LFBs can be occupied by outgoing stores, including NT stores, so an LFB hit due to that is also possible; it's not always just due to multiple loads.)

    Your code strides by 32 bytes, so it does 2 loads per 64-byte line; the second one will usually be an LFB hit. (The first one might also be an LFB hit if hardware prefetching has already requested the line; this probably explains there being more LFB hits than misses.)
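
    A quick sanity check of that model against the runs below (my arithmetic, assuming the array is far larger than any cache level, which it is at 256 MiB):

        1e9 loads / (2 loads per 64-byte line) = 5e8 lines fetched
        => with no prefetching: ~500M l1_miss + ~500M fb_hit expected
        observed: ~450M l1_miss + ~534M fb_hit
        (prefetch converting some would-be misses into LFB hits)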

    On my Skylake i7-6700k with this test program, mem_inst_retired.all_loads is only greater than mem_load_retired.fb_hit + mem_load_retired.l1_hit + mem_load_retired.l1_miss by about 0.6%.

    So there's still a bit of a mystery: what is counted by mem_inst_retired.all_loads but not by any of the three more specific counters? I'd have expected them to be closer to exactly equal, especially with --all-user or the :u events, so there isn't noise¹ while the counters are being programmed or collected.

    With perf stat --no-big-num --all-user -e ... for easy copy/paste of the numbers into calc, I got hit+miss+LFB = 994.188M vs. all_loads = 999.958M counts in one run, so the sum is low by 0.58%. This is pretty typical across repeated runs: the sum of the miss/hit/LFB counters comes out a small fraction lower than mem_inst_retired.all_loads.

    A couple more runs:

    $ perf stat --all-user --no-big-num -e task-clock,page-faults,instructions,cycles,mem_inst_retired.all_loads,mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired.fb_hit ./a.out
    
     Performance counter stats for './a.out':
    
               1673.21 msec task-clock                       #    0.997 CPUs utilized             
                   183      page-faults                      #  109.371 /sec                      
            7000141672      instructions                     #    1.56  insn per cycle            
            4475942186      cycles                           #    2.675 GHz                       
             999892622      mem_inst_retired.all_loads       #  597.590 M/sec                     
              10563966      mem_load_retired.l1_hit          #    6.314 M/sec                     
             449822478      mem_load_retired.l1_miss         #  268.838 M/sec                     
             533816318      mem_load_retired.fb_hit          #  319.038 M/sec                     
    
           1.677680356 seconds time elapsed
    
           1.647640000 seconds user
           0.023197000 seconds sys
    
    
    $ perf stat --all-user --no-big-num -e task-clock,page-faults,instructions,cycles,mem_inst_retired.all_loads,mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired.fb_hit ./a.out
    
     Performance counter stats for './a.out':
    
               1649.17 msec task-clock                       #    1.000 CPUs utilized             
                   182      page-faults                      #  110.359 /sec                      
            7000141486      instructions                     #    1.58  insn per cycle            
            4419739785      cycles                           #    2.680 GHz                       
             999850616      mem_inst_retired.all_loads       #  606.275 M/sec                     
               9372903      mem_load_retired.l1_hit          #    5.683 M/sec                     
             450146459      mem_load_retired.l1_miss         #  272.953 M/sec                     
             534244404      mem_load_retired.fb_hit          #  323.947 M/sec                     
    
           1.649504255 seconds time elapsed
    
           1.634275000 seconds user
           0.013270000 seconds sys
    

    (I normally run single-threaded tests under taskset -c 1 to ensure no cpu-migration events, but that typically doesn't happen anyway for short runs on an idle system.)

    My EPP setting (/sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference) is balance-performance (not full performance), so hardware P-state management clocked down to 2.7 GHz on this memory-bound workload. The calculated 2.68 GHz only counts user-space cycles because of --all-user, while task-clock is wall-clock time. (This somewhat reduces per-core memory bandwidth since the uncore slows down, making latency × max-in-flight-lines the limiting factor in single-core memory bandwidth. That isn't a problem for this experiment, but it's something else non-obvious that's visible in this perf data. My i7-6700k has dual-channel DDR4-2666 and runs Arch Linux, kernel 6.4.9.)
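
    (That limit is just Little's law applied to memory parallelism; a worked example with illustrative numbers, not measurements from this machine:)

        single-core bandwidth ≈ (in-flight lines × 64 B) / latency
        e.g. 10 in-flight lines × 64 B / 80 ns ≈ 8 GB/s
        (hardware prefetch into L2 raises the effective concurrency, so
        real numbers come out higher)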

    Footnote 1: even --all-user isn't perfect. @John McCalpin commented:

    The details vary by implementation, but there are lots and lots of little gotchas when trying to make performance counts accurate in the presence of unnecessary user/kernel crossings. Ice Lake Xeon will undercount if the counters are configured for user-only or kernel-only (errata ICX14). Going through the kernel is OK for coarse measurements (~10%), but for detailed studies of the consistency of different events it is best to avoid leaving user-space.

    You'd do this by collecting counts in user-space via rdpmc, after getting the kernel to program the counters. (Perhaps via perf_event_open.)
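
    A minimal sketch of that approach (my own, not from the comment; error handling omitted, and it assumes the kernel allows user-space rdpmc, i.e. pc->cap_user_rdpmc is set):

        // Program a counter via perf_event_open, then read it from user
        // space with rdpmc so the measured region makes no kernel crossings.
        #define _GNU_SOURCE
        #include <linux/perf_event.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdint.h>
        #include <stdio.h>

        static uint64_t pmc_read(volatile struct perf_event_mmap_page *pc) {
            uint32_t seq, idx;
            uint64_t count;
            do {                        // seqlock: retry if the kernel updated the page
                seq = pc->lock;
                __sync_synchronize();
                idx = pc->index;        // hardware counter number + 1; 0 = unreadable
                count = pc->offset;
                if (idx)
                    count += __builtin_ia32_rdpmc(idx - 1);
                __sync_synchronize();
            } while (pc->lock != seq);
            return count;
        }

        int main(void) {
            struct perf_event_attr attr = {
                .type = PERF_TYPE_HARDWARE,
                .size = sizeof(attr),
                .config = PERF_COUNT_HW_INSTRUCTIONS, // a raw event would be needed for mem_load_retired.*
                .exclude_kernel = 1,
            };
            int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
            volatile struct perf_event_mmap_page *pc =
                mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);

            uint64_t start = pmc_read(pc);
            // ... region under test: no syscalls or page faults here ...
            uint64_t end = pmc_read(pc);
            printf("counted: %lu\n", end - start);
        }

    (Production code should also sign-extend the rdpmc result using pc->pmc_width; the kernel's tools/lib/perf has a complete reader.)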

    Avoiding interrupts on that core could be done via short test intervals; otherwise you'd want to look into the isolcpus= Linux kernel boot option so you can test for longer than a timeslice while still avoiding any user/kernel transitions during the timed / profiled region.
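
    For example (one possible setup, not the only one; nohz_full and rcu_nocbs are optional extras that reduce timer ticks and RCU callbacks on the isolated core):

        # kernel command line: keep the scheduler off core 3
        isolcpus=3 nohz_full=3 rcu_nocbs=3

        # then pin the test to it explicitly
        $ taskset -c 3 ./a.out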