Tags: linux, x86, intel, perf, intel-pmu

L1-dcache-stores, LLC-stores, cache-references and uncore memory counter don't add up in Linux perf?


I am trying to measure memory bus related performance of a simple test program on an Intel N150 (Twin Lake, which has four Gracemont cores, like Alder Lake E-cores).

PMU counters from perf stat don't make complete sense. The L1-dcache and uncore counters make sense, cache-references a bit less, and LLC-[loads|stores] are just strange. I assumed that LLC-[load|store]-misses should be directly related to transactions on the memory bus: an LLC miss should lead to an access to DRAM. But the counters don't show that at all. I also cannot find the LLC events in /sys/, so I don't know which raw PMU events they are assigned to:

$ ls /sys/bus/event_source/devices/*/events/ | grep -i "llc"
$
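(Side note: perf stat -vv dumps the perf_event_attr that an event gets programmed with, which at least shows the raw type/config the generic name resolves to, even if not which hardware event that actually is:)

$ perf stat -vv -e LLC-loads -- true
# the verbose output includes a perf_event_attr block with the type and config fields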

The program simply initializes a large array of data (1GB), and runs a trivial calculation over it 32 or 64 times:

constexpr int N = 256'000'000;
unsigned int AData[N];

template <typename T>
T procItem(T item) {
  return item & 0b11011101011;
}

int main() {
  ...
  for (unsigned long long i=0; i<N; i++) {
    AData[i] = i;
  }

  constexpr unsigned n_proc = 32; // or 64
  for (int i=0; i<n_proc; i++) {
    for (auto& item : AData) {
        item += procItem(item);
    }
  }
}

I compile it without optimizations. And a run with n_proc=32 shows something like this:

g++ -std=c++23 test.cpp -o test
perf stat -e cache-references,cache-misses,L1-dcache-loads,LLC-loads,LLC-load-misses,L1-dcache-stores,LLC-stores,LLC-store-misses -- ./test

 Performance counter stats for './test':

       976 160 248      cache-references                                                        (50,00%)
       566 015 655      cache-misses                     #   57,98% of all cache refs           (62,50%)
    99 720 247 740      L1-dcache-loads                                                         (62,51%)
        11 976 479      LLC-loads                                                               (62,50%)
             6 361      LLC-load-misses                  #    0,05% of all LL-cache accesses    (62,50%)
    50 016 740 349      L1-dcache-stores                                                        (62,50%)
        11 492 008      LLC-stores                                                              (37,50%)
         6 866 101      LLC-store-misses                                                        (37,50%)

      22,284520679 seconds time elapsed

      20,959478000 seconds user
       0,320961000 seconds sys

It is already a bit strange here: how do cache-references and cache-misses relate to the L1-dcache-* and LLC-* counters?

A run with n_proc=64:

 Performance counter stats for './test':

     1 927 756 869      cache-references                                                        (50,00%)
     1 098 114 980      cache-misses                     #   56,96% of all cache refs           (62,50%)
   198 078 159 110      L1-dcache-loads                                                         (62,50%)
        17 748 504      LLC-loads                                                               (62,50%)
            10 305      LLC-load-misses                  #    0,06% of all LL-cache accesses    (62,50%)
    99 170 857 885      L1-dcache-stores                                                        (62,50%)
        11 529 003      LLC-stores                                                              (37,50%)
         6 786 904      LLC-store-misses                                                        (37,50%)

      47,097830995 seconds time elapsed

      45,767071000 seconds user
       0,322958000 seconds sys

The cache-* and L1-dcache-* counters increase by 2x, as expected. But the LLC-* counters do not. LLC-stores are especially strange: they barely change at all.

Also of note, perf list has mem-loads and mem-stores events. But mem-loads always shows a count of 0, and the mem-stores counts are the same as L1-dcache-stores. (I cannot find the L1-dcache events under /sys/bus/event_source/devices/*/events/, so I cannot compare the raw event and umask to be sure.)

$ cat /sys/bus/event_source/devices/cpu/events/mem-loads 
event=0xd0,umask=0x5,ldlat=3
$ cat /sys/bus/event_source/devices/cpu/events/mem-stores 
event=0xd0,umask=0x6
$ uname -r
6.11.0-29-generic

Then, if I compile it with -Og, I get this:

n_proc=32
 Performance counter stats for './test':

       973 674 502      cache-references                                                        (50,01%)
       671 750 305      cache-misses                     #   68,99% of all cache refs           (62,51%)
    16 764 160 104      L1-dcache-loads                                                         (62,50%)
        58 190 499      LLC-loads                                                               (62,50%)
           560 785      LLC-load-misses                  #    0,96% of all LL-cache accesses    (62,50%)
    16 958 572 799      L1-dcache-stores                                                        (62,49%)
        11 632 355      LLC-stores                                                              (37,50%)
         6 481 939      LLC-store-misses                                                        (37,50%)

      11,281720398 seconds time elapsed

       9,968157000 seconds user
       0,311973000 seconds sys

n_proc=64
 Performance counter stats for './test':

     1 915 396 715      cache-references                                                        (50,00%)
     1 313 091 424      cache-misses                     #   68,55% of all cache refs           (62,50%)
    33 175 225 108      L1-dcache-loads                                                         (62,50%)
       115 378 508      LLC-loads                                                               (62,51%)
         1 089 098      LLC-load-misses                  #    0,94% of all LL-cache accesses    (62,50%)
    33 354 560 864      L1-dcache-stores                                                        (62,50%)
        12 073 424      LLC-stores                                                              (37,49%)
         6 552 391      LLC-store-misses                                                        (37,50%)

      21,374890682 seconds time elapsed

      20,049017000 seconds user
       0,318920000 seconds sys

The L1-dcache-* counts decreased, as expected from more efficient code. But LLC-loads increased with respect to the runs without -Og. LLC-loads do increase by a factor of 2 from n_proc=32 to n_proc=64, which makes sense. But LLC-stores have not really changed.

Finally, I also ran it with the uncore events which measure DRAM CAS commands, i.e. actual memory bus transactions. In this case, perf stat has to run system-wide with -a; otherwise the uncore counters show <not supported>.

perf stat -e cache-references,cache-misses,L1-dcache-loads,LLC-loads,LLC-load-misses,L1-dcache-stores,LLC-stores,LLC-store-misses \
  -e unc_m_cas_count_rd,unc_m_cas_count_wr -e uncore_imc_free_running/data_read/ \
  -a -- ./test

With -Og compilation:

n_proc=32
 Performance counter stats for 'system wide':

       998 009 503      cache-references                                                        (49,99%)
       675 324 843      cache-misses                     #   67,67% of all cache refs           (62,50%)
    16 831 117 958      L1-dcache-loads                                                         (62,50%)
        61 496 976      LLC-loads                                                               (62,51%)
           556 737      LLC-load-misses                  #    0,91% of all LL-cache accesses    (62,51%)
    16 999 289 798      L1-dcache-stores                                                        (62,51%)
        12 125 538      LLC-stores                                                              (37,49%)
         6 425 956      LLC-store-misses                                                        (37,49%)
       547 439 524      unc_m_cas_count_rd                                                    
       528 958 625      unc_m_cas_count_wr                                                    
         33 413,03 MiB  uncore_imc_free_running/data_read/                                      

      11,515426638 seconds time elapsed


n_proc=64
 Performance counter stats for 'system wide':

     1 964 026 474      cache-references                                                        (50,01%)
     1 322 080 946      cache-misses                     #   67,31% of all cache refs           (62,51%)
    33 291 196 083      L1-dcache-loads                                                         (62,50%)
       122 590 187      LLC-loads                                                               (62,50%)
         1 083 470      LLC-load-misses                  #    0,88% of all LL-cache accesses    (62,50%)
    33 430 279 894      L1-dcache-stores                                                        (62,50%)
        13 117 422      LLC-stores                                                              (37,50%)
         6 436 536      LLC-store-misses                                                        (37,50%)
     1 077 224 939      unc_m_cas_count_rd                                                    
     1 041 069 003      unc_m_cas_count_wr                                                    
         65 748,53 MiB  uncore_imc_free_running/data_read/                                      

      21,641199259 seconds time elapsed

So, uncore CAS events also make sense. It looks like 1 CAS command corresponds to a transaction of 32 Bytes: 1G read + 1G write commands = 64GB of uncore_imc_free_running/data_read/. Is that correct?

Also, it looks like one L1-dcache-[load|store] means a load|store of one Byte: 33G L1-dcache-loads (most of which must miss) = 1G of 32-Byte unc_m_cas_count_rd. Is that correct? Does it depend on the register size in instructions or is it always counted per-byte?

Then, how do cache-references relate to L1-dcache-[loads|stores] and the uncore counters? perf list says on one line that cache-references is a Hardware event, and on another that it is a Kernel PMU event. If it is a Kernel PMU event, could these counters just be somewhat unreliable? I.e. should cache-misses be equal to unc_m_cas_count_rd + unc_m_cas_count_wr? Or can one cache miss trigger two memory transactions, a read and a write to DRAM together?

Finally, what to make of LLC-loads and especially LLC-stores? It seems like LLC-loads do mean something, it is just not clear how they relate to the other metrics. But LLC-stores are strange. I don't find these events under /sys/bus/event_source/devices/, but they are listed at the beginning of perf list:

$ perf list
  branch-instructions OR branches                    [Hardware event]
...

tool:
...

cache:
  L1-dcache-loads OR cpu/L1-dcache-loads/
  L1-dcache-stores OR cpu/L1-dcache-stores/
  L1-icache-loads OR cpu/L1-icache-loads/
  L1-icache-load-misses OR cpu/L1-icache-load-misses/
  LLC-loads OR cpu/LLC-loads/
  LLC-load-misses OR cpu/LLC-load-misses/
  LLC-stores OR cpu/LLC-stores/
  LLC-store-misses OR cpu/LLC-store-misses/
...

I also ran this program in the VTune Memory Access analysis. The analysis shows the CAS counters for the memory bus bandwidth on the platform. It looks like VTune uses the mem_uops_retired.all_[loads|stores] counters as Loads and Stores, and the L1-dcache-* events map to exactly the same thing:

perf stat -e L1-dcache-loads,L1-dcache-stores \
  -e mem_uops_retired.all_loads,mem_uops_retired.all_stores \
  -- ./test

 Performance counter stats for './test':

    33 190 440 300      L1-dcache-loads                                                       
    33 364 392 208      L1-dcache-stores                                                      
    33 190 440 300      mem_uops_retired.all_loads                                            
    33 364 392 208      mem_uops_retired.all_stores                                           

      31,416008016 seconds time elapsed

      29,959729000 seconds user
       0,444892000 seconds sys

Solution

  • The answer to the main question: LLC-[loads|stores] don't match the DRAM transactions (the uncore CAS commands unc_m_cas_count_[rd|wr]) because they don't count all load and store events in the LLC. They count only the "demand" events, which are initiated directly by the program's instructions. But the CPU manages the transactions to system memory mostly on its own, out of order: the hardware prefetcher loads data from DRAM into the caches, and cache evictions trigger writebacks that store data from the caches to DRAM.

    Prefetch measurably improves the program's execution time (by about 3x in this case: with all HW prefetching disabled, the run takes roughly 3x longer). There are about the same number of DRAM transactions, but fewer demand misses, especially in code like this which runs slowly enough for HW prefetch to usually have the data all the way in L2 or L1d before the CPU tries to access it. So the core sees mostly L1d hits and doesn't have to stall much. (Compiling with -O3 to use SIMD would make the code fast enough that memory can't keep up, resulting in demand loads having to wait for an already-started prefetch load; that still helps keep the maximum number of loads in flight and gets close to peak single-core memory bandwidth.) An update on the effect of the L2 and L1 prefetchers on DRAM transactions and execution time is at the end of the answer.

    Following @peter-cordes's comments, I looked more into the Intel perfmon events, tried disabling the HW prefetchers via MSR 0x1a4, and can confirm that this is indeed how LLC-[loads|stores] work. The following are the details of this investigation, in the order of the three original questions.

    L1-dcache-[loads|stores] count all store and load instructions.

    I.e. instructions that access memory like mov (%rdi),%r11. I tried to change AData:

    //unsigned int AData[N];
    unsigned short AData[N];
    

    It results in the same L1-dcache counts, since we compiled without auto-vectorization. But the memory-related counts decrease by 2x, since we only access half the total number of cache lines. (SIMD vectorization would access the L1d cache in 16-byte chunks regardless of element width, and would typically run into a memory bandwidth bottleneck unless the data is hot in L1d or maybe L2 cache.)
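    (A sketch of that comparison; test_short here is a hypothetical binary built from the unsigned short variant, and the uncore event needs system-wide counting:)

    # test_short = hypothetical build of the program with "unsigned short AData[N]"
    perf stat -a -e L1-dcache-loads,L1-dcache-stores,longest_lat_cache.reference,unc_m_cas_count_rd -- ./test_short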

    To fiddle with the instructions, I used g++ to dump the assembly with debug line info (clang++ could not compile the assembly with debug lines into a binary):

    # Makefile
    %.s: %.cpp
            g++ -std=c++23 -Og -g -S $< -o $@

    %: %.s
            g++ -std=c++23 -Og -g $< -o $@


    (Some small gotchas here: leaq AData(%rip), %r11 uses the same syntax as a memory access, but lea does not access memory, it only calculates the address. Also, most memory-destination instructions other than mov, e.g. addl $1, (%rdi), count as both a load and a store. On Atom cores, 256-bit load/store instructions decode to 2 load or store uops; the same goes for 80-bit x87 fldt/fstpt for long double.)
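    (To double-check which instructions in the compiled binary actually touch memory, a plain disassembly also works; this is just a generic objdump invocation, not something from the original runs:)

    objdump -d --no-show-raw-insn ./test | less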

    In the case of my program, there is only 1 load and 1 store inside the loop, on the item reference. But since I call procItem for the calculation and -Og doesn't inline it, the call/ret pair adds 1 more load and 1 more store, of the return address:

    template <typename T>
    T procItem(T item) {
      return item & 0b11011101011;
    }
    
    constexpr unsigned n_proc = 32;
    for (int i=0; i<n_proc; i++) {
      for (auto& item : AData) {
        item += procItem(item);
        //item += item & 0b11011101011;
        // this shows half as many L1-dcache loads and stores
      }
    }
    

    When the return-address load and store are accounted for, L1-dcache counts add up exactly to the expected numbers.
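    (A back-of-the-envelope check, my own arithmetic for the -Og build with n_proc=64: 2 loads and 2 stores per inner iteration, counting the return-address accesses of call/ret:)

    $ echo $(( 2 * 256000000 * 64 ))
    32768000000    # ~33G, matching the L1-dcache-loads and L1-dcache-stores counts above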

    cache-references are indeed all LLC references

    cache-references = longest_lat_cache.reference and cache-misses = longest_lat_cache.miss, which are documented as:

    Counts the number of cacheable memory requests that miss in the Last Level Cache (LLC). Requests include demand loads, reads for ownership (RFO), instruction fetches and L1 HW prefetches. If the core has access to an L3 cache, the LLC is the L3 cache, otherwise it is the L2 cache.
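    (A quick way to check that the generic names really resolve to these events is to count both side by side; this is a sketch, not one of the runs shown here:)

    perf stat -e cache-references,longest_lat_cache.reference,cache-misses,longest_lat_cache.miss -- ./test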

    When the L2 HW prefetcher is turned off, these counts do become about the same as the transactions to DRAM. After disabling the HW prefetcher in BIOS:

    $ sudo rdmsr --all 0x1a4                                                           
    804
    ...
    # after BIOS change:
    $ sudo rdmsr --all 0x1a4                                                           
    805
    ...
    

    Register 0x1a4 is the MSR for prefetch control on the 12th-13th gen P-cores: section 2.17.5 in the MSR manual, version 88, 2025-06. But I think it is the same on E-cores. (My E-core CPU family is 6_BEh, which does not have a section in the manual.) The manual's "Table 2-47. MSRs Supported by 12th and 13th Generation Intel® Core™ Processor P-core" says for 0x1a4:

    Register Address: 1A4H (420)    Register Name: MSR_PREFETCH_CONTROL

    Bit  Field                                       Description
    0    L2_HARDWARE_PREFETCHER_DISABLE              If 1, disables the L2 hardware prefetcher, which fetches additional lines of code or data into the L2 cache.
    1    L2_ADJACENT_CACHE_LINE_PREFETCHER_DISABLE   If 1, disables the adjacent cache line prefetcher, which fetches the cache line that comprises a cache line pair (128 bytes).
    2    DCU_HARDWARE_PREFETCHER_DISABLE             If 1, disables the L1 data cache prefetcher, which fetches the next cache line into the L1 data cache.
    3    DCU_IP_PREFETCHER_DISABLE                   If 1, disables the L1 data cache IP prefetcher, which uses sequential load history (based on the instruction pointer of previous loads) to determine whether to prefetch additional lines.

    So, going from the default 0x804 to 0x805 flips only bit 0, i.e. it disables only the L2 HW prefetcher, not the DCU (L1) prefetchers.
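    (Making the bit arithmetic explicit; plain shell arithmetic, nothing machine-specific:)

    $ printf '0x%x\n' $(( 0x804 ^ 0x805 ))
    0x1    # only bit 0 differs: L2_HARDWARE_PREFETCHER_DISABLE in the P-core table above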

    When the L2 prefetcher is disabled, longest_lat_cache.reference (aka cache-references) gets much closer to the counts of DRAM transactions:

    0x1a4 = 0x805:
    
    32 repetitions g++ -Og:
     Performance counter stats for 'system wide':
    
        17 143 131 658      L1-dcache-loads                                                         (62,49%)
        17 191 800 959      L1-dcache-stores                                                        (62,49%)
            10 693 254      LLC-loads                                                               (62,50%)
                     0      LLC-load-misses                                                         (62,50%)
            22 570 714      LLC-stores                                                              (25,00%)
                     0      LLC-store-misses                                                        (25,00%)
           558 915 095      cache-references                                                        (37,50%)
           538 758 089      cache-misses                     #   96,39% of all cache refs           (50,00%)
           614 170 946      unc_m_cas_count_rd
           560 849 764      unc_m_cas_count_wr
             37 485,92 MiB  uncore_imc_free_running/data_read/
    
          14,739593106 seconds time elapsed
    

    Here, the amount of data read from DRAM is larger than in the usual runs with the prefetcher on: 37 GB vs 33-34 GB. So, when the HW prefetcher is ON, the cache-references count is a bit higher, but there are fewer DRAM transactions. I am not sure why exactly longest_lat_cache.reference gets higher, but the prefetcher makes the system more efficient.

    LLC-[stores|loads] count only the "demand" stores and loads

    Firstly, LLC-loads is ocr.demand_data_rd.any_response and LLC-stores is ocr.demand_rfo.any_response. I conclude that these counters are the same by running the test with no other counters included, so there is no multiplexing, like this:

     Performance counter stats for './test':
    
           112 321 597      LLC-loads
            12 027 750      LLC-stores
           112 321 597      ocr.demand_data_rd.any_response 
            12 027 750      ocr.demand_rfo.any_response
        
          22,237195796 seconds time elapsed
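
    (For reference, a command along these lines should reproduce that comparison; with only 4 events the counters are not multiplexed:)

    perf stat -e LLC-loads,LLC-stores,ocr.demand_data_rd.any_response,ocr.demand_rfo.any_response -- ./test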
    

    There was one odd case of LLC-loads != ocr.demand_data_rd.any_response, when counting on the whole system:

     Performance counter stats for 'system wide':
    
           116 590 144      LLC-loads    
            12 267 280      LLC-stores       
           116 590 200      ocr.demand_data_rd.any_response
            12 267 280      ocr.demand_rfo.any_response
    
          21,175772075 seconds time elapsed
    

    Not sure what this means. Maybe that's some hiccup in how perf operates.

    These LLC-[loads|stores] are only the "demand" events, which (as Peter pointed out) are transactions that originate directly from the instructions, not from the prefetcher or from cache evictions & writebacks. The CPU manages memory out of order, with the prefetcher for loads and the writebacks for stores.

    In my case, the writeback stores should be counted by ocr.corewb_m.any_response:

     Performance counter stats for 'system wide':
    
            91 968 319      LLC-loads                                                               (50,00%)
            13 486 144      LLC-stores                                                              (50,00%)
             6 727 443      LLC-store-misses                                                        (50,00%)
         1 011 163 180      ocr.corewb_m.any_response                                               (50,00%)
         1 125 387 756      unc_m_cas_count_rd
         1 055 465 271      unc_m_prefetch_rd
         1 043 462 855      unc_m_cas_count_wr
    
          27,041476888 seconds time elapsed
    

    So, LLC-stores (or LLC-store-misses?) plus ocr.corewb_m.any_response should roughly add up to unc_m_cas_count_wr. And I think they were always close, but I did not pay a lot of attention to the store writebacks.
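    (A rough check with the numbers from the run above, my own arithmetic:)

    $ echo $(( 13486144 + 1011163180 ))
    1024649324    # LLC-stores + ocr.corewb_m.any_response, vs unc_m_cas_count_wr = 1 043 462 855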

    Now, it seems there is no good ocr.* event for the prefetched loads. But there is unc_m_prefetch_rd among the uncore events:

    MSR 0x1a4 = 804, the prefetchers are ON
    
     Performance counter stats for 'system wide':
    
             7 936 072      LLC-loads                                                               (50,00%)
             6 663 105      LLC-stores                                                              (50,00%)
             3 416 848      LLC-store-misses                                                        (50,00%)
           262 769 598      ocr.corewb_m.any_response                                               (50,00%)
           298 729 402      unc_m_cas_count_rd
           267 441 361      unc_m_prefetch_rd
           266 849 074      unc_m_cas_count_wr
    
          11,584572736 seconds time elapsed
    

    I thought that by disabling the prefetcher, LLC-loads would get close to unc_m_cas_count_rd and unc_m_prefetch_rd would drop to zero. But not so fast:

    MSR 0x1a4 = 805, L2 prefetcher is off:
    
     Performance counter stats for 'system wide':
    
             1 608 878      LLC-loads                                                               (40,00%)
             1 613 284      ocr.demand_data_rd.any_response                                         (60,00%)
             8 752 439      LLC-stores                                                              (60,00%)
                     0      LLC-store-misses                                                        (40,00%)
           265 365 436      ocr.corewb_m.any_response                                               (40,00%)
           294 985 003      unc_m_cas_count_rd
                10 051      unc_m_prefetch_rd
           267 407 004      unc_m_cas_count_wr
    
          11,876177950 seconds time elapsed
    

    At this point, I was just writing 0x807 into the register, and its value was only getting set to 0x805:

    $ sudo wrmsr --all 0x1a4 0x807
    $ sudo rdmsr --all 0x1a4
    805
    805
    805
    805
    

    Then I noticed that the default value 0x804 should rather be overwritten with 0x80f, and with that the prefetchers were finally completely off:

    $ sudo wrmsr --all 0x1a4 0x80f
    $ sudo rdmsr --all 0x1a4
    80d
    80d
    80d
    80d
    
     Performance counter stats for 'system wide':
    
           266 993 851      longest_lat_cache.miss                                                  (83,33%)
           269 460 818      longest_lat_cache.reference                                             (83,33%)
           253 769 648      LLC-loads                                                               (83,33%)
           253 716 493      ocr.demand_data_rd.any_response                                         (83,33%)
           253 067 848      ocr.demand_data_rd.l3_miss                                              (83,33%)
           253 067 530      ocr.demand_data_rd.l3_miss_local                                        (83,33%)
           328 065 224      unc_m_cas_count_rd
                10 325      unc_m_prefetch_rd
    
          34,463586191 seconds time elapsed
    

    Not sure why unc_m_prefetch_rd is not zero though.

    So, I think, this closes all the points. And, if one wants to do a TMA-like analysis of memory transactions, it would look like this:
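    (A rough sketch of what I mean, grouping the events discussed above into demand traffic, dirty writebacks, prefetch reads, and total DRAM CAS commands; this is my own grouping, and with this many events perf will multiplex the core counters:)

    perf stat -a \
      -e ocr.demand_data_rd.any_response,ocr.demand_rfo.any_response \
      -e ocr.corewb_m.any_response \
      -e unc_m_prefetch_rd,unc_m_cas_count_rd,unc_m_cas_count_wr \
      -- ./test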


    Update on the DRAM transactions when the L2 & L1 prefetchers are ON or OFF.

    The typical performance (the median execution time over 5 runs) looks like this for 64 repeats and -Og optimization:

    $ sudo rdmsr -a 0x1a4
    804
    ...
    
     Performance counter stats for 'system wide':
    
        33 214 476 658      L1-dcache-loads
        33 377 188 158      L1-dcache-stores
         1 916 345 898      longest_lat_cache.reference
         1 307 250 568      longest_lat_cache.miss
         1 042 810 743      unc_m_cas_count_wr
         1 102 344 008      unc_m_cas_count_rd
             67 281,69 MiB  uncore_imc_free_running/data_read/
             63 648,06 MiB  uncore_imc_free_running/data_write/
    
          22,583609093 seconds time elapsed
    

    With L2 prefetcher disabled:

    $ sudo rdmsr -a 0x1a4
    805
    ...
    
     Performance counter stats for 'system wide':
    
        33 238 834 224      L1-dcache-loads
        33 390 717 833      L1-dcache-stores
         1 056 842 798      longest_lat_cache.reference
         1 046 334 269      longest_lat_cache.miss
         1 042 839 732      unc_m_cas_count_wr
         1 097 031 629      unc_m_cas_count_rd
             66 957,32 MiB  uncore_imc_free_running/data_read/
             63 649,87 MiB  uncore_imc_free_running/data_write/
    
          26,643724536 seconds time elapsed
    

    With both L1 and L2 prefetchers disabled:

    $ sudo rdmsr -a 0x1a4
    80d
    
     Performance counter stats for 'system wide':
    
        33 332 447 582      L1-dcache-loads
        33 450 245 337      L1-dcache-stores
         1 060 023 042      longest_lat_cache.reference
         1 049 331 800      longest_lat_cache.miss
         1 043 207 792      unc_m_cas_count_wr
         1 168 724 853      unc_m_cas_count_rd
             71 333,15 MiB  uncore_imc_free_running/data_read/
             63 672,37 MiB  uncore_imc_free_running/data_write/
    
          71,155762499 seconds time elapsed
    

    These runs were done while logged into i3-wm (good documentation) on Ubuntu 24.04. Then I logged out to the Ubuntu login screen, connected via ssh, and ran the same test. I also tried logging into Gnome and running the same test over ssh. In both cases, I see a significantly larger background stream of DRAM reads. It looks like this:

     Performance counter stats for 'system wide':
    
        38 395 536 503      L1-dcache-loads 
        36 373 303 320      L1-dcache-stores
         1 173 051 207      longest_lat_cache.reference
         1 114 838 203      longest_lat_cache.miss
         1 068 149 338      unc_m_cas_count_wr
         2 502 982 468      unc_m_cas_count_rd
            152 769,70 MiB  uncore_imc_free_running/data_read/
             65 194,67 MiB  uncore_imc_free_running/data_write/
         
          72,310080444 seconds time elapsed
    

    I.e. the execution time is basically the same, but there are twice as many reads. This stream of background reads stays the same when I physically power down the monitor, and it goes away (the uncore reads drop back to ~69e3 MiB) when I am logged into i3. Powering down the monitor or running i3lock does not increase the uncore reads; i3lock may even decrease them by about 1e3 MiB, but I am not sure about that.
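    (To isolate that background stream, one could also just count the idle system for a fixed interval, e.g.:)

    perf stat -a -e uncore_imc_free_running/data_read/,uncore_imc_free_running/data_write/ -- sleep 10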