I am trying to measure memory-bus-related performance of a simple test program on an Intel N150 (Twin Lake, which has four Gracemont cores, like Alder Lake E-cores). The PMU counters from perf stat don't make complete sense. The L1-dcache and uncore counters make sense, cache-references a bit less, and LLC-[loads|stores] are just strange. I assumed that LLC-[load|store]-misses should be directly related to transactions on the memory bus: an LLC miss should lead to an access to DRAM. But the counters don't show that at all. I also don't find the LLC events in /sys/, so I don't know which raw PMU events they are mapped to:
$ ls /sys/bus/event_source/devices/*/events/ | grep -i "llc"
$
The program simply initializes a large array of data (1GB), and runs a trivial calculation over it 32 or 64 times:
constexpr int N = 256'000'000;
unsigned int AData[N];

template <typename T>
T procItem(T item) {
    return item & 0b11011101011;
}

int main() {
    ...
    for (unsigned long long i=0; i<N; i++) {
        AData[i] = i;
    }

    constexpr unsigned n_proc = 32; // or 64
    for (int i=0; i<n_proc; i++) {
        for (auto& item : AData) {
            item += procItem(item);
        }
    }
}
I compile it without optimizations. A run with n_proc=32 shows something like this:
g++ -std=c++23 test.cpp -o test
perf stat -e cache-references,cache-misses,L1-dcache-loads,LLC-loads,LLC-load-misses,L1-dcache-stores,LLC-stores,LLC-store-misses -- ./test
Performance counter stats for './test':
976 160 248 cache-references (50,00%)
566 015 655 cache-misses # 57,98% of all cache refs (62,50%)
99 720 247 740 L1-dcache-loads (62,51%)
11 976 479 LLC-loads (62,50%)
6 361 LLC-load-misses # 0,05% of all LL-cache accesses (62,50%)
50 016 740 349 L1-dcache-stores (62,50%)
11 492 008 LLC-stores (37,50%)
6 866 101 LLC-store-misses (37,50%)
22,284520679 seconds time elapsed
20,959478000 seconds user
0,320961000 seconds sys
It is already a bit strange here: how do cache-references and cache-misses relate to the L1-dcache-* or LLC-* counters?
A run with n_proc=64:
Performance counter stats for './test':
1 927 756 869 cache-references (50,00%)
1 098 114 980 cache-misses # 56,96% of all cache refs (62,50%)
198 078 159 110 L1-dcache-loads (62,50%)
17 748 504 LLC-loads (62,50%)
10 305 LLC-load-misses # 0,06% of all LL-cache accesses (62,50%)
99 170 857 885 L1-dcache-stores (62,50%)
11 529 003 LLC-stores (37,50%)
6 786 904 LLC-store-misses (37,50%)
47,097830995 seconds time elapsed
45,767071000 seconds user
0,322958000 seconds sys
The cache-* and L1-dcache-* counters increase by 2x, as expected. But the LLC-* counters don't. LLC-stores in particular are strange: they don't really change significantly.
Also of note, perf list has mem-loads and mem-stores events. But mem-loads always shows a count of 0, and the mem-stores counts are the same as L1-dcache-stores. (I cannot find the L1-dcache events under /sys/bus/event_source/devices/*/events/, so I cannot compare the raw event and umask to be sure.)
$ cat /sys/bus/event_source/devices/cpu/events/mem-loads
event=0xd0,umask=0x5,ldlat=3
$ cat /sys/bus/event_source/devices/cpu/events/mem-stores
event=0xd0,umask=0x6
$ uname -r
6.11.0-29-generic
Then, if I compile it with -Og, I get this:
n_proc=32
Performance counter stats for './test':
973 674 502 cache-references (50,01%)
671 750 305 cache-misses # 68,99% of all cache refs (62,51%)
16 764 160 104 L1-dcache-loads (62,50%)
58 190 499 LLC-loads (62,50%)
560 785 LLC-load-misses # 0,96% of all LL-cache accesses (62,50%)
16 958 572 799 L1-dcache-stores (62,49%)
11 632 355 LLC-stores (37,50%)
6 481 939 LLC-store-misses (37,50%)
11,281720398 seconds time elapsed
9,968157000 seconds user
0,311973000 seconds sys
n_proc=64
Performance counter stats for './test':
1 915 396 715 cache-references (50,00%)
1 313 091 424 cache-misses # 68,55% of all cache refs (62,50%)
33 175 225 108 L1-dcache-loads (62,50%)
115 378 508 LLC-loads (62,51%)
1 089 098 LLC-load-misses # 0,94% of all LL-cache accesses (62,50%)
33 354 560 864 L1-dcache-stores (62,50%)
12 073 424 LLC-stores (37,49%)
6 552 391 LLC-store-misses (37,50%)
21,374890682 seconds time elapsed
20,049017000 seconds user
0,318920000 seconds sys
The L1-dcache counts decreased, as expected from more efficient code. But LLC-loads increased with respect to the runs without -Og. LLC-loads do increase by a factor of 2 from n_proc=32 to n_proc=64, which makes sense. But LLC-stores have not really changed.
Finally, I also ran it with the uncore events that measure DRAM CAS commands, i.e. actual memory bus transactions. In this case, perf stat has to run system-wide with -a; otherwise the uncore events show <not supported>.
perf stat -e cache-references,cache-misses,L1-dcache-loads,LLC-loads,LLC-load-misses,L1-dcache-stores,LLC-stores,LLC-store-misses \
-e unc_m_cas_count_rd,unc_m_cas_count_wr -e uncore_imc_free_running/data_read/ \
-a -- ./test
With -Og compilation:
n_proc=32
Performance counter stats for 'system wide':
998 009 503 cache-references (49,99%)
675 324 843 cache-misses # 67,67% of all cache refs (62,50%)
16 831 117 958 L1-dcache-loads (62,50%)
61 496 976 LLC-loads (62,51%)
556 737 LLC-load-misses # 0,91% of all LL-cache accesses (62,51%)
16 999 289 798 L1-dcache-stores (62,51%)
12 125 538 LLC-stores (37,49%)
6 425 956 LLC-store-misses (37,49%)
547 439 524 unc_m_cas_count_rd
528 958 625 unc_m_cas_count_wr
33 413,03 MiB uncore_imc_free_running/data_read/
11,515426638 seconds time elapsed
n_proc=64
Performance counter stats for 'system wide':
1 964 026 474 cache-references (50,01%)
1 322 080 946 cache-misses # 67,31% of all cache refs (62,51%)
33 291 196 083 L1-dcache-loads (62,50%)
122 590 187 LLC-loads (62,50%)
1 083 470 LLC-load-misses # 0,88% of all LL-cache accesses (62,50%)
33 430 279 894 L1-dcache-stores (62,50%)
13 117 422 LLC-stores (37,50%)
6 436 536 LLC-store-misses (37,50%)
1 077 224 939 unc_m_cas_count_rd
1 041 069 003 unc_m_cas_count_wr
65 748,53 MiB uncore_imc_free_running/data_read/
21,641199259 seconds time elapsed
So, the uncore CAS events also make sense. It looks like 1 CAS command corresponds to a transaction of 32 bytes: 1G read + 1G write commands = 64 GB of uncore_imc_free_running/data_read/. Is that correct?
Also, it looks like one L1-dcache-[load|store] means a load or store of one byte: 33G L1-dcache-loads (most of which must miss) = 1G of 32-byte unc_m_cas_count_rd. Is that correct? Does it depend on the register size in the instructions, or is it always counted per byte?
Then, how do cache-references relate to L1-dcache-[loads|stores] and the uncore counters? perf list says on one line that cache-references is a Hardware event, and on another that it is a Kernel PMU event. If it is a Kernel PMU event, could these counters just be somewhat unreliable? I.e. should cache-misses be equal to unc_m_cas_count_rd + unc_m_cas_count_wr? Or can one cache-miss trigger two memory transactions, a read and a write to DRAM together?
Finally, what to make of LLC-loads and especially LLC-stores? It seems like LLC-loads do mean something, it's just not clear how they relate to the other metrics. But LLC-stores are strange. I don't find these events under /sys/bus/event_source/devices/, but they are listed at the beginning of perf list:
$ perf list
branch-instructions OR branches [Hardware event]
...
tool:
...
cache:
L1-dcache-loads OR cpu/L1-dcache-loads/
L1-dcache-stores OR cpu/L1-dcache-stores/
L1-icache-loads OR cpu/L1-icache-loads/
L1-icache-load-misses OR cpu/L1-icache-load-misses/
LLC-loads OR cpu/LLC-loads/
LLC-load-misses OR cpu/LLC-load-misses/
LLC-stores OR cpu/LLC-stores/
LLC-store-misses OR cpu/LLC-store-misses/
...
I also ran this program under the VTune Memory Access analysis. The analysis shows the CAS counters for the memory bus bandwidth on the platform. It looks like VTune uses the mem_uops_retired.all_[loads|stores] counters as Loads and Stores, and the L1-dcache-* events map to exactly the same thing:
perf stat -e L1-dcache-loads,L1-dcache-stores \
-e mem_uops_retired.all_loads,mem_uops_retired.all_stores \
-- ./test
Performance counter stats for './test':
33 190 440 300 L1-dcache-loads
33 364 392 208 L1-dcache-stores
33 190 440 300 mem_uops_retired.all_loads
33 364 392 208 mem_uops_retired.all_stores
31,416008016 seconds time elapsed
29,959729000 seconds user
0,444892000 seconds sys
The answer to the main question: LLC-[loads|stores] don't match the DRAM transactions (the uncore CAS commands unc_m_cas_count_[rd|wr]) because they don't count all load and store events in the LLC. They count only the "demand" events, which are initiated directly by the program's instructions. But the CPU manages the traffic to system memory mostly asynchronously: the hardware prefetcher loads data from DRAM into the caches, and cache evictions trigger writebacks that store data from the caches back to DRAM.
Prefetching measurably improves the program's execution time (execution is about 3x slower in this case if all HW prefetching is disabled). There are about the same number of DRAM transactions, but fewer demand misses, especially in code like this that runs slowly enough for HW prefetch to usually have the data all the way into L2 or L1d before the CPU tries to access it. So the core sees mostly L1d hits and doesn't have to stall much. (Compiling with -O3 to use SIMD would make the code fast enough that memory can't keep up, resulting in demand loads having to wait for an already-started prefetch; the prefetcher still helps by keeping the maximum number of loads in flight, to get close to peak single-core memory bandwidth.) An update on the effect of the L2 and L1 prefetchers on DRAM transactions and execution time is at the end of the answer.
Following @peter-cordes's comments, I looked more into the Intel perfmon events, tried disabling the HW prefetchers via MSR 0x1a4, and can confirm that this is indeed how LLC-[loads|stores] work. The following are the details of this investigation, in the order of the three original questions.
L1-dcache-[loads|stores] count all load and store instructions
I.e. instructions that access memory, like mov (%rdi),%r11. I tried to change AData:
//unsigned int AData[N];
unsigned short AData[N];
It results in the same L1-dcache counts, since we compiled without auto-vectorization. But the memory-related counts decrease by 2x, since we only access half the total number of cache lines. (SIMD vectorization would access the L1d cache in 16-byte chunks regardless of element width, and would typically run into a memory bandwidth bottleneck unless the data is hot in L1d or maybe the L2 cache.)
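For reference, here is the size arithmetic behind that factor of 2, as a quick sketch with my own numbers (64-byte cache lines assumed):

#include <cstdio>

int main() {
    constexpr long long N = 256'000'000;
    constexpr long long line = 64;   // cache line size in bytes
    std::printf("unsigned int:   %lld cache lines\n", N * 4 / line);  // 16e6 lines, ~1 GB
    std::printf("unsigned short: %lld cache lines\n", N * 2 / line);  // 8e6 lines, ~512 MB
    // Same number of load/store instructions either way, but half the cache lines
    // (and so roughly half the LLC and DRAM traffic) with unsigned short.
}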
To fiddle with the instructions, I used g++ to dump assembly with debug line info (clang++ could not compile the assembly with debug lines back into a binary):
# Makefile
%.s: %.cpp
	g++ -std=c++23 -Og -g -S $< -o $@

%: %.s
	g++ -std=c++23 -Og -g $< -o $@
(Some small gotchas here: leaq AData(%rip), %r11 uses the same syntax as a memory access, but lea does not access memory; it only calculates the address. Also, most memory-destination instructions other than MOV, e.g. addl $1, (%rdi), are both a load and a store. On Atom cores, 256-bit load/store instructions decode to 2 load or store uops; the same goes for 80-bit x87 fldt/fstpt for long double.)
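To illustrate that last gotcha with a trivial example of my own (not part of the test program): a read-modify-write through a pointer is one load plus one store from the PMU's point of view, whether the compiler emits a memory-destination add or a separate load/add/store sequence.

// Counts as 1 L1-dcache-load and 1 L1-dcache-store per call, whether it
// compiles to  addl $1, (%rdi)  or to a movl / addl / movl sequence.
void bump(unsigned int* p) {
    *p += 1;
}

int main() {
    unsigned int x = 41;
    bump(&x);   // one load uop and one store uop
    return x == 42 ? 0 : 1;
}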
In the case of my program, there is only 1 load and 1 store inside the loop, on the item reference. But since I call procItem for the calculation and -Og doesn't inline it, the call/ret pair adds 1 more load and 1 more store, of the return address:
template <typename T>
T procItem(T item) {
    return item & 0b11011101011;
}

constexpr unsigned n_proc = 32;
for (int i=0; i<n_proc; i++) {
    for (auto& item : AData) {
        item += procItem(item);
        //item += item & 0b11011101011;
        // this shows half as many L1-dcache loads and stores
    }
}
When the return-address load and store are accounted for, the L1-dcache counts add up exactly to the expected numbers.
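A back-of-the-envelope version of that bookkeeping (my own sketch, assuming that with -Og each inner iteration does one array load plus the ret pop, and one array store plus the call push):

#include <cstdio>

int main() {
    constexpr long long N = 256'000'000;
    constexpr long long n_proc = 64;
    constexpr long long iters  = N * n_proc;      // inner-loop iterations
    constexpr long long loads  = iters * 2;       // item load + return-address pop (ret)
    constexpr long long stores = iters * 2 + N;   // item store + return-address push (call) + init loop
    std::printf("expected ~%lld loads, ~%lld stores\n", loads, stores);
    // ~32.8e9 and ~33.0e9, close to the measured ~33.2e9 L1-dcache-loads
    // and ~33.4e9 L1-dcache-stores in the -Og, n_proc=64 runs.
}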
cache-references are indeed all LLC references
cache-references = longest_lat_cache.references and cache-misses = longest_lat_cache.misses, which:
Counts the number of cacheable memory requests that miss in the Last Level Cache (LLC). Requests include demand loads, reads for ownership (RFO), instruction fetches and L1 HW prefetches. If the core has access to an L3 cache, the LLC is the L3 cache, otherwise it is the L2 cache.
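As a rough sanity check of the magnitude of these counts for this program (my own arithmetic, assuming roughly one demand or prefetch read plus one RFO per cache line per pass over the array):

#include <cstdio>

int main() {
    constexpr long long array_bytes = 256'000'000LL * 4;   // unsigned int elements, ~1 GB
    constexpr long long lines   = array_bytes / 64;        // 16e6 cache lines
    constexpr long long passes  = 64;
    constexpr long long llc_req = lines * passes * 2;      // read + RFO per line per pass
    std::printf("expected LLC requests ~%lld\n", llc_req);
    // ~2.0e9, in the same ballpark as the ~1.9e9 cache-references measured
    // in the n_proc=64 runs above.
}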
When the L2 HW prefetcher is turned off, these counts do become about the same as the transactions to DRAM. After disabling the HW prefetcher in BIOS:
$ sudo rdmsr --all 0x1a4
804
...
# after BIOS change:
$ sudo rdmsr --all 0x1a4
805
...
Register 0x1a4 is the MSR for prefetch control on the 12th-13th gen P-cores (section 2.17.5 in the MSR manual, version 88, 2025-06). But I think it is the same on the E-cores. (My E-core CPU family is 6_BEh, which does not have a section in the manual.) From the manual's "Table 2-47. MSRs Supported by 12th and 13th Generation Intel® Core™ Processor P-core", for 0x1a4:
Register Address: 1A4H, 420 (MSR_PREFETCH_CONTROL)

Bit | Bit Description
---|---
0 | L2_HARDWARE_PREFETCHER_DISABLE: If 1, disables the L2 hardware prefetcher, which fetches additional lines of code or data into the L2 cache.
1 | L2_ADJACENT_CACHE_LINE_PREFETCHER_DISABLE: If 1, disables the adjacent cache line prefetcher, which fetches the cache line that comprises a cache line pair (128 bytes).
2 | DCU_HARDWARE_PREFETCHER_DISABLE: If 1, disables the L1 data cache prefetcher, which fetches the next cache line into the L1 data cache.
3 | DCU_IP_PREFETCHER_DISABLE: If 1, disables the L1 data cache IP prefetcher, which uses sequential load history (based on the instruction pointer of previous loads) to determine whether to prefetch additional lines.
So 0x1a4 = 0x805 means that only the L2 HW prefetcher is disabled, not the DCU, i.e. the L1 prefetchers: relative to the power-on default of 0x804, only bit 0 has changed.
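Since the E-core bit layout is not documented, I find it easiest to reason about which bits changed relative to the power-on default. A small helper sketch of my own, using the P-core bit names from the table above as an assumption:

#include <cstdint>
#include <cstdio>

int main() {
    constexpr std::uint64_t power_on_default = 0x804;   // value read before any changes
    const char* bit_name[4] = {
        "L2_HARDWARE_PREFETCHER_DISABLE",
        "L2_ADJACENT_CACHE_LINE_PREFETCHER_DISABLE",
        "DCU_HARDWARE_PREFETCHER_DISABLE",
        "DCU_IP_PREFETCHER_DISABLE",
    };
    for (std::uint64_t value : {0x805ULL, 0x80dULL}) {
        std::printf("0x%llx changes:", static_cast<unsigned long long>(value));
        const std::uint64_t changed = value ^ power_on_default;
        for (int b = 0; b < 4; ++b)
            if (changed & (1ULL << b)) std::printf(" bit %d (%s)", b, bit_name[b]);
        std::printf("\n");
    }
    // 0x805 -> only bit 0 (L2 HW prefetcher) differs from the default;
    // 0x80d -> bits 0 and 3 (L2 HW prefetcher and DCU IP prefetcher) differ.
}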
When the L2 prefetcher is disabled, longest_lat_cache.references (aka cache-references) gets much closer to the counts of DRAM transactions:
0x1a4 = 0x805:
32 repetitions g++ -Og:
Performance counter stats for 'system wide':
17 143 131 658 L1-dcache-loads (62,49%)
17 191 800 959 L1-dcache-stores (62,49%)
10 693 254 LLC-loads (62,50%)
0 LLC-load-misses (62,50%)
22 570 714 LLC-stores (25,00%)
0 LLC-store-misses (25,00%)
558 915 095 cache-references (37,50%)
538 758 089 cache-misses # 96,39% of all cache refs (50,00%)
614 170 946 unc_m_cas_count_rd
560 849 764 unc_m_cas_count_wr
37 485,92 MiB uncore_imc_free_running/data_read/
14,739593106 seconds time elapsed
Here, the amount of data read from DRAM is larger than in the usual runs with the prefetcher on: 37 GB vs 33-34 GB. So when the HW prefetcher is ON, the cache-references count is a bit higher, but there are fewer DRAM transactions. I am not sure exactly why longest_lat_cache.references gets higher, but it makes the system more efficient.
LLC-[loads|stores] count only the "demand" loads and stores
Firstly, LLC-loads is ocr.demand_data_rd.any_response and LLC-stores is ocr.demand_rfo.any_response. I conclude that these counters are the same by running the test with no other counters included, so that there is no multiplexing, like this:
Performance counter stats for './test':
112 321 597 LLC-loads
12 027 750 LLC-stores
112 321 597 ocr.demand_data_rd.any_response
12 027 750 ocr.demand_rfo.any_response
22,237195796 seconds time elapsed
There was one odd case where LLC-loads != ocr.demand_data_rd.any_response, when counting system-wide:
Performance counter stats for 'system wide':
116 590 144 LLC-loads
12 267 280 LLC-stores
116 590 200 ocr.demand_data_rd.any_response
12 267 280 ocr.demand_rfo.any_response
21,175772075 seconds time elapsed
Not sure what this means. Maybe that's some hiccup in how perf operates.
These LLC-[loads|stores] are only the "demand" events, which (as Peter pointed out) are transactions that originate directly from the instructions, not from the prefetcher or from cache evictions & writebacks. The CPU manages memory asynchronously, with the prefetcher for loads and the writebacks for stores.
In my case, the writeback stores should be counted by ocr.corewb_m.any_response:
Performance counter stats for 'system wide':
91 968 319 LLC-loads (50,00%)
13 486 144 LLC-stores (50,00%)
6 727 443 LLC-store-misses (50,00%)
1 011 163 180 ocr.corewb_m.any_response (50,00%)
1 125 387 756 unc_m_cas_count_rd
1 055 465 271 unc_m_prefetch_rd
1 043 462 855 unc_m_cas_count_wr
27,041476888 seconds time elapsed
So, LLC-stores (or LLC-store-misses?) plus ocr.corewb_m.any_response should roughly add up to unc_m_cas_count_wr. And I think they were always close, but I did not pay a lot of attention to the store writebacks.
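A quick check of that accounting against the system-wide run above (my own arithmetic, rounded; the events are only expected to roughly agree):

#include <cstdio>

int main() {
    // Numbers from the run above, in millions of events.
    constexpr double llc_loads   =   92.0;   // LLC-loads (demand data reads)
    constexpr double llc_stores  =   13.5;   // LLC-stores (demand RFOs)
    constexpr double corewb      = 1011.2;   // ocr.corewb_m.any_response (writebacks)
    constexpr double prefetch_rd = 1055.5;   // unc_m_prefetch_rd
    std::printf("reads:  %.0fM + %.0fM = %.0fM vs %.0fM unc_m_cas_count_rd\n",
                prefetch_rd, llc_loads, prefetch_rd + llc_loads, 1125.4);
    std::printf("writes: %.0fM + %.0fM = %.0fM vs %.0fM unc_m_cas_count_wr\n",
                corewb, llc_stores, corewb + llc_stores, 1043.5);
    // Both sums land within a few percent of the corresponding CAS counts.
}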
Now, it seems there is no good ocr.* event for the prefetched loads. There is unc_m_prefetch_rd from the uncore events:
MSR 0x1a4 = 804, the prefetchers are ON
Performance counter stats for 'system wide':
7 936 072 LLC-loads (50,00%)
6 663 105 LLC-stores (50,00%)
3 416 848 LLC-store-misses (50,00%)
262 769 598 ocr.corewb_m.any_response (50,00%)
298 729 402 unc_m_cas_count_rd
267 441 361 unc_m_prefetch_rd
266 849 074 unc_m_cas_count_wr
11,584572736 seconds time elapsed
I thought that by disabling the prefetcher, LLC-loads would get close to unc_m_cas_count_rd and unc_m_prefetch_rd would drop to zero. But not so fast:
MSR 0x1a4 = 805, L2 prefetcher is off:
Performance counter stats for 'system wide':
1 608 878 LLC-loads (40,00%)
1 613 284 ocr.demand_data_rd.any_response (60,00%)
8 752 439 LLC-stores (60,00%)
0 LLC-store-misses (40,00%)
265 365 436 ocr.corewb_m.any_response (40,00%)
294 985 003 unc_m_cas_count_rd
10 051 unc_m_prefetch_rd
267 407 004 unc_m_cas_count_wr
11,876177950 seconds time elapsed
Here I was writing 0x807 into the register, but its value was being set to 0x805:
$ sudo wrmsr --all 0x1a4 0x807
$ sudo rdmsr --all 0x1a4
805
805
805
805
Then I noticed that the default value 0x804 should be overwritten with 0x80f, and finally the prefetchers were completely off:
$ sudo wrmsr --all 0x1a4 0x80f
$ sudo rdmsr --all 0x1a4
80d
80d
80d
80d
Performance counter stats for 'system wide':
266 993 851 longest_lat_cache.miss (83,33%)
269 460 818 longest_lat_cache.reference (83,33%)
253 769 648 LLC-loads (83,33%)
253 716 493 ocr.demand_data_rd.any_response (83,33%)
253 067 848 ocr.demand_data_rd.l3_miss (83,33%)
253 067 530 ocr.demand_data_rd.l3_miss_local (83,33%)
328 065 224 unc_m_cas_count_rd
10 325 unc_m_prefetch_rd
34,463586191 seconds time elapsed
Not sure why unc_m_prefetch_rd is not zero, though.
So, I think, that closes all the points. And if one wants to do a TMA-like analysis of memory transactions, it would look like this:
- Bandwidth = Bus Width × Bus Speed × Number of Channels × 2
- DRAM reads: unc_m_cas_count_rd (uncore event, for the whole system) ≈ unc_m_prefetch_rd (uncore) + ocr.demand_data_rd.any_response (CPU core event, per-process)
- DRAM writes: unc_m_cas_count_wr (uncore event, for the whole system) ≈ ocr.corewb_m.any_response (CPU core event, per-process) + ocr.demand_rfo.any_response (CPU core event, per-process)
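As a hypothetical instance of the bandwidth formula (placeholder numbers, not necessarily the N150's actual memory configuration): a single 64-bit channel with a 2400 MHz memory clock, double data rate.

#include <cstdio>

int main() {
    constexpr double bus_width_bytes = 8.0;     // 64-bit channel
    constexpr double bus_clock_hz    = 2400e6;  // memory clock; DDR transfers twice per clock
    constexpr int    channels        = 1;
    constexpr double peak = bus_width_bytes * bus_clock_hz * channels * 2;
    std::printf("theoretical peak: %.1f GB/s\n", peak / 1e9);   // 38.4 GB/s for these numbers
}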
Update on the DRAM transactions when the L2 & L1 prefetchers are ON or OFF.
- With the L2 prefetcher ON there are more longest_lat_cache.reference and .miss, but the overall execution time of the program is measurably faster (10% faster when the L2 prefetcher is ON vs when it is OFF).
- The DRAM transactions unc_m_cas_count_[rd|wr] remain about the same. Except that unc_m_cas_count_rd seems to increase slightly; I believe that is due to a background stream of DRAM reads (the following shows it). And the number of longest_lat_cache.reference goes down to about the same number as the uncore CAS commands.
- When the L1 prefetchers are also disabled, unc_m_cas_count_rd increases. Again, that seems to be due to the background stream of DRAM reads.

The typical performance (a median of the execution time over 5 runs) looks like this for 64 repeats and -Og optimization:
$ sudo rdmsr -a 0x1a4
804
...
Performance counter stats for 'system wide':
33 214 476 658 L1-dcache-loads
33 377 188 158 L1-dcache-stores
1 916 345 898 longest_lat_cache.reference
1 307 250 568 longest_lat_cache.miss
1 042 810 743 unc_m_cas_count_wr
1 102 344 008 unc_m_cas_count_rd
67 281,69 MiB uncore_imc_free_running/data_read/
63 648,06 MiB uncore_imc_free_running/data_write/
22,583609093 seconds time elapsed
With L2 prefetcher disabled:
$ sudo rdmsr -a 0x1a4
805
...
Performance counter stats for 'system wide':
33 238 834 224 L1-dcache-loads
33 390 717 833 L1-dcache-stores
1 056 842 798 longest_lat_cache.reference
1 046 334 269 longest_lat_cache.miss
1 042 839 732 unc_m_cas_count_wr
1 097 031 629 unc_m_cas_count_rd
66 957,32 MiB uncore_imc_free_running/data_read/
63 649,87 MiB uncore_imc_free_running/data_write/
26,643724536 seconds time elapsed
With both L1 and L2 prefetchers disabled:
$ sudo rdmsr -a 0x1a4
80d
Performance counter stats for 'system wide':
33 332 447 582 L1-dcache-loads
33 450 245 337 L1-dcache-stores
1 060 023 042 longest_lat_cache.reference
1 049 331 800 longest_lat_cache.miss
1 043 207 792 unc_m_cas_count_wr
1 168 724 853 unc_m_cas_count_rd
71 333,15 MiB uncore_imc_free_running/data_read/
63 672,37 MiB uncore_imc_free_running/data_write/
71,155762499 seconds time elapsed
These runs were done while logged into i3-wm (good documentation) on Ubuntu 24.04. Then I logged out to the Ubuntu login screen, connected via ssh, and ran the same test. I also tried logging into Gnome and running the same test from ssh. In both cases, I see a significantly larger background stream of DRAM reads. It looks like this:
Performance counter stats for 'system wide':
38 395 536 503 L1-dcache-loads
36 373 303 320 L1-dcache-stores
1 173 051 207 longest_lat_cache.reference
1 114 838 203 longest_lat_cache.miss
1 068 149 338 unc_m_cas_count_wr
2 502 982 468 unc_m_cas_count_rd
152 769,70 MiB uncore_imc_free_running/data_read/
65 194,67 MiB uncore_imc_free_running/data_write/
72,310080444 seconds time elapsed
I.e. the execution time is basically the same, but there are twice as many reads. This stream of background reads remains the same when I physically power down the monitor. And it goes away (the uncore reads drop back to about 69e3 MiB) when I am logged into i3. Powering down the monitor or running i3lock does not increase the uncore reads; i3lock may even decrease them by about 1e3 MiB, but I am not sure about that.