I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor running a Linux 4.15.0-20-generic kernel. On a relatively idle system, I ran the following perf command; its output is shown below. The counters are offcore_response.all_data_rd.l3_miss.local_dram, offcore_response.all_code_rd.l3_miss.local_dram and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss
^C
Performance counter stats for 'system wide':
229,579 offcore_response.all_data_rd.l3_miss.local_dram (99.72%)
489,151 offcore_response.all_code_rd.l3_miss.local_dram (99.77%)
110,543 mem_load_uops_retired.l3_miss (99.79%)
2.868899111 seconds time elapsed
As can be seen, event multiplexing occurred due to PMU counter sharing among these three events. In a similar scenario, I used the same command, except that I appended :D (the modifier mentioned in The Linux perf Event Scheduling Algorithm, which pins the event to the PMU) to prevent multiplexing for the third event:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D
^C
Performance counter stats for 'system wide':
539,397 offcore_response.all_data_rd.l3_miss.local_dram (68.71%)
890,344 offcore_response.all_code_rd.l3_miss.local_dram (68.67%)
193,555 mem_load_uops_retired.l3_miss:D
2.853095575 seconds time elapsed
But adding :D leads to much larger values for all counters, and this seems to occur only when event multiplexing is happening. Is this output normal? Are the percentage values in parentheses valid? How can the differences in counter values be prevented?
UPDATE:
I also traced the following loop implementation:
#include <iostream>
using namespace std;

int main()
{
    for (unsigned long i = 0; i < 3 * 1e9; i++)
        ;
    return 0;
}
This time, perf was executed 7 times (both with and without :D), but without the -a option. The commands are as follows:
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss ./loop
and
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D ./loop
The values of the three counters are compared in the following figures:
For all_data_rd, the :D modifier has almost no effect; for all_code_rd it reduces the values; and for load_uops_retired it causes larger values.
UPDATE 2:
Based on Peter Cordes's comments, I used a memory-intensive program as follows:
#include <iostream>
#include <cstring>
#define DIM_SIZE 10000
using namespace std;

char from[DIM_SIZE][DIM_SIZE], to[DIM_SIZE][DIM_SIZE];

int main()
{
    for (char x = 'a'; x <= 'z'; x++)
    {
        // set the 100 million element char array 'from' to x
        for (int i = 0; i < DIM_SIZE; i++)
            memset(from[i], x, DIM_SIZE);
        // copy the entire char array 'from' to char array 'to'
        for (int i = 0; i < DIM_SIZE; i++)
            memcpy(to[i], from[i], DIM_SIZE);
    }
    return 0;
}
The following perf commands and outputs show that the counter values are almost the same:
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D ./loop
Performance counter stats for './loop':
19,836,745 offcore_response.all_data_rd.l3_miss.local_dram (50.04%)
47,309 offcore_response.all_code_rd.l3_miss.local_dram (49.96%)
6,556,957 mem_load_uops_retired.l3_miss:D
0.592795335 seconds time elapsed
and
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss ./loop
Performance counter stats for './loop':
18,742,540 offcore_response.all_data_rd.l3_miss.local_dram (66.64%)
76,854 offcore_response.all_code_rd.l3_miss.local_dram (66.64%)
6,967,919 mem_load_uops_retired.l3_miss (66.72%)
0.575828303 seconds time elapsed
In your first two tests, you're not doing anything to generate a consistent amount of off-core memory traffic, so all you're measuring is the background "noise" on a mostly-idle system. (System-wide for the first test with -a; for the second, just interrupt handlers which run some code that does miss in cache while your task is the current task on a logical core.)
You never said anything about whether those differences are repeatable across runs, e.g. with perf stat ... -r5 to re-run the same test 5 times and print the average and variance. I'd expect it's not repeatable, just random fluctuation of background stuff like network and keyboard/mouse interrupts, but if the :D version makes a consistent or statistically significant difference, that might be interesting.
In your 2nd test, the loop you used won't create any extra memory traffic; it will either be purely registers, or, if compiled without optimization, will load/store the same one cache line, so everything will hit in L1d cache. So you still get a tiny number of counts, mostly from the CRT startup and libc init code. From your main itself, probably one code fetch and zero data loads that miss in L3, since stack memory was probably already hot in cache.
As I commented, a loop that does some memcpy or memset might make sense; or a simple known-good, already-written example is the STREAM benchmark. So your third test is finally something useful.
The code-read counts in the 3rd test are also just idle noise: glibc memset and memcpy are unrolled but small enough that they fit in the uop cache, so it doesn't even need to touch L1i cache.
We do have an interesting difference in counts for mem_load_uops_retired.l3_miss with / without :D, and for off-core data-read L3 misses.
With the program having two different phases (memset and memcpy), different multiplexing timing will sample different amounts of each phase. Extrapolating to the whole run time from a few of those sample periods won't always be correct.
If memset uses NT stores (or rep stosb on a CPU with ERMSB), it won't be doing any reads, just writes. Those methods use a no-RFO store protocol to avoid fetching the old value of a cache line, because they're optimized for overwriting the full line. Without that, plain stores could generate offcore_response.all_data_rd.l3_miss.local_dram counts, I think, if RFOs (Reads For Ownership) count as part of all-data-read. (On my Skylake, perf list has separate events for offcore_response.demand_data_rd vs. offcore_response.demand_rfo, but "all" would include both, plus prefetches (non-"demand").)
memcpy has to read its source data, so it has a source of mem_load_uops_retired.l3_miss events. It can also use NT stores or ERMSB to avoid load requests on the destination.
Glibc memcpy / memset do use NT stores or rep movsb / rep stosb for large-enough buffers. So there will be a large difference in the rate of offcore_response.all_data_rd.l3_miss.local_dram as well as mem_load_uops_retired.l3_miss between the memcpy and memset portions of your workload.
(I suggested memcpy or memset when recommending a simple consistent workload because that would make it uniform over time. I hadn't looked carefully at which events you were counting: only load uops and offcore load requests. So memset wasn't a useful suggestion.)
And of course code reads that miss all the way to L3 are hard to generate. In hand-written asm you could make a huge block of fully-unrolled instructions with GAS .rept 10000000 ; lea %fs:0x1234(,%rax,4), %rcx ; .endr, or with NASM times 10000000 lea rcx, [fs: 0x1234 + rax*4]. (The segment prefix is just there to make the instruction longer; it has no effect on the result.) That's a long-ish instruction (7 bytes) with no false dependency that can run at least 2 per clock on modern Intel and AMD CPUs, so it should go quickly through the CPU's legacy decoders, maybe fast enough to outpace prefetch and result in lots of demand loads for code that miss in L3.