I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. In a relatively idle situation, I ran the following perf command; its output is shown below. The counters are offcore_response.all_data_rd.l3_miss.any_response and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.any_response,mem_load_uops_retired.l3_miss sleep 10
Performance counter stats for 'system wide':
3,713,037 offcore_response.all_data_rd.l3_miss.any_response
2,909,573 mem_load_uops_retired.l3_miss
10.016644133 seconds time elapsed
These two values seem consistent, as the latter excludes prefetch requests and requests not targeted at DRAM. But they do not match the read counter in the IMC. That counter is called UNC_IMC_DRAM_DATA_READS and is documented here. I read the counter and reread it 1 second later; the difference was around 30,000,000 (EDITED). Multiplied by 10 (to estimate for 10 seconds), the result is around 300 million (EDITED), which is 100 times the value of the above-mentioned performance counters (EDITED). It is nowhere near 3 million! What am I missing?
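(A minimal sketch of such a back-to-back read of the IMC counter, under some assumptions: that the counter sits at MCHBAR + 0x5050, which is the offset the PCM sources appear to use for the client IMC DRAM_DATA_READS register, that MCHBAR can be read from the host bridge config register 0x48, and that a devmem utility that simply prints the value, e.g. the busybox one, is available. None of this is taken from the documentation linked above, so treat it as a sketch, not a recipe:)
# Sample UNC_IMC_DRAM_DATA_READS twice, one second apart.
MCHBAR_RAW=$(sudo setpci -s 00:00.0 0x48.L)   # low 32 bits of the host-bridge MCHBAR register
MCHBAR=$(( 0x$MCHBAR_RAW & ~0x7fff ))          # clear the enable/low bits to get the MMIO base
R1=$(sudo devmem $(( MCHBAR + 0x5050 )) 32)    # DRAM_DATA_READS, counts 64-byte cache lines
sleep 1
R2=$(sudo devmem $(( MCHBAR + 0x5050 )) 32)    # wrap-around of the 32-bit counter is ignored here
echo "reads/s:      $(( R2 - R1 ))"
echo "approx. MB/s: $(( (R2 - R1) * 64 / 1000000 ))"   # 64 bytes per cache line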
P.S.: The difference is much smaller (but still large) when the system has more load.
The question is also asked here: https://community.intel.com/t5/Software-Tuning-Performance/Performance-Counters-and-IMC-Counter-Not-Matching/m-p/1288832
UPDATE:
Please note that the PCM output matches my IMC counter reads. This is the relevant PCM output:
The values in the READ, WRITE and IO columns are calculated based on UNC_IMC_DRAM_DATA_READS, UNC_IMC_DRAM_DATA_WRITES and UNC_IMC_DRAM_IO_REQUESTS, respectively. It seems that requests classified as IO are also counted as either READ or WRITE. In other words, during the depicted one-second interval, almost 2.01 GB of the 2.42 GB of READ and WRITE requests belong to IO ("almost" because of the inaccuracy reported in the above-mentioned doc). Based on this explanation, the three columns seem consistent with each other.
The problem is that there still exists a LARGE gap between the IMC and PMC values!
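(For reference, output like the above can be produced with the main pcm tool using a 1-second refresh period; the exact binary name may differ between PCM versions, so this is only an approximate invocation:)
# 1-second refresh period; run as root so PCM can program the uncore counters
sudo pcm 1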
The situation is the same when I boot into runlevel 1. The only processes on the run queue are swapper, kworker and migration. Disk IO is almost 85 KB/s. I'm wondering what leads to such a (relatively) huge amount of IO. Is it possible to detect its source (e.g., using a counter or a tool)?
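(A disk-throughput figure like the 85 KB/s above can be obtained with, e.g., iostat from the sysstat package; the specific tool is not important here:)
# Per-device throughput in kB/s, printed every second
iostat -d -k 1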
UPDATE 2:
I think there is something wrong with the IO column. It is always in the range [1.99, 2.01] GB, regardless of the amount of load on the system!
UPDATE 3:
In runlevel 1, the average number of occurrences of the uops_retired.all event in a 1-second interval is 15,000,000. During the same period, the number of read requests recorded by the associated IMC counter is around 30,000,000. In other words, assuming that all memory accesses are directly caused by CPU instructions, there would be two memory accesses for each retired micro-operation. This seems impossible, especially given that there are multiple levels of caches. Therefore, in the idle scenario, the read accesses are perhaps caused by IO.
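(The per-second uops_retired.all counts can be collected with interval-mode perf stat, roughly as follows; the exact command is a reconstruction rather than a copy from my shell history:)
# Count retired uops system-wide, printing one line every 1000 ms
sudo perf stat -a -e uops_retired.all -I 1000 sleep 10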
Actually, the traffic was mostly caused by the GPU device, which is why it did not show up in the core performance counters. Here is the relevant PCM output for a sample execution on a relatively idle system with resolution 3840x2160 and refresh rate 60 Hz, set using xrandr:
And this is for the situation with resolution 800x600 and the same refresh rate (i.e., 60 Hz):
As can be seen, changing the screen resolution reduced the read and IO traffic considerably (by more than 100x!).
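(For completeness, a mode switch like the one described can be done with xrandr roughly as follows; eDP-1 is only an example output name, use whatever plain xrandr reports for your display:)
# List outputs and supported modes, then switch the (hypothetical) eDP-1 output
xrandr
xrandr --output eDP-1 --mode 800x600 --rate 60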