I want to retrieve the number of DRAM accesses in my application. Precisely, I need to distinguish between data and code accesses. The processor is an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell). Based on the Intel Software Developer's Manual, Volume 3, and Perf, I could find and categorize the following memory-access-related events:
(A)
LLC-load-misses [Hardware cache event]
LLC-loads [Hardware cache event]
LLC-store-misses [Hardware cache event]
LLC-stores [Hardware cache event]
=========================================================================
(B)
mem_load_uops_l3_miss_retired.local_dram
mem_load_uops_retired.l3_miss
=========================================================================
(C)
offcore_response.all_code_rd.l3_miss.any_response
offcore_response.all_code_rd.l3_miss.local_dram
offcore_response.all_data_rd.l3_miss.any_response
offcore_response.all_data_rd.l3_miss.local_dram
offcore_response.all_reads.l3_miss.any_response
offcore_response.all_reads.l3_miss.local_dram
offcore_response.all_requests.l3_miss.any_response
=========================================================================
(D)
offcore_response.all_rfo.l3_miss.any_response
offcore_response.all_rfo.l3_miss.local_dram
=========================================================================
(E)
offcore_response.demand_code_rd.l3_miss.any_response
offcore_response.demand_code_rd.l3_miss.local_dram
offcore_response.demand_data_rd.l3_miss.any_response
offcore_response.demand_data_rd.l3_miss.local_dram
offcore_response.demand_rfo.l3_miss.any_response
offcore_response.demand_rfo.l3_miss.local_dram
=========================================================================
(F)
offcore_response.pf_l2_code_rd.l3_miss.any_response
offcore_response.pf_l2_data_rd.l3_miss.any_response
offcore_response.pf_l2_rfo.l3_miss.any_response
offcore_response.pf_l3_code_rd.l3_miss.any_response
offcore_response.pf_l3_data_rd.l3_miss.any_response
offcore_response.pf_l3_rfo.l3_miss.any_response
My choices are as follows:
1. It seems that the sum of LLC-load-misses and LLC-store-misses will return the whole DRAM accesses (equivalently, I could use LLC-misses in Perf).
2. For data-only accesses, I used mem_load_uops_retired.l3_miss. It does not include stores, but seems to be OK (because stores seem to be much less frequent?!).
3. LLC-load-misses - mem_load_uops_retired.l3_miss = DRAM Accesses for Code (as code is read-only).
Are these choices reasonable?
My other questions: (The 2nd one is the most important)
1. What are local_dram and any_response?
2. At first, it seems that group (C) is a higher-resolution version of the load events of group (A). But my tests show that the events in the former group are much more frequent than those in the latter. For example, in a simple benchmark, the number of offcore_response.all_reads.l3_miss.any_response events was twice the number of LLC-load-misses.
3. Group (E) pertains to demand reads (i.e., all non-prefetched reads). Does this mean that, e.g., offcore_response.all_data_rd.l3_miss.any_response - offcore_response.demand_data_rd.l3_miss.any_response = DRAM read accesses caused by prefetching?
Group (D) includes DRAM access events caused by Read for Ownership operations (for Cache Coherency Protocols). It seems irrelevant to my problem.
Group (F) counts DRAM reads caused by the L2-cache prefetcher, which is also irrelevant to my problem.
Based on my understanding of the question, I recommend using the following two events on the specified processor:
- OFFCORE_RESPONSE.ALL_READS.L3_MISS.LOCAL_DRAM: This includes all cacheable data read and write transactions and all code fetch transactions, whether the transaction is initiated by an instruction (retired or not) or a prefetch of any type. Each event represents exactly a 64-byte read request to the memory controller.
- OFFCORE_RESPONSE.ALL_CODE_RD.L3_MISS.LOCAL_DRAM: This includes all the code fetch accesses to the IMC.
(I think neither of these events occurs for uncacheable code fetch requests, but I've not tested this and the documentation is not clear on this.)
The "data accesses" can be measured separately from the "code accesses" by subtracting the second event from the first. These two events can be counted simultaneously on the same logical core on Haswell without multiplexing.
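The subtraction described above can be sketched in a few lines of Python. This is only an illustration, not a definitive tool: it assumes the counts were collected with something like `perf stat -x, -e <event1>,<event2> ./app` and that each CSV row starts with the count, then a unit field, then the event name (the layout may differ between perf versions, and event names may carry modifier suffixes). The helper names `parse_perf_csv` and `split_dram_reads` and the sample numbers are made up for the example.

```python
import csv
import io

ALL_READS = "offcore_response.all_reads.l3_miss.local_dram"
CODE_RD = "offcore_response.all_code_rd.l3_miss.local_dram"
LINE_BYTES = 64  # each counted event is one 64-byte read request


def parse_perf_csv(text):
    """Map event name -> count, assuming rows of the form value,unit,event,..."""
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines, comment lines, and rows too short to name an event.
        if len(row) < 3 or row[0].startswith("#"):
            continue
        value, event = row[0], row[2]
        # Rows such as "<not counted>" have no numeric value and are ignored.
        if value.isdigit():
            counts[event] = int(value)
    return counts


def split_dram_reads(counts):
    """Split cacheable DRAM read requests into code and data portions."""
    total = counts[ALL_READS]
    code = counts[CODE_RD]
    return {
        "code": code,
        "data": total - code,  # data = all reads minus code reads
        "total_bytes": total * LINE_BYTES,
    }


# Made-up sample in the assumed CSV layout, for illustration only.
SAMPLE = """\
1500000,,offcore_response.all_reads.l3_miss.local_dram,1000000000,100.00,,
200000,,offcore_response.all_code_rd.l3_miss.local_dram,1000000000,100.00,,
"""

if __name__ == "__main__":
    print(split_dram_reads(parse_perf_csv(SAMPLE)))
```

Multiplying the request count by 64 gives only the bytes read by these cacheable requests; it does not cover the other IMC traffic that these two events do not count.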
There are of course other transactions that do go to the IMC but are not counted by either of the two mentioned events. These include: (1) L3 writebacks, (2) uncacheable partial reads and writes from cores, (3) full WCB evictions, and (4) memory accesses from IO devices. Depending on the workload, accesses of types (1), (3), and (4) may well constitute a significant fraction of total accesses to the IMC.
It seems that the sum of LLC-load-misses and LLC-store-misses will return the whole DRAM accesses (equivalently, I could use LLC-misses in Perf).
Note the following:
- LLC-load-misses is a perf event mapped to the native event OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_MISS.ANY_RESPONSE.
- LLC-store-misses is mapped to OFFCORE_RESPONSE.DEMAND_RFO.L3_MISS.ANY_RESPONSE.
These are not the events you want because:
- The ANY_RESPONSE bit indicates that the event can occur for requests that target any unit, not just the IMC.
For data-only accesses, I used mem_load_uops_retired.l3_miss. It does not include stores, but seems to be OK (because stores seem to be much less frequent?!).
There are a number of issues with using mem_load_uops_retired.l3_miss on Haswell:
- "LLC-load-misses - mem_load_uops_retired.l3_miss = DRAM Accesses for Code" is incorrect.
What are local_dram and any_response?
Not all requests that miss in the L3 go to the IMC. A typical example is memory-mapped IO requests. You said you only want the core-originated requests that go to the IMC, so local_dram is the right bit.
At first, it seems that group (C) is a higher-resolution version of the load events of group (A). But my tests show that the events in the former group are much more frequent than those in the latter. For example, in a simple benchmark, the number of offcore_response.all_reads.l3_miss.any_response events was twice the number of LLC-load-misses.
This is normal because offcore_response.all_reads.l3_miss.any_response is inclusive of LLC-load-misses and can easily be significantly larger.
Group (E) pertains to demand reads (i.e., all non-prefetched reads). Does this mean that, e.g., offcore_response.all_data_rd.l3_miss.any_response - offcore_response.demand_data_rd.l3_miss.any_response = DRAM read accesses caused by prefetching?
No, because:
- the any_response bit, as explained above,