performance-testing intel performancecounter perf memory-access

Performance Counters for DRAM Accesses

I want to retrieve the number of DRAM accesses in my application. Precisely, I need to distinguish between data and code accesses. The processor is an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell). Based on Intel Software Developer's Manual, Volume 3 and Perf, I could find and categorize the following memory-access-related events:

(A)
LLC-load-misses                                    [Hardware cache event]
LLC-loads                                          [Hardware cache event]
LLC-store-misses                                   [Hardware cache event]
LLC-stores                                         [Hardware cache event]
=========================================================================
(B)
mem_load_uops_l3_miss_retired.local_dram          
mem_load_uops_retired.l3_miss  
=========================================================================
(C)
offcore_response.all_code_rd.l3_miss.any_response 
offcore_response.all_code_rd.l3_miss.local_dram   
offcore_response.all_data_rd.l3_miss.any_response 
offcore_response.all_data_rd.l3_miss.local_dram   
offcore_response.all_reads.l3_miss.any_response   
offcore_response.all_reads.l3_miss.local_dram     
offcore_response.all_requests.l3_miss.any_response
=========================================================================
(D)
offcore_response.all_rfo.l3_miss.any_response     
offcore_response.all_rfo.l3_miss.local_dram       
=========================================================================
(E)
offcore_response.demand_code_rd.l3_miss.any_response
offcore_response.demand_code_rd.l3_miss.local_dram
offcore_response.demand_data_rd.l3_miss.any_response
offcore_response.demand_data_rd.l3_miss.local_dram
offcore_response.demand_rfo.l3_miss.any_response  
offcore_response.demand_rfo.l3_miss.local_dram    
=========================================================================
(F)
offcore_response.pf_l2_code_rd.l3_miss.any_response
offcore_response.pf_l2_data_rd.l3_miss.any_response
offcore_response.pf_l2_rfo.l3_miss.any_response   
offcore_response.pf_l3_code_rd.l3_miss.any_response
offcore_response.pf_l3_data_rd.l3_miss.any_response
offcore_response.pf_l3_rfo.l3_miss.any_response

My choices are as follows:

It seems that the sum of LLC-load-misses and LLC-store-misses will return the whole DRAM accesses (equivalently, I could use LLC-misses in Perf).
For data-only accesses, I used mem_load_uops_retired.l3_miss. It does not include stores, but seems to be OK (because stores seem to be much less frequent?!).
Simplistically, LLC-load-misses - mem_load_uops_retired.l3_miss = DRAM Accesses for Code (As code is read-only).

Are these choices reasonable?

My other questions: (The 2nd one is the most important)

What are local_dram and any_response?
At first, it seems that, group (C), is a higher resolution version of the load events of group (A). But my tests show that the events in the former group is much more frequent than the latter. For example, in a simple benchmark, the number of offcore_response.all_reads.l3_miss.any_response events were twice as many as LLC-load-misses.
Group (E), pertains to demand reads (i.e., all non-prefetched reads). Does this mean that, e.g.: offcore_response.all_data_rd.l3_miss.any_response - offcore_response.demand_data_rd.l3_miss.any_response = DRAM read accesses caused by prefeching?

Group (D), includes DRAM access events caused by Read for Ownership operations (for Cache Coherency Protocols). It seems irrelevant to my problem.

Group (F), counts DRAM reads caused by L2-cache prefetcher which is also irrelevant to my problem.

Solution

Based on my understanding of the question, I recommend using the following two events on the specified processor:

OFFCORE_RESPONSE.ALL_READS.L3_MISS.LOCAL_DRAM: This includes all cacheable data read and write transactions and all code fetch transactions, whether the transaction is initiated by a instruction (retired or not) or a prefetch or any type. Each event represents exactly a 64-byte read request to the memory controller.
OFFCORE_RESPONSE.ALL_CODE_RD.L3_MISS.LOCAL_DRAM: This includes all the code fetch accesses to the IMC.

(I think both of these event don't occur for uncacheable code fetch requests, but I've not tested this and the documentation is not clear on this.)

The "data accesses" can be measured separately from the "code accesses" by subtracting the second event from the first. These two events can be counted simultaneously on the same logical core on Haswell without multiplexing.

There are of course other transactions that do go to the IMC but are not counted by either of the two mentioned events. These include: (1) L3 writebacks, (2) uncacheable partial reads and writes from cores, (3) full WCB evictions, and (4) memory accesses from IO devices. Depending on the workload, It's not unlikely that accesses of types (1), (3), and (4) may constitute a significant fraction of total accesses to the IMC.

It seems that the sum of LLC-load-misses and LLC-store-misses will return the whole DRAM accesses (equivalently, I could use LLC-misses in Perf).

Note the following:

The event LLC-load-misses is a perf event mapped to the native event OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_MISS.ANY_RESPONSE.
The event LLC-store-misses is mapped to OFFCORE_RESPONSE.DEMAND_RFO.L3_MISS.ANY_RESPONSE.

These are not the events you want because:

The ANY_RESPONSE bit indicates that the event can occur for requests that target any unit, not just the IMC.
These events count L1 data prefetches and page walk requests, but not L2 data prefetches. You'd want to count all prefetches that consume memory bandwdith in general.

For data-only accesses, I used mem_load_uops_retired.l3_miss. It does not include stores, but seems to be OK (because stores seem to be much less frequent?!).

There are a number of issues with using mem_load_uops_retired.l3_miss on Haswell:

There are cases where this event is unreliable, so it should be avoided if there are alternatives. Otherwise, the analysis methodology should take in to account the potential unreliability of this event count.
The event only occurs for requests from retired loads and it omits speculative loads and all stores, which can be significant.
Doing arithmetic with this events and other events in a meaningful way is not easy. For example, your suggestion of doing "LLC-load-misses - mem_load_uops_retired.l3_miss = DRAM Accesses for Code" is incorrect.

What are local_dram and any_response?

Not all requests that miss in the L3 go to the IMC. A typical example is memory-mapped IO requests. You said you only want the core-originated requests that go to the IMC, so local_dram is the right bit.

At first, it seems that, group (C), is a higher resolution version of the load events of group (A). But my tests show that the events in the former group is much more frequent than the latter. For example, in a simple benchmark, the number of offcore_response.all_reads.l3_miss.any_response events were twice as many as LLC-load-misses.

This is normal because offcore_response.all_reads.l3_miss.any_response is inclusive of LLC-load-misses and can easily be significantly larger.

Group (E), pertains to demand reads (i.e., all non-prefetched reads). Does this mean that, e.g.: offcore_response.all_data_rd.l3_miss.any_response - offcore_response.demand_data_rd.l3_miss.any_response = DRAM read accesses caused by prefeching?

No, because:

the any_response bit as explained above,
this subtraction results in only the L2 data load prefetches, not all data load hardware and software prefetches.