Tags: performance-testing, intel, performancecounter, perf, memory-access

Performance Counters for DRAM Accesses


I want to count the DRAM accesses made by my application and, specifically, to distinguish data accesses from code accesses. The processor is an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell). Based on the Intel Software Developer's Manual, Volume 3, and Perf, I found the following memory-access-related events and categorized them as follows:

(A)
LLC-load-misses                                    [Hardware cache event]
LLC-loads                                          [Hardware cache event]
LLC-store-misses                                   [Hardware cache event]
LLC-stores                                         [Hardware cache event]
=========================================================================
(B)
mem_load_uops_l3_miss_retired.local_dram          
mem_load_uops_retired.l3_miss  
=========================================================================
(C)
offcore_response.all_code_rd.l3_miss.any_response 
offcore_response.all_code_rd.l3_miss.local_dram   
offcore_response.all_data_rd.l3_miss.any_response 
offcore_response.all_data_rd.l3_miss.local_dram   
offcore_response.all_reads.l3_miss.any_response   
offcore_response.all_reads.l3_miss.local_dram     
offcore_response.all_requests.l3_miss.any_response
=========================================================================
(D)
offcore_response.all_rfo.l3_miss.any_response     
offcore_response.all_rfo.l3_miss.local_dram       
=========================================================================
(E)
offcore_response.demand_code_rd.l3_miss.any_response
offcore_response.demand_code_rd.l3_miss.local_dram
offcore_response.demand_data_rd.l3_miss.any_response
offcore_response.demand_data_rd.l3_miss.local_dram
offcore_response.demand_rfo.l3_miss.any_response  
offcore_response.demand_rfo.l3_miss.local_dram    
=========================================================================
(F)
offcore_response.pf_l2_code_rd.l3_miss.any_response
offcore_response.pf_l2_data_rd.l3_miss.any_response
offcore_response.pf_l2_rfo.l3_miss.any_response   
offcore_response.pf_l3_code_rd.l3_miss.any_response
offcore_response.pf_l3_data_rd.l3_miss.any_response
offcore_response.pf_l3_rfo.l3_miss.any_response

My choices are as follows:

Are these choices reasonable?


My other questions: (The 2nd one is the most important)

Group (D) includes DRAM access events caused by Read For Ownership operations (for cache coherency protocols). It seems irrelevant to my problem.

Group (F) counts DRAM reads caused by the L2-cache prefetcher, which is also irrelevant to my problem.


Solution

  • Based on my understanding of the question, I recommend using the following two events on the specified processor:

    (I think neither of these events occurs for uncacheable code fetch requests, but I have not tested this and the documentation is unclear on this point.)

    The "data accesses" can be measured separately from the "code accesses" by subtracting the second event from the first. These two events can be counted simultaneously on the same logical core on Haswell without multiplexing.
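
The subtraction described above can be sketched as a small post-processing step over `perf stat -x,` output. This is a minimal sketch, not a definitive tool. It assumes the CSV field order described in the perf-stat man page (count, unit, event name, run time, percentage of time the counter was running). The two event names are taken from group (C) of the question purely for illustration and are an assumption, so substitute the two events you actually count. The sample counts are made up.

```python
import csv
import io

# Assumption for illustration only: event names picked from group (C) of the
# question. Replace them with the two events you actually measure.
ALL_READS = "offcore_response.all_reads.l3_miss.local_dram"
CODE_READS = "offcore_response.all_code_rd.l3_miss.local_dram"


def parse_perf_csv(text):
    """Map event name -> (count, running_pct) from `perf stat -x,` output.

    Assumed field order: count, unit, event, run time, percent running.
    Rows whose count is not an integer (for example "<not counted>") are skipped.
    """
    results = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 5 or row[0].startswith("#"):
            continue
        try:
            count = int(row[0])
            running_pct = float(row[4])
        except ValueError:
            continue
        results[row[2]] = (count, running_pct)
    return results


def split_code_and_data(results, all_event=ALL_READS, code_event=CODE_READS):
    """Derive data reads as (all reads - code reads).

    Also reports whether either counter ran for less than 100% of the
    measurement time, which indicates that perf multiplexed the events.
    """
    total, total_pct = results[all_event]
    code, code_pct = results[code_event]
    multiplexed = total_pct < 100.0 or code_pct < 100.0
    return {"code": code, "data": total - code, "multiplexed": multiplexed}


# Made-up sample output, only to show the expected shape of the input.
SAMPLE = """\
1000,,offcore_response.all_reads.l3_miss.local_dram,2000000,100.00,,
150,,offcore_response.all_code_rd.l3_miss.local_dram,2000000,100.00,,
<not counted>,,some.other.event,0,0.00,,
"""

if __name__ == "__main__":
    print(split_code_and_data(parse_perf_csv(SAMPLE)))
```

If the `multiplexed` flag comes back true, the counts are scaled estimates rather than exact values, so the difference should be treated with more caution.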

    There are, of course, other transactions that go to the integrated memory controller (IMC) but are not counted by either of the two events. These include: (1) L3 writebacks, (2) uncacheable partial reads and writes from cores, (3) full WCB evictions, and (4) memory accesses from I/O devices. Depending on the workload, accesses of types (1), (3), and (4) may well constitute a significant fraction of the total accesses to the IMC.

    It seems that the sum of LLC-load-misses and LLC-store-misses gives the total number of DRAM accesses (equivalently, I could use LLC-misses in Perf).

    Note the following:

    These are not the events you want because:

    For data-only accesses, I used mem_load_uops_retired.l3_miss. It does not include stores, but that seems acceptable (because stores seem to be much less frequent?).

    There are a number of issues with using mem_load_uops_retired.l3_miss on Haswell:

    What are local_dram and any_response?

    Not all requests that miss in the L3 go to the IMC. A typical example is memory-mapped IO requests. You said you only want the core-originated requests that go to the IMC, so local_dram is the right bit.

    At first, group (C) appears to be a higher-resolution version of the load events in group (A). But my tests show that the events in group (C) are much more frequent than those in group (A). For example, in a simple benchmark, the count of offcore_response.all_reads.l3_miss.any_response was twice that of LLC-load-misses.

    This is normal because offcore_response.all_reads.l3_miss.any_response is inclusive of LLC-load-misses and can easily be significantly larger.

    Group (E) pertains to demand reads (i.e., all non-prefetched reads). Does this mean that, e.g., offcore_response.all_data_rd.l3_miss.any_response - offcore_response.demand_data_rd.l3_miss.any_response = DRAM read accesses caused by prefetching?

    No, because: