linux linux-kernel cpu-cache perf intel-pmu

only 2 PERF_TYPE_HW_CACHE events in perf event group

While working on a custom implementation on top of perf_event_open, I need to monitor multiple PERF_TYPE_HW_CACHE events concurrently.

The Intel manual states that there are 4 programmable counters per thread (or 8 if HyperThreading is disabled) for my CPU's architecture. So I grouped the PERF_TYPE_HW_CACHE events of choice into 1 perf event group (LLC_GROUP) containing 4 PERF_TYPE_HW_CACHE events.
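For reference, here is a minimal sketch of how the attributes for such LLC events could be filled in. The config encoding (cache id, operation shifted by 8, result shifted by 16) is the one documented in perf_event_open(2); the helper names hw_cache_config and init_llc_attr are hypothetical and are not taken from my actual implementation.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Encode a PERF_TYPE_HW_CACHE config as described in perf_event_open(2):
 * cache id | (operation << 8) | (result << 16). */
static uint64_t hw_cache_config(uint64_t cache_id, uint64_t op, uint64_t result)
{
    return cache_id | (op << 8) | (result << 16);
}

/* Hypothetical helper: prepare the attr of one LLC event.
 * Only the group leader starts disabled, so the whole group can be
 * enabled in one step later on. */
static void init_llc_attr(struct perf_event_attr *attr,
                          uint64_t op, uint64_t result, int is_leader)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HW_CACHE;
    attr->size = sizeof(*attr);
    attr->config = hw_cache_config(PERF_COUNT_HW_CACHE_LL, op, result);
    attr->disabled = is_leader ? 1 : 0;
    /* Ask for the enabled/running times that reveal multiplexing. */
    attr->read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
                        PERF_FORMAT_TOTAL_TIME_RUNNING;
}
```

Each attr would then be passed to perf_event_open, with group_fd set to -1 for the leader and to the leader's file descriptor for the other members.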

I ran a first experiment and got the following results:

LLC_GROUP of thread 2 | time Enabled: 3190370379, time Running: 3017
HW_CACHE_LLC_READ_MISSES = 0
HW_CACHE_LLC_WRITE_MISSES = 0
HW_CACHE_LLC_READS = 0
HW_CACHE_LLC_WRITES = 0

From the above results, it is clear that the PMU does not "fit" all 4 events. We also observe a "strange" multiplexing without actual results: the group is enabled for the whole run but is almost never running, and every count is 0.

So, as a next step, I split the 4-event group into 2 groups of 2 events each (LLC_GROUP, LLC2_GROUP) and got the following results:

LLC_GROUP of thread 2 | time Enabled: 2772569406, time Running: 1396022331
HW_CACHE_LLC_READ_MISSES = 102117
HW_CACHE_LLC_WRITE_MISSES = 9624295
LLC2_GROUP of thread 2 | time Enabled: 2772571024, time Running: 1376575096
HW_CACHE_LLC_READS = 22020658
HW_CACHE_LLC_WRITES = 18156060

With this configuration, we again observe that the PMU doesn't "fit" 4 PERF_TYPE_HW_CACHE events concurrently, but this time the (expected) multiplexing is happening.
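When a group is multiplexed like this, the usual way to estimate the count over the whole interval is raw * time_enabled / time_running. Below is a minimal sketch of that scaling, assuming the values printed above are raw (unscaled) counts; scale_count is a hypothetical helper name.

```c
#include <stdint.h>

/* Estimate the full-interval count of a multiplexed event as
 * raw * time_enabled / time_running, rounded to the nearest integer.
 * Returns 0 when the event was never scheduled on the PMU. */
static uint64_t scale_count(uint64_t raw, uint64_t time_enabled,
                            uint64_t time_running)
{
    if (time_running == 0)
        return 0;
    /* Floating point avoids overflowing 64 bits in raw * time_enabled. */
    long double scaled = (long double)raw * (long double)time_enabled /
                         (long double)time_running;
    return (uint64_t)(scaled + 0.5L);
}
```

For LLC_GROUP above, time Running is about half of time Enabled, so each scaled estimate comes out at roughly twice the raw value.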

Does anyone have an explanation?

This behaviour looks very strange to me, since I'm able to monitor multiple PERF_TYPE_HARDWARE events (up to 6) without multiplexing, and I would expect the same for PERF_TYPE_HW_CACHE events as well.

Solution

Note that perf does allow measuring more than 2 PERF_TYPE_HW_CACHE events at the same time; the exception is the measurement of LLC-cache events.

The expectation is that, with 4 general-purpose and 3 fixed-purpose hardware counters and hyper-threading ON, perf can measure 4 HW cache events (which default to RAW events) without multiplexing:

sudo perf stat -e L1-icache-load-misses,L1-dcache-stores,L1-dcache-load-misses,dTLB-load-misses sleep 2

 Performance counter stats for 'sleep 2':

            26,893      L1-icache-load-misses                                       
            98,999      L1-dcache-stores                                            
            14,037      L1-dcache-load-misses                                       
               723      dTLB-load-misses                                            

       2.001732771 seconds time elapsed

       0.001217000 seconds user
       0.000000000 seconds sys

The problem appears when you try to measure events targeting the LLC-cache. perf seems able to measure only 2 LLC-cache specific events concurrently without multiplexing:

sudo perf stat -e LLC-load-misses,LLC-stores,LLC-store-misses,LLC-loads sleep 2

 Performance counter stats for 'sleep 2':

             2,419      LLC-load-misses           #    0.00% of all LL-cache hits   
             2,963      LLC-stores                                                  
     <not counted>      LLC-store-misses                                              (0.00%)
     <not counted>      LLC-loads                                                     (0.00%)

       2.001486710 seconds time elapsed

       0.001137000 seconds user
       0.000000000 seconds sys

CPUs belonging to the Skylake/Kaby Lake family of microarchitectures, and some others, allow you to measure OFFCORE_RESPONSE events. Monitoring OFFCORE_RESPONSE events requires programming extra MSRs, specifically MSR_OFFCORE_RSP0 (MSR address 1A6H) and MSR_OFFCORE_RSP1 (MSR address 1A7H), in addition to programming the pair of IA32_PERFEVTSELx and IA32_PMCx registers.

Each pair of IA32_PERFEVTSELx and IA32_PMCx registers will be associated with one of the above MSRs to measure LLC-cache events.

The kernel defines the OFFCORE_RESPONSE MSRs for Skylake in its table of extra registers:

static struct extra_reg intel_skl_extra_regs[] __read_mostly = {
    INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffff8fffull, RSP_0),
    INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff8fffull, RSP_1),
    ........
}

In the INTEL_UEVENT_EXTRA_REG call, 0x01b7 refers to event code 0xb7 with umask 0x01. The LLC-cache events all map to this event, 0x01b7, as the kernel's cache event table shows:

[ C(LL  ) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = 0x1b7,   /* OFFCORE_RESPONSE */
        [ C(RESULT_MISS)   ] = 0x1b7,   /* OFFCORE_RESPONSE */
    },
    [ C(OP_WRITE) ] = {
        [ C(RESULT_ACCESS) ] = 0x1b7,   /* OFFCORE_RESPONSE */
        [ C(RESULT_MISS)   ] = 0x1b7,   /* OFFCORE_RESPONSE */
    },
    [ C(OP_PREFETCH) ] = {
        [ C(RESULT_ACCESS) ] = 0x0,
        [ C(RESULT_MISS)   ] = 0x0,
    },
 },
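The split of 0x01b7 into an event code and a umask can be checked with a small sketch. It assumes the usual Intel core PMU raw config layout, where bits 0-7 hold the event select and bits 8-15 hold the unit mask; the helper names are hypothetical.

```c
#include <stdint.h>

/* Bits 0-7 of a raw Intel core PMU config: the event select. */
static uint8_t raw_event_code(uint64_t config)
{
    return (uint8_t)(config & 0xff);
}

/* Bits 8-15 of a raw Intel core PMU config: the unit mask. */
static uint8_t raw_umask(uint64_t config)
{
    return (uint8_t)((config >> 8) & 0xff);
}
```

With these helpers, 0x01b7 decodes to event code 0xb7 and umask 0x01, and 0x01bb (the entry tied to MSR_OFFCORE_RSP_1) decodes to event code 0xbb and umask 0x01.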

The event 0x01b7 always maps to MSR_OFFCORE_RSP_0 initially. The kernel function that sets this up loops through the array of all the "extra registers" and associates event->config (which contains the raw event ID) with the offcore response MSR.

This would suggest that only one such event can be measured at a time, since an LLC-cache event can only be mapped to one MSR, MSR_OFFCORE_RSP_0. But that is not the case!

The offcore registers are symmetric, so when the first MSR, MSR_OFFCORE_RSP_0, is busy, perf uses the alternative MSR, MSR_OFFCORE_RSP_1, to measure another offcore LLC event. The function intel_alt_er makes that switch:

static int intel_alt_er(int idx, u64 config)
{
    int alt_idx = idx;

    if (!(x86_pmu.flags & PMU_FL_HAS_RSP_1))
        return idx;

    if (idx == EXTRA_REG_RSP_0)
        alt_idx = EXTRA_REG_RSP_1;

    if (idx == EXTRA_REG_RSP_1)
        alt_idx = EXTRA_REG_RSP_0;

    if (config & ~x86_pmu.extra_regs[alt_idx].valid_mask)
        return idx;

    return alt_idx;
}

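Having only 2 offcore response registers on the Kaby Lake family of microarchitectures prevents perf from measuring more than 2 LLC-cache events concurrently without multiplexing. This also explains the results in the question: an event group is scheduled on the PMU as a unit, so a group of 4 LLC events can never be placed and reads 0, whereas groups of 2 events fit and are multiplexed against each other.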
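The effect of having just two offcore response MSRs can be illustrated with a toy user-space model. This is only a sketch loosely modelled on the fallback described above, not kernel code: the names rsp_slot, claim_rsp and schedule_events are hypothetical, the values passed in are placeholders rather than real MSR encodings, and letting two events with identical programming share one register is a simplifying assumption of the model.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { RSP_0 = 0, RSP_1 = 1, NUM_RSP = 2 };

struct rsp_slot {
    bool     busy;   /* is some event already using this MSR? */
    uint64_t value;  /* value the MSR would be programmed with */
};

/* The other register of the symmetric pair. */
static int alt_rsp(int idx)
{
    return idx == RSP_0 ? RSP_1 : RSP_0;
}

/* Try to claim an offcore MSR for an event that needs `value`.
 * Start at RSP_0 and fall back to the alternative when it is busy
 * with a different value. Returns the slot index, or -1 if both
 * registers are taken. */
static int claim_rsp(struct rsp_slot slots[NUM_RSP], uint64_t value)
{
    int idx = RSP_0;

    for (int tries = 0; tries < NUM_RSP; tries++) {
        if (!slots[idx].busy) {
            slots[idx].busy = true;
            slots[idx].value = value;
            return idx;
        }
        if (slots[idx].value == value)
            return idx;          /* model assumption: same value is shared */
        idx = alt_rsp(idx);
    }
    return -1;
}

/* Count how many of the n events obtain an offcore MSR at the same time. */
static size_t schedule_events(const uint64_t *values, size_t n)
{
    struct rsp_slot slots[NUM_RSP] = { { false, 0 }, { false, 0 } };
    size_t placed = 0;

    for (size_t i = 0; i < n; i++)
        if (claim_rsp(slots, values[i]) >= 0)
            placed++;
    return placed;
}
```

In this model, four events that each need a different value get only two registers between them, which mirrors the two counted and two not-counted LLC events in the perf stat output above.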