perf multiplexing intel-pmu

PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE concurrent monitoring


I'm working on a custom implementation on top of the perf_event_open syscall.

The implementation aims to support a variety of PERF_TYPE_HARDWARE, PERF_TYPE_SOFTWARE and PERF_TYPE_HW_CACHE events for specific threads on any core.

In Intel® 64 and IA-32 Architectures Software Developer’s Manual vol 3B I see the following for my testing CPU (Kaby Lake):

[Image: excerpt from the manual on the number of performance counters; not available]

As I understand it so far, a (theoretically) unlimited number of PERF_TYPE_SOFTWARE events can be monitored concurrently, but without multiplexing only a limited number of PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events can, since each such event is measured by one of the limited number of counters in the CPU's PMU (as shown in the manual excerpt above).

So for a quad-core Kaby Lake CPU with HyperThreading enabled I assume that up to 4 PERF_TYPE_HARDWARE/PERF_TYPE_HW_CACHE events can be monitored concurrently (or up to 8 if only 4 threads are used).

Experimenting with the above assumptions, I found that while I can successfully monitor up to 4 PERF_TYPE_HARDWARE events (for 8 threads), this is not the case for PERF_TYPE_HW_CACHE events, where only up to 2 events can be monitored concurrently!

I also tried using only 4 threads, but the upper limit of concurrently monitored PERF_TYPE_HARDWARE events remains 4. The same happens with HyperThreading disabled!

One could ask: why do you need to avoid multiplexing? First, the implementation needs to be as accurate as possible, which means avoiding the potential blind spots of multiplexing. Second, when the "upper limit" is exceeded, all event values are 0...

The PERF_TYPE_HW_CACHE events I'm targeting are:

CACHE_LLC_READ(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value  | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_READ.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_ACCESS.value << 16),
CACHE_LLC_WRITE(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value  | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_WRITE.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_ACCESS.value << 16),
CACHE_LLC_READ_MISS(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value  | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_READ.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_MISS.value << 16),
CACHE_LLC_WRITE_MISS(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value  | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_WRITE.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_MISS.value << 16),

All of them are built with the documented formula:

(perf_hw_cache_id) | (perf_hw_cache_op_id << 8) |
(perf_hw_cache_op_result_id << 16)

and are opened as a single group (the first event is the group leader, and so on).
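
For reference, the same packing can be sketched in C. This is only an illustration of the formula above, not the implementation described in this question: `hw_cache_config` and `llc_attr` are hypothetical helper names, and the choice of `read_format` flags is an assumption made so that a group read also reports how long the group was actually scheduled.

```c
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/* Pack a PERF_TYPE_HW_CACHE config value: cache id in bits 0-7,
 * operation id in bits 8-15, result id in bits 16-23. */
uint64_t hw_cache_config(unsigned cache_id, unsigned op_id, unsigned result_id)
{
    return (uint64_t)cache_id |
           ((uint64_t)op_id << 8) |
           ((uint64_t)result_id << 16);
}

/* Fill the attr for one LLC event of the group (hypothetical helper).
 * The caller would pass it to perf_event_open with group_fd = -1 for
 * the leader and the leader's fd for every other member. */
struct perf_event_attr llc_attr(unsigned op_id, unsigned result_id, int is_leader)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = hw_cache_config(PERF_COUNT_HW_CACHE_LL, op_id, result_id);
    /* Only the leader starts disabled; members follow the leader. */
    attr.disabled = is_leader ? 1 : 0;
    /* Assumption: report enabled/running times so that a group which
     * was never scheduled can be told apart from a true zero count. */
    attr.read_format = PERF_FORMAT_GROUP |
                       PERF_FORMAT_TOTAL_TIME_ENABLED |
                       PERF_FORMAT_TOTAL_TIME_RUNNING;
    return attr;
}
```

With these flags, a group that could not be placed on the PMU reads back a running time of 0 alongside its zero counts.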

So, my questions are the following:

  1. Which counters of the PMU are used for PERF_TYPE_HARDWARE and which for PERF_TYPE_HW_CACHE events and where can I find this information?
  2. What is the difference between the PERF_TYPE_HARDWARE pre-defined events (such as PERF_COUNT_HW_CACHE_MISSES) and the PERF_TYPE_HW_CACHE events?
  3. Any suggestions on how to monitor all the listed PERF_TYPE_HW_CACHE events without multiplexing?
  4. Any suggestions on how to monitor up to 8 PERF_TYPE_HARDWARE and/or PERF_TYPE_HW_CACHE events without multiplexing?

Thanks in advance!


Solution

    1. PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events are mapped to two sets of registers involved in performance monitoring. The first set of MSRs is called IA32_PERFEVTSELx, where x ranges from 0 to N-1 and N is the total number of general-purpose counters available. PERFEVTSEL is short for "performance event select"; these registers specify the conditions under which an event is counted. The second set of MSRs is called IA32_PMCx, where x ranges over the same values as for PERFEVTSEL. The PMC registers store the counts of performance monitoring events. Each PERFEVTSEL register is paired with a corresponding PMC register.

    The mapping happens as follows:

    During initialization of the architecture-specific portion of the kernel, a PMU for measuring hardware-specific events is registered here with type PERF_TYPE_RAW. All PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events are mapped to PERF_TYPE_RAW events in order to identify the PMU, as can be seen here.

    if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE)
            type = PERF_TYPE_RAW;
    

    The same architecture specific initialization is responsible for setting up the addresses of the first/base registers of each of the aforementioned sets of performance monitoring event registers, here

        .eventsel       = MSR_ARCH_PERFMON_EVENTSEL0,
        .perfctr        = MSR_ARCH_PERFMON_PERFCTR0,
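
To make the pairing concrete, here is a minimal sketch of how the per-counter MSR addresses follow from those two base registers on Intel. The numeric bases (0x186 and 0xC1) are the addresses the Intel SDM gives for IA32_PERFEVTSEL0 and IA32_PMC0; the function and macro names are made up for illustration and are not kernel symbols. The last helper shows the low 16 bits of an event-select value (event code in bits 0-7, unit mask in bits 8-15), which is also the layout of a raw perf config on Intel.

```c
#include <stdint.h>

/* Base addresses per the Intel SDM; the kernel refers to them as
 * MSR_ARCH_PERFMON_EVENTSEL0 and MSR_ARCH_PERFMON_PERFCTR0. */
#define EVENTSEL0_ADDR 0x186u
#define PERFCTR0_ADDR  0x0c1u

/* Address of IA32_PERFEVTSELx for general-purpose counter x. */
uint32_t eventsel_addr(unsigned x)
{
    return EVENTSEL0_ADDR + x;
}

/* Address of the paired IA32_PMCx register. */
uint32_t perfctr_addr(unsigned x)
{
    return PERFCTR0_ADDR + x;
}

/* Event code in bits 0-7, unit mask in bits 8-15. */
uint64_t raw_event_config(uint8_t event, uint8_t umask)
{
    return (uint64_t)event | ((uint64_t)umask << 8);
}
```

For example, the architectural "LLC misses" event (event 0x2E, umask 0x41) gives a raw config of 0x412e.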
    

    The event_init function of the identified PMU is responsible for setting up and "reserving" the two sets of performance monitoring registers, as well as checking event constraints and so on, here. The reservation happens here.

    for (i = 0; i < x86_pmu.num_counters; i++) {
        if (!reserve_perfctr_nmi(x86_pmu_event_addr(i)))
            goto perfctr_fail;
    }

    for (i = 0; i < x86_pmu.num_counters; i++) {
        if (!reserve_evntsel_nmi(x86_pmu_config_addr(i)))
            goto eventsel_fail;
    }
    

    The value of num_counters is the number of general-purpose counters reported by the CPUID instruction.

    In addition to this, there are a couple of extra registers that monitor offcore events (e.g., the LLC-specific events).

    In later versions of architectural performance monitoring, some of the hardware events are measured with the help of fixed-purpose registers, as seen here. These are the fixed-purpose registers:
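
As a rough illustration of where that number comes from, the sketch below decodes the architectural performance monitoring leaf of CPUID (leaf 0x0A) as laid out in the Intel SDM. The struct and function names are hypothetical, and the decoding is kept separate from the CPUID instruction itself so that it can be followed without an x86 machine.

```c
#include <stdint.h>

/* Fields of CPUID leaf 0x0A, per the Intel SDM. */
struct perfmon_info {
    unsigned version;        /* EAX[7:0]: architectural perfmon version */
    unsigned gp_counters;    /* EAX[15:8]: GP counters per logical CPU  */
    unsigned gp_width;       /* EAX[23:16]: bit width of GP counters    */
    unsigned fixed_counters; /* EDX[4:0]: fixed counters (version > 1)  */
};

/* Decode the EAX and EDX values returned by CPUID leaf 0x0A. */
struct perfmon_info decode_cpuid_0a(uint32_t eax, uint32_t edx)
{
    struct perfmon_info info;

    info.version = eax & 0xff;
    info.gp_counters = (eax >> 8) & 0xff;
    info.gp_width = (eax >> 16) & 0xff;
    /* The fixed-counter field is only defined from version 2 onwards. */
    info.fixed_counters = (info.version > 1) ? (edx & 0x1f) : 0;
    return info;
}
```

On x86 with GCC or Clang, the inputs can be obtained with `__get_cpuid(0x0a, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`; the `gp_counters` field is what ends up in num_counters.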

    #define MSR_ARCH_PERFMON_FIXED_CTR0 0x309
    #define MSR_ARCH_PERFMON_FIXED_CTR1 0x30a
    #define MSR_ARCH_PERFMON_FIXED_CTR2 0x30b
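
As a small sketch of how the generic events relate to these registers, the lookup below assumes the usual Intel assignment (fixed counter 0 counts retired instructions, counter 1 core cycles, counter 2 reference cycles). The function name is hypothetical, and the kernel's scheduler may still place instructions or cycles on a general-purpose counter instead.

```c
#include <stdint.h>
#include <linux/perf_event.h>

/* Return the MSR address of the fixed-function counter that can count
 * the given PERF_TYPE_HARDWARE event, or 0 if the event needs a
 * general-purpose counter. Assumes the usual Intel assignment. */
uint32_t fixed_ctr_for(uint64_t hw_event)
{
    switch (hw_event) {
    case PERF_COUNT_HW_INSTRUCTIONS:
        return 0x309; /* fixed counter 0: instructions retired */
    case PERF_COUNT_HW_CPU_CYCLES:
        return 0x30a; /* fixed counter 1: unhalted core cycles */
    case PERF_COUNT_HW_REF_CPU_CYCLES:
        return 0x30b; /* fixed counter 2: unhalted reference cycles */
    default:
        return 0;     /* no fixed counter for this event */
    }
}
```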
    
    2. The pre-defined PERF_TYPE_HARDWARE events are all architectural performance monitoring events: the behavior of each one is expected to be consistent on all processors that support it. All of the PERF_TYPE_HW_CACHE events are non-architectural, which means they are model-specific and may vary from one processor family to another.

    3. On the Intel Kaby Lake machine that I have, a total of 20 PERF_TYPE_HW_CACHE events are pre-defined. The event constraints ensure that the 3 fixed-function counters available are mapped to 3 architectural PERF_TYPE_HARDWARE events. Only one specific event can be measured on each fixed-function counter, so we can leave them out of this analysis. The other constraint is that only two events targeting the LLC can be measured at the same time, since there are only two OFFCORE RESPONSE registers. Also, the NMI watchdog may pin an event to one of the general-purpose counters. If the NMI watchdog is disabled, we are left with 4 general-purpose counters.

    Given these constraints and the limited number of counters available, there is simply no way to avoid multiplexing if all 20 hardware cache events are measured at the same time. Some workarounds for measuring all the events without incurring multiplexing and its errors are:

    3.1. Split the PERF_TYPE_HW_CACHE events into groups of 4, so that the 4 events of a group can be scheduled on the 4 general-purpose counters at the same time. Make sure there are no more than 2 LLC events in a group. Run the same profile once per group and obtain the counts for each group separately.

    3.2. If all the PERF_TYPE_HW_CACHE events have to be monitored at the same time, some of the multiplexing error can be reduced by decreasing the value of perf_event_mux_interval_ms. It can be configured via the sysfs entry /sys/devices/cpu/perf_event_mux_interval_ms. This value cannot be lowered beyond a certain point, as can be seen here.
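
The grouping in 3.1 can be sketched as a simple first-fit assignment. This is only one possible way to split the events, under the two constraints stated above (4 general-purpose counters, at most 2 LLC events per group); `plan_groups`, the macros and the `MAX_EVENTS` cap are all made up for this illustration.

```c
#include <stddef.h>

#define GP_COUNTERS       4  /* general-purpose counters, as stated above */
#define MAX_LLC_PER_GROUP 2  /* offcore response registers, as stated above */
#define MAX_EVENTS        64 /* arbitrary cap for this sketch */

/* Assign each event to a group using first fit.
 * is_llc[i] is non-zero if event i targets the LLC.
 * group_of[i] receives the group index chosen for event i.
 * Returns the number of groups, or 0 if n exceeds MAX_EVENTS. */
size_t plan_groups(const int *is_llc, size_t n, size_t *group_of)
{
    size_t total[MAX_EVENTS] = {0}; /* events placed in each group     */
    size_t llc[MAX_EVENTS] = {0};   /* LLC events placed in each group */
    size_t groups = 0;

    if (n > MAX_EVENTS)
        return 0;

    for (size_t i = 0; i < n; i++) {
        size_t g = 0;

        /* Skip groups that have no free counter or, for an LLC event,
         * no free offcore response register. */
        while (g < groups &&
               (total[g] == GP_COUNTERS ||
                (is_llc[i] && llc[g] == MAX_LLC_PER_GROUP)))
            g++;

        if (g == groups)
            groups++; /* open a new group */

        total[g]++;
        if (is_llc[i])
            llc[g]++;
        group_of[i] = g;
    }
    return groups;
}
```

For the four LLC events from the question, this yields two groups of two events each, and each group would then be measured in its own run of the same workload.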

    4. Monitoring up to 8 hardware or hardware-cache events would require hyperthreading to be disabled. Note that the number of general-purpose counters available is retrieved using the CPUID instruction, and the counters are set up during the architecture-initialization portion of kernel startup via an early_initcall function. This can be seen here. Once the initialization is done, the kernel assumes that only 4 counters are available, and any later change in hyperthreading state makes no difference.