x86 intel performancecounter cpu-cache intel-pmu

Why are the user-mode L1 store miss events only counted when there is a store initialization loop?


Summary

Consider the following loop:

loop:
movl   $0x1,(%rax)
add    $0x40,%rax
cmp    %rdx,%rax
jne    loop

where rax is initialized to the address of a buffer that is larger than the L3 cache size. Every iteration performs a store operation to the next cache line. I expect the number of RFO requests sent from the L1D to the L2 to be more or less equal to the number of cache lines accessed. The problem is that this seems to be the case only when I count kernel-mode events, even though the program runs in user mode, except in one case that I discuss below. The way the buffer is allocated does not seem to matter (.bss, .data, or from the heap).
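For readers who want to reproduce the access pattern without writing assembly, here is a minimal C sketch of the same loop. The names `touch_lines` and `run_store_loop` are hypothetical helpers introduced only for illustration, and the heap allocation is just one of the allocation methods mentioned above; a compiler may not emit exactly the four instructions shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* matches the 0x40 stride in the assembly loop */

/* Store 1 to the first 4 bytes of every full cache line in [buf, buf + bytes).
 * Returns the number of cache lines written. */
size_t touch_lines(volatile uint8_t *buf, size_t bytes)
{
    size_t lines = 0;
    for (size_t off = 0; off + CACHE_LINE <= bytes; off += CACHE_LINE) {
        *(volatile uint32_t *)(buf + off) = 1;   /* movl $0x1,(%rax) */
        lines++;
    }
    return lines;
}

/* Hypothetical driver: allocate `bytes` from the heap and run the loop once. */
size_t run_store_loop(size_t bytes)
{
    uint8_t *buf = malloc(bytes);
    assert(buf != NULL);
    size_t lines = touch_lines(buf, bytes);
    free(buf);
    return lines;
}
```

The returned line count is the denominator one would use to normalize the event counts, as in the tables discussed below.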

Details

The results of my experiments are shown in the tables below. All of the experiments are performed on processors with hyperthreading disabled and all hardware prefetchers enabled.

I've tested the following three cases:

The following table shows the results on an Intel CFL processor. These experiments have been performed on Linux kernel version 4.4.0.

[Image: table of results for the Intel CFL processor]

The following table shows the results on an Intel HSW processor. Note that the events L2_RQSTS.PF_HIT, L2_RQSTS.PF_MISS, and OFFCORE_REQUESTS.ALL_REQUESTS are not documented for HSW. These experiments have been performed on Linux kernel version 4.15.

[Image: table of results for the Intel HSW processor]

The first column of each table contains the names of the performance monitoring events whose counts are shown in the other columns. In the column labels, the letters U and K represent user-mode and kernel-mode events, respectively. For the cases that have two loops, the numbers 1 and 2 refer to the initialization loop and the main loop, respectively. For example, LoadInit-1K represents the kernel-mode counts for the initialization loop of the LoadInit case.

The values shown in the tables are normalized by the number of cache lines. They are also color-coded: the darker the green, the larger the value relative to all other cells in the same table. However, the last three rows of the CFL table and the last two rows of the HSW table are not color-coded because some of the values in these rows are too large. These rows are painted dark gray to indicate that they are not color-coded like the other rows.

I expect the number of user-mode L2_RQSTS.ALL_RFO events to be equal to the number of cache lines accessed (i.e., a normalized value of 1). This event is described in the manual as follows:

Counts the total number of RFO (read for ownership) requests to L2 cache. L2 RFO requests include both L1D demand RFO misses as well as L1D RFO prefetches.

It says that L2_RQSTS.ALL_RFO may count not only demand RFO requests from the L1D but also L1D RFO prefetches. However, I've observed that on both processors the event count is not affected by whether the L1D prefetchers are enabled or disabled. But even if the L1D prefetchers may generate RFO prefetches, the event count should then be at least as large as the number of cache lines accessed. As can be seen from both tables, this is only the case in StoreInit-2U. The same observation applies to all of the events shown in the tables.

However, the kernel-mode counts of the events are about equal to what the user-mode counts are expected to be. This is in contrast to, for example, MEM_INST_RETIRED.ALL_STORES (or MEM_UOPS_RETIRED.ALL_STORES on HSW), which works as expected.

Due to the limited number of PMU counter registers, I had to divide all the experiments into four parts. In particular, the kernel-mode counts are produced from different runs than the user-mode counts. It doesn't really matter which events are counted in the same run. I mention this because it explains why some user-mode counts are a little larger than the kernel-mode counts of the same events.

The events shown in dark gray seem to overcount. The specification updates for the 4th gen and 8th gen Intel processors do mention (errata HSD61 and 111, respectively) that OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO may overcount. But these results indicate that it may overcount by a factor of many, not by just a couple of events.

There are other interesting observations, but they are not pertinent to the question, which is: why are the RFO counts not as expected?


Solution

  • You didn't flag your OS, but let's assume you are using Linux. This stuff would be different on another OS (and perhaps even within various variants of the same OS).

    On a read access to an unmapped page, the kernel page fault handler maps in a system-wide shared zero page, with read-only permissions.

    This explains columns LoadInit-1U|K: even though your init loop strides over a virtual area of 64 MB performing loads, only a single physical 4 KiB page filled with zeros is mapped, so you get approximately zero cache misses after the first 4 KiB, which rounds to zero after your normalization. [1]
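The LoadInit access pattern can be sketched in C as below. This is only an illustration of the pattern, assuming Linux and an anonymous private mapping; `read_untouched` is a hypothetical name. The code cannot itself prove which physical page backs each read, but it does show the one property that is guaranteed: untouched anonymous memory reads back as zeros.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Map `bytes` of anonymous memory and read one byte from each cache line
 * without ever writing to it. Returns the sum of the bytes read. */
long read_untouched(size_t bytes)
{
    uint8_t *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(buf != MAP_FAILED);

    long sum = 0;
    for (size_t off = 0; off < bytes; off += 64)
        sum += ((volatile uint8_t *)buf)[off];   /* read fault on first touch of each page */

    munmap(buf, bytes);
    return sum;
}
```

Running this loop under the same counters as the question's LoadInit case is one way to check whether the loads really stay within a single physical page on a given kernel.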

    On a write access to an unmapped page, or to the read-only shared zero page, the kernel will map a new unique page on behalf of the process. This new page is guaranteed to be zeroed, so unless the kernel has some known-to-be-zero pages hanging around, this involves zeroing the page (effectively memset(new_page, 0, 4096)) prior to mapping it.

    That largely explains the remaining columns except for StoreInit-2U|K. In those cases, even though it seems like the user program is doing all the stores, the kernel ends up doing all of the hard work (except for one store per page): as the user process faults in each page, the kernel writes zeros to it, which has the side effect of bringing every line of that page into the L1 cache. When the fault handler returns, the triggering store and all subsequent stores to that page will hit in the L1 cache.
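One way to see the per-page faulting described above is to count minor page faults around the store loop. The following is a rough sketch assuming Linux; `store_and_count_faults` and `struct fault_report` are hypothetical names. With 4 KiB pages one would expect roughly one minor fault per 64 stores, but the exact number depends on the kernel configuration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/resource.h>

struct fault_report {
    size_t lines_written;  /* stores issued by user code, one per cache line */
    long   minor_faults;   /* minor page faults taken while the loop ran */
};

/* Store to every cache line of a fresh anonymous mapping and report how
 * many minor faults the process took during the loop. */
struct fault_report store_and_count_faults(size_t bytes)
{
    struct rusage before, after;
    struct fault_report r = {0, 0};

    uint8_t *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(buf != MAP_FAILED);

    getrusage(RUSAGE_SELF, &before);
    for (size_t off = 0; off + 64 <= bytes; off += 64) {
        *(volatile uint32_t *)(buf + off) = 1;   /* first store to a page faults */
        r.lines_written++;
    }
    getrusage(RUSAGE_SELF, &after);

    r.minor_faults = after.ru_minflt - before.ru_minflt;
    munmap(buf, bytes);
    return r;
}
```

Comparing `minor_faults` with `lines_written / 64` gives a quick indication of whether the kernel is faulting in 4 KiB pages one at a time.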

    It still doesn't fully explain StoreInit-2. As clarified in the comments, the K column actually includes the user counts, which explains that column (subtracting out the user counts leaves it at roughly zero for every event, as expected). The remaining confusion is why L2_RQSTS.ALL_RFO is not 1 but some smaller value like 0.53 or 0.68. Maybe the event is undercounting, or there is some micro-architectural effect that we're missing, like a type of prefetch that prevents the RFO (for example, if the line is loaded into the L1 by some type of load operation before the store, the RFO won't occur). You could try to include the other L2_RQSTS events to see if the missing events show up there.
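Since the K column turned out to include user counts, it may help to show how the privilege-level filter is expressed when counters are programmed through Linux's perf_event_open interface. This is a sketch under the assumption that this interface is used (the question does not say which tool collected the counts); `make_raw_event_attr`, `open_counter`, and `enum count_mode` are hypothetical names, and the raw event encoding must be looked up in Intel's event tables for the specific microarchitecture.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

enum count_mode { COUNT_USER_ONLY, COUNT_KERNEL_ONLY, COUNT_BOTH };

/* Fill a perf_event_attr for a raw PMU event restricted to the requested
 * privilege level(s). raw_config is (umask << 8) | event_select. */
void make_raw_event_attr(struct perf_event_attr *attr, uint64_t raw_config,
                         enum count_mode mode)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_RAW;
    attr->size = sizeof(*attr);
    attr->config = raw_config;
    attr->disabled = 1;                                  /* enable explicitly later */
    attr->exclude_hv = 1;
    attr->exclude_kernel = (mode == COUNT_USER_ONLY);    /* user-only: drop ring 0 */
    attr->exclude_user = (mode == COUNT_KERNEL_ONLY);    /* kernel-only: drop ring 3 */
}

/* Open the counter for the calling thread on any CPU; returns an fd or -1. */
int open_counter(struct perf_event_attr *attr)
{
    return (int)syscall(SYS_perf_event_open, attr, 0, -1, -1, 0UL);
}
```

A kernel-only column that still tracks the user counts suggests the equivalent of `exclude_user` was not set, which is consistent with the clarification in the comments.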

    Variations

    It doesn't need to be like that on all systems. Certainly other OSes may have different strategies, but even Linux on x86 might behave differently based on various factors.

    For example, rather than the 4K zero page, you might get allocated a 2 MiB huge zero page. That would change the benchmark since 2 MiB doesn't fit in L1, so the LoadInit tests will probably show misses in user-space on the first and second loops.

    More generally, if you were using huge pages, the page fault granularity would be changed from 4 KiB to 2 MiB, meaning that only a small part of the zeroed page would remain in L1 and L2, so you'd get L1 and L2 misses, as you expected. If your kernel ever implements fault-around for anonymous mappings (or whatever mapping you are using), it could have a similar effect.
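To try the huge-page variation, one option on Linux is to request transparent huge pages for the buffer with madvise. The sketch below assumes 2 MiB huge pages and a kernel built with THP support; `alloc_thp_candidate` is a hypothetical helper, and the kernel is free to ignore the hint.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HUGE_2M ((size_t)2 << 20)

/* Reserve `bytes` (rounded up to a multiple of 2 MiB) of anonymous memory
 * starting on a 2 MiB boundary and ask for transparent huge pages.
 * *thp_hint_ok is set to 1 if madvise() accepted the hint, else 0. */
uint8_t *alloc_thp_candidate(size_t bytes, int *thp_hint_ok)
{
    size_t len = (bytes + HUGE_2M - 1) & ~(HUGE_2M - 1);

    /* Over-allocate by 2 MiB so an aligned window of `len` bytes always exists. */
    uint8_t *raw = mmap(NULL, len + HUGE_2M, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(raw != MAP_FAILED);

    uintptr_t aligned = ((uintptr_t)raw + HUGE_2M - 1) & ~(uintptr_t)(HUGE_2M - 1);
    uint8_t *buf = (uint8_t *)aligned;

    /* Only a hint: it fails (for example with EINVAL) when THP is unavailable. */
    *thp_hint_ok = (madvise(buf, len, MADV_HUGEPAGE) == 0);
    return buf;  /* the unaligned slack is deliberately left mapped in this sketch */
}
```

If the hint takes effect, each fault zeroes 2 MiB at once, so most of the page should no longer be in L1 when the user-mode stores reach it.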

    Another possibility is that the kernel may zero pages in the background and so have zero pages ready. This would remove the K counts from the tests, since the zeroing doesn't happen during the page fault, and would probably add the expected misses to the user counts. I'm not sure if the Linux kernel ever did this or has the option to do it, but there were patches floating around. Other OSes like BSD have done it.

    RFO Prefetchers

    About "RFO prefetchers": they are not really prefetchers in the usual sense, and they are unrelated to the L1D prefetchers that can be turned off. As far as I know, "RFO prefetching" from the L1D simply refers to sending an RFO request either (a) for a store when its address is calculated (i.e., when the store-address uop executes) but before it retires, or (b) for stores in the store buffer which are nearing but have not reached the head of the store buffer.

    Obviously when a store gets to the head of the buffer, it's time to send an RFO, and you wouldn't call that a prefetch - but why not send some requests for the second-from-the-head store too, and so on (case b)? Or why not check the L1D as soon as the store address is known (as a load would) and then issue a speculative RFO prefetch if it misses? These may be known as RFO prefetches, but they differ from a normal prefetch in that the core knows the address that has been requested: it is not a guess.

    There is speculation in the sense that getting additional lines other than the current head may be wasted work if another core sends an RFO for that line before the core has a chance to write to it: the request was useless in that case and just increased coherency traffic. So there are predictors that may reduce this store buffer prefetch if it fails too often. There may also be speculation in the sense that store buffer prefetch may send requests for junior stores which haven't retired, at the cost of a useless request if the store ends up being on a bad path. I'm not actually sure if current implementations do that.


    [1] This behavior actually depends on the details of the L1 cache: current Intel VIPT implementations allow multiple virtual aliases of the same single line to all live happily in L1. Current AMD Zen implementations use a different implementation (micro-tags) which doesn't allow the L1 to logically contain multiple virtual aliases, so I would expect Zen to miss to L2 in this case.