x86, cpu-architecture, cpu-cache, micro-architecture, cpu-mds

How do the store buffer and Line Fill Buffer interact with each other?


I was reading the MDS attack paper RIDL: Rogue In-Flight Data Load. They discuss how the Line Fill Buffer can cause leakage of data. There is an existing question, About the RIDL vulnerabilities and the "replaying" of loads, that discusses the micro-architectural details of the exploit.

One thing that isn't clear to me after reading that question is why we need a Line Fill Buffer if we already have a store buffer.

John McCalpin discusses how the store buffer and Line Fill Buffer are connected in How does WC-buffer relate to LFB? on the Intel forums, but that doesn't really make things clearer to me.

For stores to WB space, the store data stays in the store buffer until after the retirement of the stores. Once retired, data can be written to the L1 Data Cache (if the line is present and has write permission), otherwise an LFB is allocated for the store miss. The LFB will eventually receive the "current" copy of the cache line so that it can be installed in the L1 Data Cache and the store data can be written to the cache. Details of merging, buffering, ordering, and "short cuts" are unclear.... One interpretation that is reasonably consistent with the above would be that the LFBs serve as the cacheline-sized buffers in which store data is merged before being sent to the L1 Data Cache. At least I think that makes sense, but I am probably forgetting something....

I've just recently started reading up on out-of-order execution so please excuse my ignorance. Here is my idea of how a store would pass through the store buffer and Line Fill Buffer.

  1. A store instruction gets scheduled in the front-end.
  2. It executes in the store unit.
  3. The store request is put in the store buffer (an address and the data).
  4. An invalidate read request is sent from the store buffer to the cache system.
  5. If it misses the L1d cache, then the request is put in the Line Fill Buffer.
  6. The Line Fill Buffer forwards the invalidate read request to L2.
  7. Some cache receives the invalidate read and sends its cache line.
  8. The store buffer applies its value to the incoming cache line.
  9. Uh? The Line Fill Buffer marks the entry as invalid.


Questions

  1. Why do we need the Line Fill Buffer if the store buffer already exists to track outstanding store requests?
  2. Is the ordering of events correct in my description?

Solution

  • Why do we need the Line Fill Buffer if the store buffer already exists to track outstanding store requests?

    The store buffer is used to track stores, in order, both before they retire and after they retire but before they commit to the L1 cache². Conceptually, the store buffer is a totally local thing that doesn't really care about cache misses. It deals in "units" of individual stores of various sizes. Chips like Intel Skylake have store buffers of 50+ entries.

    The line fill buffers primarily deal with both loads and stores that miss in the L1 cache. Essentially, they are the path from the L1 cache to the rest of the memory subsystem, and they deal in cache-line-sized units. We don't expect the LFBs to get involved if a load or store hits in the L1 cache¹. Intel chips like Skylake have many fewer LFB entries, probably 10 to 12 (testing points to 12 for Skylake).

    Is the ordering of events correct in my description?

    Pretty close. Here's how I'd change your list:

    1. A store instruction gets decoded and split into store-data and store-address uops, which are renamed, scheduled and have a store buffer entry allocated for them.

    2. The store uops execute in any order or simultaneously (the two sub-items can execute in either order depending mostly on which has its dependencies satisfied first).

      1. The store data uop writes the store data into the store buffer.
      2. The store address uop does the V-P translation and writes the address(es) into the store buffer.
    3. At some point when all older instructions have retired, the store instruction retires. This means that the instruction is no longer speculative and the results can be made visible. At this point, the store remains in the store buffer and is called a senior store.

    4. The store now waits until it is at the head of the store buffer (i.e., it is the oldest uncommitted store), at which point it will commit (become globally observable) into the L1, if the associated cache line is present in the L1 in the MESIF Modified or Exclusive state (i.e., this core owns the line).

    5. If the line is not present in the required state (either missing entirely, i.e., a cache miss, or present but in a non-exclusive state), permission to modify the line, and (sometimes) the line data, must be obtained from the memory subsystem: this allocates an LFB for the entire line, if one is not already allocated. This is a so-called request for ownership (RFO), which means that the memory hierarchy should return the line in an exclusive state suitable for modification, as opposed to a shared state suitable only for reading (this invalidates copies of the line present in any other private caches).

      An RFO to convert Shared to Exclusive still has to wait for a response to make sure all other caches have invalidated their copies. The response to such an invalidate doesn't need to include a copy of the data because this cache already has one. It can still be called an RFO; the important part is gaining ownership before modifying a line.

    6. In the miss scenario the LFB eventually comes back with the full contents of the line, which is committed to the L1, and the pending store can now commit³.

    This is a rough approximation of the process. Some details may differ on some or all chips, including details which are not well understood.

    As one example, in the above order, the store miss lines are not fetched until the store reaches the head of the store queue. In reality, the store subsystem may implement a type of RFO prefetch where the store queue is examined for upcoming stores and if the lines aren't present in L1, a request is started early (the actual visible commit to L1 still has to happen in order, on x86, or at least "as if" in order).

    So the request and LFB use may occur as early as when step 3 completes (if RFO prefetch applies only after a store retires), or perhaps even as early as when 2.2 completes, if junior stores are subject to prefetch.

    As another example, step 6 describes the line coming back from the memory hierarchy and being committed to the L1, then the store commits. It is possible that the pending store is actually merged instead with the returning data and then that is written to L1. It is also possible that the store can leave the store buffer even in the miss case and simply wait in the LFB, freeing up some store buffer entries.


    ¹ In the case of stores that hit in the L1 cache, there is a suggestion that the LFBs are actually involved: that each store actually enters a combining buffer (which may just be an LFB) prior to being committed to the cache, so that a series of stores targeting the same cache line get combined in the buffer and only need to access the L1 once. This isn't proven, and in any case it is not really part of the main use of LFBs (more so since we can't even really tell whether it is happening or not).

    ² The buffers that hold stores before and after retirement might be two entirely different structures, with different sizes and behaviors, but here we'll refer to them as one structure.

    ³ The described scenario involves the store that misses waiting at the head of the store buffer until the associated line returns. An alternate scenario is that the store data is written into the LFB used for the request, and the store buffer entry can be freed. This potentially allows some subsequent stores to be processed while the miss is in progress, subject to the strict x86 ordering requirements. This could increase store MLP (memory-level parallelism).