Tags: x86, intel, cpu-architecture, memory-barriers, mesi

How do modern Intel x86 CPUs implement the total order over stores?


x86 guarantees a total order over all stores due to its TSO memory model. My question is whether anyone has an idea of how this is actually implemented.

I have a good idea of how all 4 fences are implemented, so I can explain how local order is preserved. But the 4 fences alone will just give you program order; they won't give you TSO. (I know TSO allows newer loads to be reordered ahead of older stores, so only 3 out of the 4 fences are implicitly needed.)
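
For concreteness, here is the classic store-buffer (SB) litmus test exhibiting the one reordering TSO does allow (a minimal sketch; the variable and function names are just illustrative):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r0, r1;

void t0() {
    x.store(1, std::memory_order_relaxed);  // may sit in the store buffer...
    r0 = y.load(std::memory_order_relaxed); // ...while this load runs first
}

void t1() {
    y.store(1, std::memory_order_relaxed);
    r1 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(t0), b(t1);
    a.join(); b.join();
    // r0 == 0 && r1 == 0 is allowed under TSO (StoreLoad reordering).
    // Only a full barrier (mfence, or a seq_cst RMW) between each store
    // and the following load forbids it; the other three orderings are
    // already implicit in x86's rules.
}
```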

A total order over all memory actions on a single address is the responsibility of cache coherence. But I would like to know how Intel (Skylake in particular) implements a total order on stores across multiple addresses.


Solution

  • The x86 TSO memory model basically amounts to program-order plus a store buffer with store-forwarding. (486 hardware was that simple; later CPUs didn't introduce new reordering.)

    Most of the resulting guarantees are fairly easy in theory for hardware to implement by simply having a store buffer and coherent shared memory: the store buffer insulates OoO exec from the in-order commit requirement (and from cache-miss stores), and makes it possible to speculatively execute stores, and (via store->load forwarding) to reload those stores while they're still speculative.
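
    To make the mechanism concrete, here is a toy software model of a store buffer with store->load forwarding (purely illustrative, not how any real core is built; hardware uses CAM lookups with byte-granularity matching, and all names here are made up):

    ```cpp
    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    struct StoreEntry { uint64_t addr; uint64_t data; };

    struct Core {
        std::deque<StoreEntry> store_buffer;          // program order, oldest first
        std::unordered_map<uint64_t, uint64_t>& mem;  // coherent shared memory

        void store(uint64_t addr, uint64_t data) {
            store_buffer.push_back({addr, data});     // buffered: not yet globally visible
        }

        uint64_t load(uint64_t addr) {
            // Forward from the *youngest* matching buffered store, if any,
            // so this core sees its own stores before anyone else does:
            for (auto it = store_buffer.rbegin(); it != store_buffer.rend(); ++it)
                if (it->addr == addr) return it->data;
            return mem[addr];                         // otherwise read coherent memory
        }

        void commit_oldest() {                        // in-order commit preserves TSO
            mem[store_buffer.front().addr] = store_buffer.front().data;
            store_buffer.pop_front();
        }
    };
    ```

    In this model, draining the buffer (committing every entry) before a later load is exactly what a full barrier does.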

    The only reordering that happens is local, within each CPU core, between its accesses to that globally coherent shared state. (That's why local memory barriers that just make this core wait for stuff to happen, e.g. for the store buffer to drain, can recover sequential consistency on top of x86 TSO. The same applies even to weaker memory models, BTW: just local reordering on top of MESI coherency.)
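
    In C++ terms: on x86 only the seq_cst store needs that extra drain. Typical compilers (a codegen convention, not something the standard mandates) emit it as `xchg`, or `mov` plus `mfence`, while a seq_cst load is a plain `mov`:

    ```cpp
    #include <atomic>

    std::atomic<int> flag{0};

    void publisher() {
        // x86: xchg (or mov + mfence) -- waits for the store buffer to
        // drain before any later load can execute, recovering SC.
        flag.store(1, std::memory_order_seq_cst);
    }

    int consumer() {
        // x86: plain mov -- loads already have acquire semantics under TSO.
        return flag.load(std::memory_order_seq_cst);
    }
    ```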

    The rest of these guarantees apply to each (logical) CPU core individually. (There's a related Q&A about how this can create synchronization between cores.)

    In practice everything can be more complicated, to chase a bit more performance, or a lot more in the case of speculative early loads.

    (In C++ terms, this is at least as strong as acq_rel, but it also covers the behaviour of things that might be UB in C++. For example, a load partially overlapping a recent store to a location another thread might also be reading or writing can let this core load a value that never appeared, and never will appear, in memory for other threads to load. See Globally Invisible load instructions.)
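
    A hedged sketch of that partial-overlap pattern (illustrative only: with another thread concurrently writing the non-overlapping bytes this is a data race, i.e. UB in ISO C++, and the buffer and function names are made up):

    ```cpp
    #include <cstdint>
    #include <cstring>

    alignas(8) unsigned char buf[8]; // imagine another core writing buf[4..7]

    uint64_t store_then_wide_reload() {
        uint32_t narrow = 0xdeadbeef;
        std::memcpy(buf, &narrow, 4);   // 4-byte store, still in the store buffer
        uint64_t wide;
        std::memcpy(&wide, buf, 8);     // 8-byte reload partially overlaps it
        // The reload can take bytes 0..3 from the store buffer and bytes
        // 4..7 from L1d (typically after a store-forwarding stall),
        // producing a combined value that may never exist in globally
        // visible memory for other threads to load.
        return wide;
    }
    ```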

    Footnote 1:
    Some OoO exec weakly-ordered CPUs can do LoadStore reordering, presumably by letting loads retire from the ROB as long as the load has checked permissions and requested the cache line (for a miss), even if the data hasn't actually arrived yet. Some separate tracking of the destination register not being ready is needed, outside the usual instruction scheduler.

    LoadStore reordering is actually easier to understand on an in-order pipeline, where we know special handling for cache-miss loads is needed for acceptable performance. See How is load->store reordering possible with in-order commit?
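
    For reference, here is the load-buffering (LB) litmus test that such LoadStore reordering permits (a sketch with relaxed atomics; names are illustrative):

    ```cpp
    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r0, r1;

    void t0() {
        r0 = x.load(std::memory_order_relaxed); // may miss in cache...
        y.store(1, std::memory_order_relaxed);  // ...while this store commits first
    }

    void t1() {
        r1 = y.load(std::memory_order_relaxed);
        x.store(1, std::memory_order_relaxed);
    }

    int main() {
        std::thread a(t0), b(t1);
        a.join(); b.join();
        // r0 == 1 && r1 == 1 requires LoadStore reordering: forbidden on
        // x86 TSO, architecturally allowed on weakly ordered CPUs.
    }
    ```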