Tags: x86, intel, cpu-architecture, memory-barriers, mesi

How do modern Intel x86 CPUs implement the total order over stores?


x86 guarantees a total order over all stores due to its TSO memory model. My question is whether anyone has an idea of how this is actually implemented.

I have a good idea of how all 4 fences are implemented, so I can explain how local order is preserved. But the 4 fences alone will just give you program order; they won't give you TSO. (I know TSO allows newer loads to be reordered ahead of older stores, so only 3 out of the 4 fences are implicitly needed.)
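
For concreteness, here is the classic store-buffer (SB) litmus test exhibiting the one reordering TSO does allow (a minimal sketch; the variable and function names are just illustrative):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r0, r1;

void t0() {
    x.store(1, std::memory_order_relaxed);  // may sit in the store buffer...
    r0 = y.load(std::memory_order_relaxed); // ...while this load runs first
}

void t1() {
    y.store(1, std::memory_order_relaxed);
    r1 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(t0), b(t1);
    a.join(); b.join();
    // r0 == 0 && r1 == 0 is allowed under TSO (StoreLoad reordering).
    // Only a full barrier (mfence, or a seq_cst RMW) between each store
    // and the following load forbids it; the other three orderings are
    // already implicit in x86's rules.
}
```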

A total order over all memory actions on a single address is the responsibility of cache coherence. But I would like to know how Intel (Skylake in particular) implements a total order on stores across multiple addresses.


Solution

  • The x86 TSO memory model basically amounts to program-order plus a store buffer with store-forwarding. (486 hardware was that simple; later CPUs didn't introduce new reordering.)

    Most of the resulting guarantees are fairly easy in theory for hardware to implement by simply having a store buffer and coherent shared memory: the store buffer insulates OoO exec from the in-order commit requirement (and from cache-miss stores), and makes it possible to speculatively execute stores, and (via store->load forwarding) to reload those stores while they're still speculative.
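
    To make the mechanism concrete, here is a toy software model of a store buffer with store->load forwarding (purely illustrative, not how any real core is built; hardware uses CAM lookups with byte-granularity matching, and all names here are made up):

    ```cpp
    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    struct StoreEntry { uint64_t addr; uint64_t data; };

    struct Core {
        std::deque<StoreEntry> store_buffer;          // program order, oldest first
        std::unordered_map<uint64_t, uint64_t>& mem;  // coherent shared memory

        void store(uint64_t addr, uint64_t data) {
            store_buffer.push_back({addr, data});     // buffered: not yet globally visible
        }

        uint64_t load(uint64_t addr) {
            // Forward from the *youngest* matching buffered store, if any,
            // so this core sees its own stores before anyone else does:
            for (auto it = store_buffer.rbegin(); it != store_buffer.rend(); ++it)
                if (it->addr == addr) return it->data;
            return mem[addr];                         // otherwise read coherent memory
        }

        void commit_oldest() {                        // in-order commit preserves TSO
            mem[store_buffer.front().addr] = store_buffer.front().data;
            store_buffer.pop_front();
        }
    };
    ```

    In this model, draining the buffer (committing every entry) before a later load is exactly what a full barrier does.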

    The only reordering that happens is local, within each CPU core, between its accesses to that globally coherent shared state. (That's why local memory barriers that just make this core wait for stuff to happen, e.g. for the store buffer to drain, can recover sequential consistency on top of x86 TSO. The same applies even to weaker memory models, BTW: just local reordering on top of MESI coherency.)
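
    In C++ terms: on x86 only the seq_cst store needs that extra drain. Typical compilers (a codegen convention, not something the standard mandates) emit it as `xchg`, or `mov` plus `mfence`, while a seq_cst load is a plain `mov`:

    ```cpp
    #include <atomic>

    std::atomic<int> flag{0};

    void publisher() {
        // x86: xchg (or mov + mfence) -- waits for the store buffer to
        // drain before any later load can execute, recovering SC.
        flag.store(1, std::memory_order_seq_cst);
    }

    int consumer() {
        // x86: plain mov -- loads already have acquire semantics under TSO.
        return flag.load(std::memory_order_seq_cst);
    }
    ```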

    The rest of these guarantees apply to each (logical) CPU core individually. (There's a related Q&A about how this can create synchronization between cores.)

    In practice everything can be more complicated, to chase a bit more performance, or a lot more in the case of speculative early loads.

    (In C++ terms, this is at least as strong as acq_rel, but it also covers the behaviour of things that might be UB in C++. For example, a load partially overlapping a recent store to a location another thread might also be reading or writing can let this core load a value that never appeared, and never will appear, in memory for other threads to load. See Globally Invisible load instructions.)
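
    A hedged sketch of that partial-overlap pattern (illustrative only: with another thread concurrently writing the non-overlapping bytes this is a data race, i.e. UB in ISO C++, and the buffer and function names are made up):

    ```cpp
    #include <cstdint>
    #include <cstring>

    alignas(8) unsigned char buf[8]; // imagine another core writing buf[4..7]

    uint64_t store_then_wide_reload() {
        uint32_t narrow = 0xdeadbeef;
        std::memcpy(buf, &narrow, 4);   // 4-byte store, still in the store buffer
        uint64_t wide;
        std::memcpy(&wide, buf, 8);     // 8-byte reload partially overlaps it
        // The reload can take bytes 0..3 from the store buffer and bytes
        // 4..7 from L1d (typically after a store-forwarding stall),
        // producing a combined value that may never exist in globally
        // visible memory for other threads to load.
        return wide;
    }
    ```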

    Footnote 1:
    Some OoO exec weakly-ordered CPUs can do LoadStore reordering, presumably by letting loads retire from the ROB as long as the load has checked permissions and requested the cache line (for a miss), even if the data hasn't actually arrived yet. Some separate tracking of the destination register not being ready is needed, outside the usual instruction scheduler.

    LoadStore reordering is actually easier to understand on an in-order pipeline, where we know special handling for cache-miss loads is needed for acceptable performance. See How is load->store reordering possible with in-order commit?
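
    For reference, here is the load-buffering (LB) litmus test that such LoadStore reordering permits (a sketch with relaxed atomics; names are illustrative):

    ```cpp
    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r0, r1;

    void t0() {
        r0 = x.load(std::memory_order_relaxed); // may miss in cache...
        y.store(1, std::memory_order_relaxed);  // ...while this store commits first
    }

    void t1() {
        r1 = y.load(std::memory_order_relaxed);
        x.store(1, std::memory_order_relaxed);
    }

    int main() {
        std::thread a(t0), b(t1);
        a.join(); b.join();
        // r0 == 1 && r1 == 1 requires LoadStore reordering: forbidden on
        // x86 TSO, architecturally allowed on weakly ordered CPUs.
    }
    ```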