arm64 riscv memory-barriers memory-model

What Store/Store reordering do modern CPUs do in practice?


AArch64 and RISC-V (under RVWMO) seem to allow Store/Store reordering according to their formal specifications.

However, Store/Store reordering seems very tricky to perform in practice: the CPU would need to guarantee that no exception/mispredict occurs between the two stores.

So I'm curious if there are examples of CPUs that can reorder stores, or if it's basically the case that stores always retire in-order.


Solution

  • Real-world AArch64 CPUs routinely do such reordering. The following litmus test shows frequent reordering on both Cortex-A76 and Apple M3 CPUs. (Compile with either clang++ or g++, using -O3 -std=c++20.)

    #include <atomic>
    #include <thread>
    #include <iostream>
    
    // Over-aligned so that x and y land in different cache lines.
    alignas(256) std::atomic<long> x{0}, y{0};
    
    void writer() {
        long c = 0;
        while (true) {
            x.store(c, std::memory_order_relaxed);
            y.store(c, std::memory_order_relaxed);
            c++;
        }
    }
    
    void reader() {
        while (true) {
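            // Load y first: if the stores became visible in program order,
            // the later load of x could never return a smaller value than y.
            long ytmp = y.load(std::memory_order_acquire);
            long xtmp = x.load(std::memory_order_acquire);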
            if (ytmp > xtmp) {
                std::cout << "reorder: y = " << ytmp
                          << ", x = " << xtmp << std::endl;
            }
        }
    }
    
    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
        return 0;
    }
    

    Typical output:

    reorder: y = 11241, x = 11234
    reorder: y = 52340, x = 52338
    reorder: y = 71663, x = 71634
    reorder: y = 75433, x = 75396
    reorder: y = 76560, x = 76544
    

    Changing the writer's stores from memory_order_relaxed to memory_order_release gives no output, as expected.


    > However, Store/Store reordering seems very tricky to perform in practice: the CPU would need to guarantee that no exception/mispredict occurs between the two stores.

    As I understand it, the typical mechanism for Store/Store reordering is a store buffer, from which entries are allowed to commit as soon as their cache lines become available. A store is put into the store buffer only when it retires (but see below). So by the time the stores are both in the store buffer, they're already non-speculative and the CPU knows that no exception occurred. It just needs to make sure that all stores in the buffer do inevitably commit in finite time, regardless of what may happen later.

    (Alternatively, a store could actually be put in the store buffer while still speculative, to facilitate out-of-order execution, but its entry would need to be marked to not commit to L1d cache until after it is non-speculative.)

    A TSO architecture like x86 can have a store buffer too, with the only difference being that entries commit in strict FIFO order. So an entry added later must wait for earlier entries, even if its cache line is available sooner.

    > So I'm curious if there are examples of CPUs that can reorder stores, or if it's basically the case that stores always retire in-order.

    The point is that a store does not necessarily commit to L1 cache and become visible to other cores at the moment that it retires; thanks to store buffering, it can commit some time later. So even if stores do retire in-order, that does not imply that they become globally visible in-order.