arm64 · riscv · memory-barriers · memory-model

What Store/Store reordering do modern CPUs do in practice?


AArch64 and RISC-V (RVWMO) seem to allow Store/Store reordering according to their formal memory-model specifications.

However, Store/Store reordering seems very tricky to perform in practice: the CPU would need to guarantee that no exception/mispredict occurs between the two stores.

So I'm curious whether there are examples of CPUs that can reorder stores, or whether it's basically the case that stores always retire in order.


Solution

  • Real-world AArch64 CPUs routinely do such reordering. The following litmus test shows frequent reordering on both Cortex-A76 and Apple M3 CPUs. Compile with either clang++ or g++, using -O3 -std=c++20 (plus -pthread if your toolchain needs it for std::thread). (Godbolt link)

    #include <atomic>
    #include <thread>
    #include <iostream>
    
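    // alignas(256) keeps x and y on separate cache lines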
    alignas(256) std::atomic<long> x{0}, y{0};
    
    void writer() {
        long c = 0;
        while (true) {
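            // x is stored before y with the same counter value; if stores
            // became visible in program order, a reader could never see y
            // ahead of x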
            x.store(c, std::memory_order_relaxed);
            y.store(c, std::memory_order_relaxed);
            c++;
        }
    }
    
    void reader() {
        while (true) {
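            // acquire loads: the x load cannot be reordered before the y
            // load, so any y > x observation must come from the writer side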
            long ytmp = y.load(std::memory_order_acquire);
            long xtmp = x.load(std::memory_order_acquire);
            if (ytmp > xtmp) {
                std::cout << "reorder: y = " << ytmp
                          << ", x = " << xtmp << std::endl;
            }
        }
    }
    
    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
        return 0;
    }
    

    Typical output:

    reorder: y = 11241, x = 11234
    reorder: y = 52340, x = 52338
    reorder: y = 71663, x = 71634
    reorder: y = 75433, x = 75396
    reorder: y = 76560, x = 76544
    

    Changing memory_order_relaxed to memory_order_release gives no output, as expected.

    Operative parts of the assembly ([x] and [y] stand in for registers holding the addresses of x and y):

    writer_loop:
            str     x8, [x]
            str     x8, [y]
            add     x8, x8, #1
            b       writer_loop
    
    reader_loop:
            ldar    x23, [y]   // or ldapr with appropriate -march flags
            ldar    x22, [x]   // ditto
            cmp     x23, x22
            b.le    reader_loop
            // else print message about reordering
    
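    For comparison, with memory_order_release the writer's stores compile to stlr (store-release); in the same shorthand, roughly:

    writer_loop:
            stlr    x8, [x]   // store-release
            stlr    x8, [y]   // store-release: the earlier store to x must be
                              // visible before this store is
            add     x8, x8, #1
            b       writer_loop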

    "However, Store/Store reordering seems very tricky to perform in practice: the CPU would need to guarantee that no exception/mispredict occurs between the two stores."

    No, it's not that difficult.

    The typical mechanism for Store/Store reordering is a store buffer. An entry is created in the store buffer when a store instruction executes speculatively, and any later loads on this core from that address will be fulfilled from the store buffer (store forwarding). If and when the store instruction retires (becomes non-speculative), the store buffer entry is marked as "graduated". It can be committed to L1d cache at any time from then on. If its cache line is not currently owned by this core, the entry continues to be held in the store buffer until the cache line becomes available, at which point it will be committed in the background. In the meantime, the core continues executing and retiring subsequent instructions. This can include later store instructions, causing their store buffer entries to graduate as well.

    But the key point is that a store commits after its instruction retires, at which point we already know for certain that no exception occurred to prevent it from doing so. You seem to be imagining a model where a store commits before its instruction retires, based on the CPU peeking into the future and foreseeing that no exception will occur before then. That's not what happens.

    (The flip side of this is that once a store buffer entry graduates, the core must ensure that it commits to L1d cache in finite time, come hell or high water. In particular, if an exception or interrupt is taken on a later cycle, it can wipe out the ungraduated entries (which were speculative and must now never be seen by other cores), but the graduated ones must stay. If an exception is a synchronizing event on this architecture, or if the exception handler chooses to execute a barrier, then the store buffer may be drained: the core simply stalls execution of any further instructions until all graduated entries have committed.)

    So all that needs to happen for Store/Store reordering is a scenario like the following:

     1. The stores to x and y both execute, each creating an entry in the store buffer.
     2. Both instructions retire in program order, so both entries graduate.
     3. y's cache line is already owned by this core, so y's entry commits to L1d cache right away.
     4. x's cache line is currently held by another core, so x's entry sits in the store buffer until the line arrives, and only then commits.

    The x and y store instructions retired in program order, as they must; but they committed to L1d cache, and thus became visible to other cores, in the opposite order.
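
    To make the bookkeeping concrete, here is a toy software model of that scenario. It is purely illustrative: every name in it (StoreBuffer, Entry, owned_lines, and so on) is invented for this sketch, store forwarding is omitted, and a real store buffer is a hardware structure, not code.

    #include <cstdint>
    #include <deque>
    #include <iostream>
    #include <optional>
    #include <unordered_set>
    
    struct Entry {
        std::uintptr_t line;      // cache line the store targets
        long value;
        bool graduated = false;   // set when the store instruction retires
    };
    
    struct StoreBuffer {
        std::deque<Entry> entries;                      // held in program order
        std::unordered_set<std::uintptr_t> owned_lines; // lines this core owns
    
        // A store executing (possibly speculatively) creates an entry.
        void execute_store(std::uintptr_t line, long v) {
            entries.push_back({line, v});
        }
    
        // Retirement happens in program order: graduate the oldest
        // still-speculative entry.
        void retire_oldest_store() {
            for (auto& e : entries)
                if (!e.graduated) { e.graduated = true; return; }
        }
    
        // A weakly ordered core may commit ANY graduated entry whose cache
        // line it owns, regardless of age: the source of Store/Store reordering.
        std::optional<Entry> commit_one() {
            for (auto it = entries.begin(); it != entries.end(); ++it) {
                if (it->graduated && owned_lines.count(it->line)) {
                    Entry e = *it;
                    entries.erase(it);
                    return e;            // now visible to other cores
                }
            }
            return std::nullopt;         // nothing committable yet
        }
    
        // An exception squashes speculative entries; graduated ones must
        // stay, and will still commit in finite time.
        void take_exception() {
            std::erase_if(entries, [](const Entry& e) { return !e.graduated; });
        }
    };
    
    int main() {
        constexpr std::uintptr_t X = 0x100, Y = 0x200; // x's and y's lines
        StoreBuffer sb;
        sb.owned_lines = {Y};         // y's line is owned; x's is not (yet)
    
        sb.execute_store(X, 1);       // str x
        sb.execute_store(Y, 1);       // str y
        sb.retire_oldest_store();     // both retire in program order...
        sb.retire_oldest_store();
    
        auto first = sb.commit_one(); // ...but y commits first
        std::cout << std::hex << "first commit:  line 0x" << first->line << '\n';
    
        sb.owned_lines.insert(X);     // x's cache line finally arrives
        auto second = sb.commit_one();
        std::cout << std::hex << "second commit: line 0x" << second->line << '\n';
    }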

    A TSO architecture like x86 can have a store buffer too, with the only difference being that entries commit in strict FIFO order. In that model, an entry added later must wait for earlier entries, even if its cache line is available sooner.
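
    In the toy model above, TSO amounts to a one-function change (again just an illustration, reusing the Entry and StoreBuffer types from the sketch): only the oldest entry may ever commit, so a younger entry whose cache line is ready still waits.

    // TSO-style commit: strictly FIFO. Only the oldest graduated entry may
    // commit, even if a younger entry's cache line became available sooner.
    std::optional<Entry> commit_one_fifo(StoreBuffer& sb) {
        if (!sb.entries.empty()
            && sb.entries.front().graduated
            && sb.owned_lines.count(sb.entries.front().line)) {
            Entry e = sb.entries.front();
            sb.entries.pop_front();
            return e;
        }
        return std::nullopt;  // oldest not ready: everything younger waits too
    }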