I'm trying to understand how memory controllers maintain program order between non-temporal loads and non-temporal stores when there's significant queue pressure disparity between RPQ (Read Pending Queue) and WPQ (Write Pending Queue).
Consider this sequence:
load A // goes to RPQ
ntstore A // goes to WPQ
If the RPQ has many pending entries (say 20) and the WPQ is relatively empty (say 2 entries), intuitively it seems the store could reach DRAM before the load completes.
I wrote a test program (linked below) to verify this hypothesis.
The core test sequence:
uint64_t val = MAGIC_A;
uint64_t addr = (uint64_t)&memory[test_idx].sentinel;
asm volatile(
    "mov (%1), %%rax\n\t"    // Load the sentinel into rax
    "movnti %%rbx, (%1)\n\t" // NT store MAGIC_B to the same address
    "mov %%rax, %0\n\t"      // Save the loaded value
    : "=r"(val)
    : "r"(addr), "b"(MAGIC_B)
    : "rax", "memory"
);
if (val == MAGIC_B) { // Would indicate the store completed before the load
    local_violations++;
}
Results: Running with 16 threads and 10M cache lines per thread, we see 0 violations. Performance counters confirm we achieved the desired queue pressure:
$ sudo perf stat -e uncore_imc_0/unc_m_rpq_occupancy/ -e uncore_imc_0/unc_m_rpq_inserts/ -e uncore_imc_0/unc_m_wpq_occupancy/ -e uncore_imc_0/unc_m_wpq_inserts/ -e uncore_imc_3/unc_m_rpq_occupancy/ -e uncore_imc_3/unc_m_rpq_inserts/ -e uncore_imc_3/unc_m_wpq_occupancy/ -e uncore_imc_3/unc_m_wpq_inserts/ sleep 60
Performance counter stats for 'system wide':
2,893,410,795,007 uncore_imc_0/unc_m_rpq_occupancy/
9,443,033,953 uncore_imc_0/unc_m_rpq_inserts/
574,954,888,344 uncore_imc_0/unc_m_wpq_occupancy/
32,101,285 uncore_imc_0/unc_m_wpq_inserts/
1,086,622,871 uncore_imc_3/unc_m_rpq_occupancy/
38,269,189 uncore_imc_3/unc_m_rpq_inserts/
76,056,378,805 uncore_imc_3/unc_m_wpq_occupancy/
31,895,245 uncore_imc_3/unc_m_wpq_inserts/
60.002128565 seconds time elapsed
How does the memory controller maintain program order in this scenario? Given the significant disparity in queue depths (RPQ ~306 vs WPQ ~18), what mechanisms prevent the store from completing before its preceding load? I suspect there must be some ordering mechanism beyond simple queue dynamics, but I don't understand what it is.
Here is the complete code: https://gist.github.com/VinayBanakar/6841e553d274fa5b8a156c13937405c8
On x86 (unlike ARM and others), I'm pretty sure loads can't retire (become non-speculative and allow later insns to also retire) until a value is returned. This lets the CPU catch memory-order mis-speculation, since hardware actually loads early and checks for things like LoadLoad ordering violations and nukes the pipeline if necessary (the machine_clears.memory_ordering perf event).
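On Intel CPUs that expose that event, you can count those pipeline nukes directly during a run of your test; the binary name here is just a placeholder:

perf stat -e machine_clears.memory_ordering ./your_test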
Weakly-ordered ISAs don't need to do that; they can let a load retire from the ROB (ReOrder Buffer) once it's known to be non-faulting and the request has been sent. So it's only tracked by a load buffer entry, not also a ROB entry.
A store can't commit from the store buffer to an LFB or L1d cache until after it becomes non-speculative, otherwise that could make mis-speculated store values visible to other cores where they couldn't be rolled back. (e.g. after detecting an earlier branch mispredict or faulting instruction.)
So x86 out-of-order exec hardware is fundamentally incapable of LoadStore reordering, even with weakly-ordered NT stores. Your two off-core requests to the memory controller can't both be in flight at once.
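To make that concrete, here's my own sketch of the classic load-buffering litmus test; the "forbidden" outcome is exactly what LoadStore reordering would produce. (Relaxed C++ atomics don't formally rule this outcome out at the source level; the point is about what x86 hardware can do with the straightforwardly compiled asm, a load followed by a store in each thread.)

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;   // each written by exactly one thread, read only after join

void t1() { r1 = x.load(std::memory_order_relaxed); y.store(1, std::memory_order_relaxed); }
void t2() { r2 = y.load(std::memory_order_relaxed); x.store(1, std::memory_order_relaxed); }

int main()
{
    std::thread a(t1), b(t2);
    a.join(); b.join();
    // r1 == 1 && r2 == 1 would require each core to make its store visible
    // before its program-earlier load got a value, i.e. LoadStore reordering.
    // x86 never produces that outcome for this asm.
}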
SSE4.1 movntdqa loads from WC memory are weakly ordered (unlike movntdqa loads from any other memory type; unlike NT stores, the instruction doesn't override the memory type's ordering semantics). Hypothetically, they could be allowed to retire before the data arrives (before a response to the off-core request). That wouldn't violate the memory model because they're allowed to reorder freely with earlier or later loads and stores (I think), and you potentially need a full mfence to block their reordering. I don't know if any real CPUs let them retire while data is still in flight, or if they all handle them like normal loads except without checking for memory-order mis-speculation. I'd guess the latter, since it's a pretty rare use-case and the benefit would probably be minor.
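In intrinsics form, that weakly-ordered NT-load case looks roughly like this (my own sketch; it assumes wc_buf really points at WC-mapped memory, e.g. a mapped device/framebuffer region, which a normal allocation won't give you):

#include <immintrin.h>   // _mm_stream_load_si128 (SSE4.1), _mm_mfence

__m128i read_wc_chunk(void *wc_buf)
{
    // movntdqa: weakly ordered only because the underlying memory type is WC.
    __m128i v = _mm_stream_load_si128((__m128i *)wc_buf);
    // A full fence is what you'd reach for if you needed to order this load
    // against surrounding loads/stores to WC memory.
    _mm_mfence();
    return v;
}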
But you're using plain mov loads on C++ objects you allocated normally (globals, or new via std::vector), which will be in WB (Write-Back) memory regions, so none of this applies to your testcase.
Your load and NT store both use the same address, (%1). It's architecturally required that loads and stores to the same address from the same logical core (thread) don't reorder. I'm pretty sure this guarantee extends even to weakly-ordered NT loads from WC memory, but I didn't double-check.
It's certainly true for plain loads/stores even on weakly-ordered ISAs like PowerPC. All current mainstream CPUs implement C++'s read-write coherency guarantee for free: no asm barriers are needed between shared.load(relaxed); and shared.store(newval, relaxed);, with no possibility of the load seeing the store.
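A minimal sketch of that guarantee in C++ (illustrative names; shared is just a global relaxed atomic):

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> shared{0};

std::uint64_t load_then_store(std::uint64_t newval)
{
    std::uint64_t seen = shared.load(std::memory_order_relaxed);  // read...
    shared.store(newval, std::memory_order_relaxed);              // ...then write the same object
    // `seen` can be any value stored before the load in the coherence order,
    // but never the newval written by the program-later store, even though no
    // barrier separates the two operations.
    return seen;
}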
Memory disambiguation within the core would see that the store is younger, so it must not store-forward from that store to the load; and any off-core reordering in a memory controller would have to be designed to maintain sufficient ordering, perhaps with sequence numbers attached to requests? I'm not sure how cache-coherent interconnects between CPUs and memory controllers, or the memory controllers themselves, handle this. I've read that maintaining sufficient ordering in the interconnect between cores doesn't happen automatically; it's something CPU architects have to build in.