I'm trying to understand how memory controllers maintain program order between non-temporal loads and non-temporal stores when there's significant queue pressure disparity between RPQ (Read Pending Queue) and WPQ (Write Pending Queue).
Consider this sequence:
load A // goes to RPQ
ntstore A // goes to WPQ
If the RPQ has many pending entries (say 20) and the WPQ is relatively empty (say 2 entries), intuitively it seems the store could reach DRAM before the load completes.
I wrote a test program (linked below) to verify this hypothesis.
The core test sequence:
uint64_t val = MAGIC_A;
uint64_t addr = (uint64_t)&memory[test_idx].sentinel;
asm volatile(
    "mov (%1), %%rax\n\t"    // Load the sentinel into rax
    "movnti %%rbx, (%1)\n\t" // NT store MAGIC_B to the same address
    "mov %%rax, %0\n\t"      // Save the loaded value
    : "=r"(val)
    : "r"(addr), "b"(MAGIC_B)
    : "rax", "memory"
);
if (val == MAGIC_B) { // Would indicate the store completed before the load
    local_violations++;
}
Results: Running with 16 threads and 10M cache lines per thread, we see 0 violations. Performance counters confirm we achieved the desired queue pressure:
$ sudo perf stat -e uncore_imc_0/unc_m_rpq_occupancy/ -e uncore_imc_0/unc_m_rpq_inserts/ -e uncore_imc_0/unc_m_wpq_occupancy/ -e uncore_imc_0/unc_m_wpq_inserts/ -e uncore_imc_3/unc_m_rpq_occupancy/ -e uncore_imc_3/unc_m_rpq_inserts/ -e uncore_imc_3/unc_m_wpq_occupancy/ -e uncore_imc_3/unc_m_wpq_inserts/ sleep 60
Performance counter stats for 'system wide':
2,893,410,795,007 uncore_imc_0/unc_m_rpq_occupancy/
9,443,033,953 uncore_imc_0/unc_m_rpq_inserts/
574,954,888,344 uncore_imc_0/unc_m_wpq_occupancy/
32,101,285 uncore_imc_0/unc_m_wpq_inserts/
1,086,622,871 uncore_imc_3/unc_m_rpq_occupancy/
38,269,189 uncore_imc_3/unc_m_rpq_inserts/
76,056,378,805 uncore_imc_3/unc_m_wpq_occupancy/
31,895,245 uncore_imc_3/unc_m_wpq_inserts/
60.002128565 seconds time elapsed
How does the memory controller maintain program order in this scenario? Given the significant disparity in queue depths (RPQ ~306 vs WPQ ~18), what mechanisms prevent the store from completing before its preceding load? I suspect there must be some ordering mechanism beyond simple queue dynamics, but I don't understand what it is.
Here is the complete code: https://gist.github.com/VinayBanakar/6841e553d274fa5b8a156c13937405c8
On x86 (unlike ARM and others), I'm pretty sure loads can't retire (become non-speculative and allow later insns to also retire) until a value is returned. This lets the CPU catch memory-order mis-speculation, since hardware actually loads early and checks for things like LoadLoad ordering violations and nukes the pipeline if necessary (the machine_clears.memory_ordering perf event).
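On Intel CPUs that expose that event, you can count those pipeline nukes directly during a run of your test; the binary name here is just a placeholder:

perf stat -e machine_clears.memory_ordering ./your_test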
Weakly-ordered ISAs don't need to do that; they can let a load retire from the ROB (ReOrder Buffer) once it's known to be non-faulting and the request has been sent. So it's only tracked by a load buffer entry, not also a ROB entry.
A store can't commit from the store buffer to an LFB or L1d cache until after it becomes non-speculative, otherwise that could make mis-speculated store values visible to other cores where they couldn't be rolled back. (e.g. after detecting an earlier branch mispredict or faulting instruction.)
So x86 out-of-order exec hardware is fundamentally incapable of LoadStore reordering, even with weakly-ordered NT stores. Your two off-core requests to the memory controller can't both be in flight at once.
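To make that concrete, here's my own sketch of the classic load-buffering litmus test; the "forbidden" outcome is exactly what LoadStore reordering would produce. (Relaxed C++ atomics don't formally rule this outcome out at the source level; the point is about what x86 hardware can do with the straightforwardly compiled asm, a load followed by a store in each thread.)

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;   // each written by exactly one thread, read only after join

void t1() { r1 = x.load(std::memory_order_relaxed); y.store(1, std::memory_order_relaxed); }
void t2() { r2 = y.load(std::memory_order_relaxed); x.store(1, std::memory_order_relaxed); }

int main()
{
    std::thread a(t1), b(t2);
    a.join(); b.join();
    // r1 == 1 && r2 == 1 would require each core to make its store visible
    // before its program-earlier load got a value, i.e. LoadStore reordering.
    // x86 never produces that outcome for this asm.
}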
SSE4.1 movntdqa loads from WC memory are weakly ordered (unlike movntdqa loads from any other memory type; unlike NT stores, the instruction doesn't override the memory type's ordering semantics). Hypothetically, they could be allowed to retire before the data arrives (before a response to the off-core request). That wouldn't violate the memory model because they're allowed to reorder freely with earlier or later loads and stores (I think), and you potentially need a full mfence to block their reordering. I don't know if any real CPUs let them retire while data is still in flight, or if they all handle them like normal loads except without checking for memory-order mis-speculation. I'd guess the latter, since it's a pretty rare use-case and the benefit would probably be minor.
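In intrinsics form, that weakly-ordered NT-load case looks roughly like this (my own sketch; it assumes wc_buf really points at WC-mapped memory, e.g. a mapped device/framebuffer region, which a normal allocation won't give you):

#include <immintrin.h>   // _mm_stream_load_si128 (SSE4.1), _mm_mfence

__m128i read_wc_chunk(void *wc_buf)
{
    // movntdqa: weakly ordered only because the underlying memory type is WC.
    __m128i v = _mm_stream_load_si128((__m128i *)wc_buf);
    // A full fence is what you'd reach for if you needed to order this load
    // against surrounding loads/stores to WC memory.
    _mm_mfence();
    return v;
}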
But you're using plain mov loads on C++ objects you allocated normally (globals, or new via std::vector), which will be in WB (Write-Back) memory regions, so none of this applies to your testcase.
Your load and NT store both use the same address, (%1). It's architecturally required that loads and stores to the same address from the same logical core (thread) don't reorder. I'm pretty sure this guarantee extends even to weakly-ordered NT loads from WC memory, but I didn't double-check.
It's certainly true for plain loads/stores even on weakly-ordered ISAs like PowerPC. All current mainstream CPUs implement C++'s read-write coherency guarantee for free: no asm barriers are needed between shared.load(relaxed); and shared.store(newval, relaxed);, with no possibility of the load seeing the store.
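A minimal sketch of that guarantee in C++ (illustrative names; shared is just a global relaxed atomic):

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> shared{0};

std::uint64_t load_then_store(std::uint64_t newval)
{
    std::uint64_t seen = shared.load(std::memory_order_relaxed);  // read...
    shared.store(newval, std::memory_order_relaxed);              // ...then write the same object
    // `seen` can be any value stored before the load in the coherence order,
    // but never the newval written by the program-later store, even though no
    // barrier separates the two operations.
    return seen;
}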
Memory disambiguation within the core would see that the store is younger, so it must not store-forward from that store to the load; and any off-core reordering in a memory controller would have to be designed to maintain sufficient ordering, perhaps with sequence numbers attached to requests? I'm not sure how cache-coherent interconnects between CPUs and memory controllers, or the memory controllers themselves, handle this. I've read that maintaining sufficient ordering in the interconnect between cores doesn't happen automatically; it's something CPU architects have to build in.