I want to clarify how modern x86 architectures handle the latency of memory transactions that go all the way to DRAM. Specifically, which resources (which queues) get occupied waiting for the memory transactions in flight? And which resource is typically the most precious, i.e. what can typically become the bottleneck that limits how many transactions stay in flight concurrently, reduces the latency-hiding effect, and limits the memory bandwidth utilization of a single CPU core?
For example, I would like to understand what is referred to as "available concurrency" in articles on Little's law like *Single-core memory bandwidth: Latency, Bandwidth, and Concurrency*, or in Chips and Cheese articles like the one on the Gracemont architecture.
I think that in this case the "available concurrency" should mean the size of the load and store queues in the Load/Store unit of the core. But I am not sure whether the numbers make sense in a concrete case. So I'd like to check my understanding, see how large the effect of the fuzziness is in that concrete case, and whether other resources of the core have to be considered too.
The example is the following: the Chips and Cheese article about the Gracemont architecture mentions that the DRAM latency is about 112ns and the single-core bandwidth is about 19.5 GB/s. (On an N150 processor, I see about the same overall bandwidth in a program that does 1 load and 1 store in a loop.) So the concurrency actually used, counted in 64B L1 cache lines in flight, should be:
19.5GB/s * 112ns / 64B = 34
And to saturate the bus it would need to maintain 56 L1 cache lines in flight:
32GB/s * 112ns / 64B = 56
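(For reference, here is the same Little's-law arithmetic written out as a tiny C snippet of my own; the constants are just the numbers quoted above, nothing is measured here.)

```c
#include <stdio.h>

// Little's law: concurrency = throughput * latency.
// Counting throughput in 64B cache lines per second gives the number of
// cache lines that must be in flight to sustain a given bandwidth.
int main(void) {
    const double latency = 112e-9;   // ~112 ns DRAM latency
    const double line    = 64.0;     // 64B L1 cache line

    double bw_measured = 19.5e9;     // measured single-core bandwidth, bytes/s
    double bw_bus      = 32e9;       // bandwidth needed to saturate the bus, bytes/s

    printf("lines in flight at 19.5 GB/s: %.1f\n", bw_measured * latency / line); // ~34
    printf("lines in flight at 32 GB/s:   %.1f\n", bw_bus      * latency / line); // ~56
    return 0;
}
```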
The Gracemont micro-architecture diagram shows that the load queue size is 80 entries and the store queue size is 50 entries. And the queues handle two 128-bit (16B) loads and stores per cycle. So does that mean the queue entries are 128 bits, i.e. 16B, wide?
How does it agree with the measured single-core bandwidth?
It is a program that loads data first (missing the cache on the load, or relying on the HW prefetchers), then stores it back into the L1 cache; the cache eviction takes care of the DRAM store transactions without involving the core.
There are 10GB/s of loads and 10GB/s of stores.
So it must keep 10GB/s * 112ns / 16B = 70 transactions of 16B in flight.
Then, does it mean that the program is limited by the load queue size? It is not limited by the 50-entry store queue, because the stores are handled concurrently by the cache eviction. And the burstiness of the issued loads degrades the load queue utilization, so that the queue occupancy stays at 70 instead of closer to 80?
In the Chips and Cheese Gracemont article, they say that "Past L2, Gracemont seems limited by high latency." Bandwidth is limited by latency only when there is not enough concurrency to keep more transactions in flight and get the latency-hiding effect. They do mention the shortness of the load/store queues:
> If there’s anything we can find fault with, it’s the load and store queues. They’re a touch small and may have a harder time absorbing bursty memory activity, particularly if accesses miss the L1 or L2 caches and maximum reordering capacity is needed.
So, do they mean that when max reordering capacity is needed (a full ROB of micro-ops with memory accesses), the load/store queues become a bottleneck because they are just too short? Is that correct?
In general, as far as I understand it, a regular memory access micro-op (let's consider a micro-op from something like `mov` or `add` first, and get to the prefetch instruction after that) should also occupy a slot in the Reorder Buffer (ROB) while the memory transaction is in flight.
But typically ROB is just bigger than the load/store queues, so that it is never a bottleneck for the bandwidth?
The micro-op also occupies a slot in the corresponding execution scheduler (and related non-scheduling queues). But when the micro-op is executed, the scheduler should just free it from its queue immediately, so that cannot be a problem. The ROB slot is not freed immediately, because the ROB has to wait for the retirement/completion of the memory access micro-op, in case the memory access raises errors. In case of a load, the micro-op also has to actually load the data into an internal physical register. But in case of a store, I guess, possible errors are the only reason not to retire the micro-op immediately after dispatching it to the memory sub-system.
I.e. the execution schedulers and other core units do not wait for a memory access micro-op to complete. Only 2 CPU core units do wait for a memory transaction to complete: the ROB and the Load/Store unit.
I wonder whether it is different for the prefetch instruction? In principle, a software prefetch could be a cheaper micro-op: issue a transaction to the load/store unit and immediately retire to free the ROB slot. It could also trigger more memory traffic (more cache lines transferred) than a usual micro-op of a regular width, which could be useful for controlling the memory bus saturation without needing a lot of concurrency in the queues.
Update: some PMU numbers on cache and memory behavior with different prefetcher states. They are measured on the same program as in this question, using the same N150 Intel processor, i.e. one cluster of Gracemont E-cores. The program runs over a 1GB array, performing a trivial operation on each element and writing it back. It makes 32 repetitions to gather more PMU statistics. It is compiled with `-O3 -march=gracemont -fno-unroll-loops`, and it reaches 20GB/s of DRAM bandwidth: 10GB/s of loads and 10GB/s of stores. Intel VTune reports that, according to its micro-benchmark, the max bandwidth is 25GB/s.
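For concreteness, the measured loop is essentially of this shape (a simplified sketch; the per-element operation shown here is just a placeholder, and the timing/measurement code is omitted):

```c
#include <stdint.h>
#include <stdlib.h>

#define N    ((size_t)(1024u * 1024u * 1024u) / sizeof(uint64_t))  // 1GB array
#define REPS 32                                                    // repeat for PMU statistics

int main(void) {
    uint64_t *a = malloc(N * sizeof *a);
    if (!a) return 1;

    for (int r = 0; r < REPS; r++) {
        // One load and one store per element: the loads miss the caches
        // (or are covered by the HW prefetchers), the stores go to L1d and
        // reach DRAM later via cache eviction / write-back.
        for (size_t i = 0; i < N; i++)
            a[i] = a[i] + 1;   // placeholder for the "trivial operation"
    }

    uint64_t keep = a[N - 1];  // keep the result observable
    free(a);
    return (int)keep;
}
```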
I changed the prefetcher state using MSR `0x1a4`:
sudo wrmsr --all 0x1a4 0x804 # default value, all prefetchers on
sudo wrmsr --all 0x1a4 0x80c # L2 prefetcher on, L1 IP off
sudo wrmsr --all 0x1a4 0x805 # L2 off, L1 IP on
sudo wrmsr --all 0x1a4 0x80d # L2 off, and L1 IP off
And I looked at events like `mem_load_uops_retired.dram_hit` and `mem_bound_stalls.load_dram_hit`, and took the ratio of stalled cycles to retired uops as an estimate of the average cost (in stall cycles) of a DRAM hit.
The overall traffic at the LLC is counted by the `longest_lat_cache.reference` and `.miss` events, the demand traffic by `ocr.demand_data_rd.any_response` and `ocr.demand_data_rd.l3_miss`. Then there are the DRAM-hit events: the number of retired uops that hit DRAM, `mem_load_uops_retired.dram_hit`, and the number of cycles stalled on a DRAM hit, `mem_bound_stalls.load_dram_hit`.
prefetch state | LLC reference | LLC miss | demand any_response | demand l3_miss | retired dram_hit | stalls dram_hit | ratio stalls/retires | all cycles |
---|---|---|---|---|---|---|---|---|
0xd 0b1101 no L2, no L1 IP | 548M | 532M | 510M | 506M | 504M | 14_933M | 29.6 | 23_037M |
0x5 0b0101 no L2, L1 IP yes | 545M | 531M | 87M | 84M | 81M | 12_764M | 157.6 | 19_371M |
0xc 0b1100 L2 yes, no L1 IP | 981M | 910M | 220M | 217M | 231M | 5_871M | 25.4 | 12_371M |
0x4 0b0100 L2 yes, L1 IP yes | 974M | 919M | 265M | 261M | 32M | 6_108M | 190.8 | 11_851M |
There are some clear points:
- In general, when the L2 prefetcher is on, there are fewer retired `dram_hit` uops. I.e. the prefetcher does its job: fewer instructions go all the way to DRAM.
- And when the L1 IP prefetcher is on, there are even fewer retired `dram_hit` uops. For example, the L1 IP prefetcher reduces the `dram_hit` retired uops from 231M down to 32M when L2 is also on.
- But when the L1 IP prefetcher is on and many DRAM accesses are avoided, the remaining DRAM hits are slower: the `mem_bound_stalls.load_dram_hit` stalls get longer. (`perf` lists events like `mem_uops_retired.load_latency_gt_64`, but I did not figure out how to use them; `perf stat` shows them as "unsupported".)
- E.g. when the L2 prefetcher is ON, the overall speed is about the same regardless of whether L1 IP is on or off. (It is slightly better when the prefetcher is on.) I.e. it looks like the queues do absorb the latency when the L1 IP prefetcher is off.
- But this latency hiding is not sufficient to compensate for the L2 prefetcher being off.
- The L2 prefetcher adds non-demand traffic at the LLC, but reduces the LLC demand traffic by a factor of 2 or more.
In general, on why the bandwidth is not saturated: it looks like when the prefetchers are on, they reduce the demand traffic to DRAM, but the remaining demand traffic is not covered as well by the queue concurrency. So it looks like what Peter says: the prefetches and the demand requests compete for the same resources (the superqueue etc.), and that resource becomes the bottleneck. It also seems like the prefetcher makes more efficient use of those resources than demand requests in flight do: when the L2 prefetcher is on, it improves the bandwidth utilization much more than what the available queues could provide by latency-hiding of transactions in flight.
But also strange ones:
- The retired DRAM-hit uops (`mem_load_uops_retired.dram_hit`) do not match the demand LLC misses (`ocr.demand_data_rd.l3_miss`) - why so?

Table of uops hits:
prefetch state | mem_load_uops_retired.l2_hit | .l3_hit | .dram_hit |
---|---|---|---|
0xd 0b1101 no L2, no L1 IP | 2.81M | 2.86M | 504.13M |
0x5 0b0101 no L2, L1 IP yes | 2.48M | 2.43M | 81.48M |
0xc 0b1100 L2 yes, no L1 IP | 218.04M | 3.88M | 231.1M |
0x4 0b0100 L2 yes, L1 IP yes | 25.86M | 0.796M | 32.02M |
And a table of ratios:

- `mem_bound_stalls.load_l2_hit / mem_load_uops_retired.l2_hit`
- `mem_bound_stalls.load_llc_hit / mem_load_uops_retired.l3_hit`
- `mem_bound_stalls.load_dram_hit / mem_load_uops_retired.dram_hit`
prefetch state | L2 ratio | LLC ratio | DRAM ratio |
---|---|---|---|
0xd 0b1101 no L2, no L1 IP | 11.5 | 32.76 | 29.6 |
0x5 0b0101 no L2, L1 IP yes | 17.38 | 33.97 | 156.7 |
0xc 0b1100 L2 yes, no L1 IP | 4.76 | 4.85 | 25.4 |
0x4 0b0100 L2 yes, L1 IP yes | 10.17 | 24.23 | 190.7 |
The HW prefetchers in L2 are very important for this; you're overlooking their role in memory-level parallelism for sequential loads/stores. Otherwise you would indeed have a worse bandwidth bottleneck if just limited by the core proper's ability to track in-flight load uops and cache lines.
Each load uop has one load buffer entry allocated for it when it issues (Intel terminology) into the back-end. And yes, on the Silvermont family including Gracemont, the max width of a single load uop is 128 bits. I'm not sure if load buffer entries are just for tracking completion and ordering, or if they have space to actually hold the data.
In Intel P-cores at least, cache-miss loads don't need to re-run (get replayed) on an execution port to fetch the data after it arrives; dependent uops just wake up and have the data forwarded to them, even if their execution port is busy when the data arrives so they have to wake up a few cycles later (I haven't tested that). Cache-line-split loads don't need to get replayed (only 1 total count of `uops_dispatched`), although they might take a second cycle in the execution unit later, like after the first try reaches the end of the pipelined load unit. (Dependent uops that use the load result actually do get replayed, though, as they're eagerly sent to execution units in anticipation of data arriving from an L2 cache hit this cycle, then again when an L3 cache hit could be anticipated, then retrying until success. I wouldn't be surprised if Gracemont is less aggressive about that, though. Its low-power Silvermont ancestry makes me guess a more conservative strategy is likely, like dispatching uops to execution units the cycle after the inputs are known to be ready, as indicated by load buffers.)
So the data has to go somewhere when it arrives besides into L1d cache. A PRF (physical register file) entry is a possibility. Something would have to write it there, but it does need to somehow get into a PRF entry for loads like `movdqu xmm0, [rsi]`; long after that retires and frees any resources (like a load buffer) that had been allocated for it, reading XMM0 still needs to get the load result from the PRF. And we know that load uops are deallocated from the scheduler very early, so it's not a load uop running on a load execution unit that does the PRF write, at least not for cache-miss loads.
Loads like `addps xmm1, [rsi]` only have the load read by the ALU uop part of the same instruction, probably by using a microcode-use-only register. (We know those exist and are renamed by the RAT, and take up space in the PRFs, accounting for a discrepancy in measured capacity vs. vendor-published capacity; that specific article mentions some discrepancies but doesn't make any assumptions about them.)
My best guess at a design that seems plausible (or a mental model that's hopefully consistent with any testable predictions it makes, although I haven't tried to confirm this on Skylake, let alone Gracemont): When a cache line of data arrives, any load-buffer entries that match it grab the chunk of it they want and write it to a PRF, and make it available on the forwarding network. For L1d hits, this probably happens from the load execution unit itself.
I'm now having doubts about how data arriving after a cache miss is handled; it might need a cycle on a load port to do some of the work, since we could potentially have had all 80 loads on the same cache line, and 80 PRF writes in the same cycle when that cache line arrives is obviously impossible. The PRFs have a lot of write ports, but nowhere near 80. And it's implausible that every load-buffer entry could have its own connections to the forwarding networks. So maybe load-buffer entries store the address/width info needed for a load execution unit to come back to this load if it didn't hit in L1d. It doesn't need a uop to be sent to it from the scheduler (RS) for that, but I wouldn't be surprised if it did need to steal a cycle on a load port. This could be tested with a microbenchmark that sometimes misses to L2 but mostly sustains 2/clock L1d hits, and see how much throughput an L1d miss costs even with an L2 hit.
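A rough C sketch of that kind of test (my own illustration, untested; the buffer sizes assume a 32K L1d and an L2 of a couple of MB, and this loop won't hit exactly 2 loads/clock without unrolling, but it shows the shape of the experiment):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define HOT_BYTES   (16 * 1024)    // comfortably inside L1d (assumed 32K)
#define WARM_BYTES  (1024 * 1024)  // much bigger than L1d, but fits in L2
#define MISS_PERIOD 64             // one "warm" access per 64 hot ones
#define ITERS       (1u << 26)

static uint64_t hot[HOT_BYTES / 8];
static uint64_t warm[WARM_BYTES / 8];

int main(void) {
    uint64_t sum = 0;
    size_t h = 0, w = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint32_t i = 0; i < ITERS; i++) {
        sum += hot[h];                       // should stay an L1d hit
        h = (h + 8) % (HOT_BYTES / 8);       // next cache line, wraps inside L1d
        if (i % MISS_PERIOD == 0) {
            sum += warm[w];                  // should miss L1d but hit L2
            w = (w + 8) % (WARM_BYTES / 8);  // stride 1 line through warm[]
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("sum=%llu  %.3f ns/iteration\n", (unsigned long long)sum, ns / ITERS);
    return 0;
}
```

Sweeping `MISS_PERIOD` and comparing the per-iteration time would show how much load-port throughput each L1d-miss / L2-hit costs.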
> And which resource is typically the most precious, i.e. what can typically become the bottleneck that limits how many transactions stay in flight concurrently, reduces the latency-hiding effect, and limits the memory bandwidth utilization of a single CPU core?
L2 superqueue entries (1 per off-core cache-line request, or in the case of an E-core cluster, 1 per off-cluster request). And/or maybe Line Fill Buffers (1 per cache line that L1d is waiting to receive or send; in the send case, it's waiting for an acknowledgement that something else has received the line well enough that we can stop tracking it).
With Gracemont, for sequential reads you need 4 load buffer entries per cache line since the hardware is only 128-bit wide, splitting even an AVX load like `vmovdqa ymm, [rdi]` into 2 load uops. Unlike with Haswell and later where max bandwidth with AVX involves just 2 loads and/or 2 stores per cache line (LFB / superqueue entry). But even so, Gracemont's 80 load buffer entries are enough for a single core to keep 20 LFBs busy, and I don't think even contemporary P cores have that many. (Skylake has 12 LFBs, up from 10 in Haswell, for example. IIRC, Skylake's L2 superqueue is something like 16 entries.)
The superqueue has more entries than there are LFBs because the L2 is where the most important HW prefetcher lives. The streamer detects sequential and strided access patterns (within the same 4K page) in requests made to L2, and (if there are spare superqueue entries) sends out requests for later cache lines.
This means the LFBs don't have to hide all the latency of getting a cache line all the way from DRAM to the core proper.
> So, do they mean that when max reordering capacity is needed (full ROB of micro-ops with memory accesses), [...]
>
> But typically ROB is just bigger than the load/store queues, so that it is never a bottleneck for the bandwidth?
Correct, we can't get into that state; Gracemont's 256-entry ROB is much larger than the sum of load + store buffer sizes (80 + 50 = 130). A load or store buffer entry has to be allocated for a uop when it issues from the front-end to the back-end. The alloc/rename stage (part of issue) allocates any/all back-end resources needed for a uop of that type. (Except for things like split-load buffers which are only allocated inside a load execution unit when a cache-line split is discovered once the address is known.)
> The ROB slot is not freed immediately, because the ROB has to wait for the retirement/completion of the memory access micro-op, in case the memory access raises errors
Actually because of x86's memory model being strongly-ordered, and the possibility of needing to roll back due to memory-order mis-speculation between any two loads.
On ARM for example, LoadStore reordering is possible even on out-of-order exec CPUs (How is load->store reordering possible with in-order commit?), not just in-order exec with scoreboarding of loads. Stores can't commit from the store buffer to L1d until after retirement, so that means loads must also be taking their data from cache after they retire from the ROB. (That would mean load buffer entries can't be reclaimed at retirement, complicating allocation if it's not just a circular buffer, unless they do only free them in order.) "Executing" a load means doing the address math and verifying that it's non-faulting, but weakly-ordered ISAs can let it retire from the ROB after that, with just a load buffer entry waiting for the data and signalling completion to anything waiting to use the load result. (For years I'd assumed loads would need to fully complete before they could retire from the ROB, but they don't in general. Only on x86.)
> I wonder whether it is different for the prefetch instruction?
Yes, even on x86, SW prefetch instructions are fire and forget. They can fully retire from the ROB without waiting for the cache line to arrive, so they don't stall execution.
> It could also trigger more memory traffic (more cache lines transferred) than a usual micro-op of a regular width
No, SW prefetch is like a 1-byte load. Note how Intel's manual documents `PREFETCHT1 m8`, where `m8` is an 8-bit memory operand. This means cache-line and page splits are impossible for a single prefetch instruction, simplifying the hardware. And you can use SW prefetch on any byte of an object to pull in one of the cache lines it's part of, the one starting at `addr & -64` (assuming a line size of 64, of course).
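In C you'd typically emit those via the prefetch builtin, e.g. something like this (illustrative sketch; the 8-lines-ahead distance is an arbitrary choice, and `__builtin_prefetch`'s locality argument of 2 maps roughly to PREFETCHT1 on GCC):

```c
#include <stddef.h>
#include <stdint.h>

// Sketch: prefetch a cache line ahead of the demand loads by touching an
// arbitrary byte of it. The hardware fetches the whole line containing that
// address, i.e. the one starting at (addr & -64).
uint64_t sum_with_prefetch(const uint64_t *a, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        // Prefetch 8 cache lines (512 bytes) ahead; rw=0 (read),
        // locality=2 (~PREFETCHT1). Prefetching past the end of the
        // array is harmless: prefetch hints never fault.
        __builtin_prefetch((const char *)&a[i] + 8 * 64, 0, 2);
        sum += a[i];
    }
    return sum;
}
```

In real code you'd issue one prefetch per cache line rather than one per element, but one per element keeps the sketch short.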
You can touch 1 cache line per uop with `movzx eax, byte [rsi]` / `movzx eax, byte [rsi+64]`, etc. Bandwidth will be limited purely by LFBs and/or the superqueue, not clogging up the core tracking multiple loads per cache line.
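A C-level equivalent might look like the following (sketch; the `volatile` qualifier is there so the compiler keeps it as one byte load per line instead of widening or vectorizing the accesses):

```c
#include <stddef.h>
#include <stdint.h>

// Touch each 64B cache line exactly once with a 1-byte load, so roughly one
// load uop (and one LFB / superqueue entry) is spent per line, instead of
// 4+ load uops per line as with full-width sequential reads on Gracemont.
unsigned touch_lines(const volatile uint8_t *p, size_t bytes) {
    unsigned sum = 0;
    for (size_t off = 0; off < bytes; off += 64)
        sum += p[off];   // compiles to a byte load, e.g. movzx
    return sum;
}
```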
Also related: