I want to prefetch data into the L1 cache and perform other work while waiting for the data to arrive, to avoid stalling the loop. Is there a way to determine if the prefetched data has arrived in the L1 cache? This would allow me to continue with the main processing until the data is available.
I believe such a mechanism must exist because modern CPUs need to efficiently switch threads when the pipeline stalls, and knowing the cache status would be crucial for such operations.
Here is pseudocode of what I am trying to achieve:
#include <immintrin.h>
#include <cstddef>

void processData(int* data, size_t size) {
    // Prefetch the next data chunk (placeholder: the start of data)
    _mm_prefetch((const char*)data, _MM_HINT_T0);

    for (size_t i = 0; i < size; ++i) {
        // Check if the prefetched data has arrived in the L1 cache
        // if (data_arrived_in_L1_cache()) {   // hypothetical check
        //     // Process the data
        //     process(data[i]);
        // } else {
        //     // Do other work
        //     doOtherWork();
        // }
    }
}
Is there a specific flag or instruction that can be used to check if the prefetched data has arrived in the L1 cache?
Alternatively, is there any way to detect if the pipeline has stalled without stopping execution? I assume debuggers and profiling tools use such mechanisms.
Maybe there are some C++ intrinsics or other low-level techniques that can help achieve this?
Profiling software typically has ways to detect cache misses and hits. Is there a way to repurpose these mechanisms for my use case?
No, there isn't an instruction you can run to query the cache status of an address.
Tuning prefetch distance (prefetch `data[i+distance]` while processing `data[i]`) is unfortunately a matter of putting a fixed amount of `doOtherWork` between the load and the use of the load.
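For example, a fixed prefetch distance typically looks like the sketch below. This is not your actual workload: `process`, `processWithPrefetch`, and the 256-element distance are placeholders.

```cpp
#include <immintrin.h>
#include <cstddef>

void process(int value);  // stand-in for the real work from the question's pseudocode

// How far ahead to prefetch, in elements. This is the knob you tune by
// benchmarking; there's no way to ask at run time whether it was far enough.
constexpr size_t PREFETCH_DISTANCE = 256;  // placeholder value, system-specific

void processWithPrefetch(int* data, size_t size) {
    for (size_t i = 0; i < size; ++i) {
        // One prefetch per 64-byte cache line (16 ints) is plenty; extra
        // prefetches of the same line just cost front-end bandwidth.
        if (i % 16 == 0 && i + PREFETCH_DISTANCE < size)
            _mm_prefetch((const char*)&data[i + PREFETCH_DISTANCE], _MM_HINT_T0);
        process(data[i]);  // the fixed amount of work between prefetch and use
    }
}
```

With this structure, trying NT prefetch is just a matter of passing `_MM_HINT_NTA` instead of `_MM_HINT_T0`.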
Keep in mind that memory-level parallelism is a thing (e.g. Skylake can have up to 12 cache lines in flight to/from L1d cache, and out-of-order exec can get later loads started while older loads are still waiting). Also, the HW prefetchers into L2 and L1d cache will already be close to optimal for simple cases like this (sequential access); they'll have data for the later cache lines flowing into the caches while `process(data[i])` is happening. See How much of 'What Every Programmer Should Know About Memory' is still valid? - HW prefetch has come a long way since the original article was written. (But you should still read the original if you haven't.)
If your computational intensity is high enough (amount of work done per byte loaded into registers or into L1d cache), HW prefetch will just keep up with what you're doing. You can raise computational intensity e.g. by doing more work on each pass over your data while it's still in registers, or by cache-blocking so you re-read data you used recently and get L1d hits.
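To illustrate cache-blocking (a minimal sketch: `combineBlocked`, the multiply-accumulate "work", and the 16 KiB block size are invented for this example, not from your code), the idea is to re-read a small chunk while it's still hot in L1d instead of streaming over a whole large array every time:

```cpp
#include <cstddef>

// Hypothetical kernel: combine every element of 'a' with every element of 'b'.
// Blocking the 'b' loop keeps a ~16 KiB chunk of 'b' hot in L1d while the
// 'a' loop sweeps over it, so 'b' is re-read from cache instead of DRAM.
long long combineBlocked(const int* a, size_t na, const int* b, size_t nb) {
    constexpr size_t BLOCK = 4096;  // 4096 ints = 16 KiB; placeholder, tune to fit L1d
    long long total = 0;
    for (size_t jb = 0; jb < nb; jb += BLOCK) {
        size_t jend = (jb + BLOCK < nb) ? jb + BLOCK : nb;
        for (size_t i = 0; i < na; ++i) {
            for (size_t j = jb; j < jend; ++j) {
                total += (long long)a[i] * b[j];  // stand-in for real per-pair work
            }
        }
    }
    return total;
}
```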
Tuning software prefetch is unfortunately difficult and system-specific, and can even depend on competition from other cores for memory bandwidth. It would be nice if there were a low-overhead way to make prefetch distance dynamic like this, but I don't think there is. That's especially true for NT prefetch, which doesn't "pollute" L2 cache by leaving a copy there (or L3 cache on systems where it's not inclusive). So if you prefetch too far ahead and the data is evicted before your code uses it, you get another cache miss all the way to L3 or DRAM.
There are PMU event counters you can use on the whole loop (most easily by isolating the loop into a whole program that is just a microbenchmark), like `mem_load_retired.l1_miss` (which exists on my Intel Skylake using Linux `perf`; available events and their names differ by microarchitecture). If you're trying to prefetch early enough for data to be in L1d, you want to keep increasing prefetch distance until `l1_miss` counts drop.
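A rough sketch of that workflow, assuming Linux `perf` and the Skylake event name from above (the file name, array size, and trivial summing "work" are made up; on other CPUs the event will be spelled differently, see `perf list`):

```cpp
// prefetch_bench.cpp -- isolate just the loop so the PMU counts mostly reflect it.
// Build and measure with something like:
//   g++ -O2 prefetch_bench.cpp -o prefetch_bench
//   perf stat -e mem_load_retired.l1_miss,cycles ./prefetch_bench 256
#include <immintrin.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv) {
    size_t distance = argc > 1 ? std::strtoull(argv[1], nullptr, 10) : 64;
    std::vector<int> data(1 << 26, 1);  // 256 MiB of ints, much bigger than L3
    long long sum = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        if (i % 16 == 0 && i + distance < data.size())  // one prefetch per cache line
            _mm_prefetch((const char*)&data[i + distance], _MM_HINT_T0);
        sum += data[i];  // trivial stand-in work
    }
    std::printf("%lld\n", sum);  // keep the loop from optimizing away
    return 0;
}
```

Re-run with different distances (the command-line argument) and watch whether the `l1_miss` count drops.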
Another relevant event is `cycle_activity.stalls_l3_miss`, which counts cycles (or just starts of stalls, not every cycle of each stall?) that happen while the core is waiting for a load result that missed in L3. (A "demand" miss is one that isn't from the prefetchers.)
A stall means the ReOrder Buffer (ROB) filled up, or some other back-end resource such as load-buffer entries ran out, so the front-end couldn't "issue" (non-Intel terminology: "dispatch") any more instructions / uops into the out-of-order back-end. The ROB is a circular buffer that instructions issue into and retire from in program order. (The scheduler or schedulers are separate, and only track instructions that haven't yet executed; their entries can be freed out-of-order.)

Other independent work can still have made progress during the stall (being executed and becoming ready to retire), but one instruction that can't retire (e.g. a cache-miss load) blocks retirement, since retirement has to happen in program order for precise exceptions (rolling back to a consistent state at the faulting instruction). When instructions aren't leaving the ROB, the core will eventually stall because there's no room for new instructions to enter.
But one cache miss doesn't stall the whole core; one of the major reasons for doing out-of-order exec is to hide cache-miss latency.
> because modern CPUs need to efficiently switch threads when the pipeline stalls,
Modern CPUs with multiple logical cores per physical core use fine-grained SMT, not switch-on-stall. See Wikipedia and https://www.realworldtech.com/alpha-ev8-smt/ (a good article about the first implementation of SMT). Also, Modern Microprocessors: A 90-Minute Guide! has a section about SMT.
The front-end alternates cycles between logical cores unless one is stalled (either from its share of the back-end being full, or from being stalled in the front-end, e.g. on an I-cache miss, or stalled in the middle of branch-mispredict recovery). Anyway, a stall doesn't happen until the ROB (reorder buffer) is full, so individual cache misses aren't something the front-end would care about.