cpu, cpu-architecture, intel, cpu-cache, prefetch

Does the last level cache see the PC?


I was recently reading some papers on cache data prefetching. I found that prefetch techniques implemented in the LLC need to record the PC of the memory access instruction that triggered the access. But does the LLC ever see the PC? The LSU sends only the physical address to the L1 cache.

If an L1 cache access misses, the address is sent to a lower cache level. How does the PC get passed down?

Thanks!!!


Solution

  • No, in most cases it would be extremely inefficient to waste area and power on additional buses to send the PC, or to waste bits in the cache to store it, all for the relatively small IPC gain it could provide. It might make sense in academic work, which is usually not restricted by super-realistic performance models and design constraints, but industry does have these restrictions, and I'm not familiar with any CPU that does that.

    Note that you're confusing two different things - a cache line usually has very few attributes stored with it, since cache capacity is critical and every bit is multiplied by thousands of lines (especially in the larger caches). You need the data, tag, ECC (error-correction bits), MESI(F/O/..) state, and perhaps a few bits for performance-related attributes such as dead-block prediction. An eviction can maintain or drop some of this information when moving the line to the next cache level, but this information is not needed for prefetching, since prefetching is usually not affected by eviction (although I admit an eviction-triggered prefetcher would be an interesting idea to think about for some scenarios). Either way, you don't want to record full PCs (or even shortened versions) - the area impact wouldn't make sense.
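    To see why storing the PC per line doesn't pay off, here's a back-of-the-envelope sketch. All parameters are assumptions for illustration (a 2 MiB cache, 64-byte lines, a 48-bit PC), not numbers from any real CPU:

```python
# Cost of storing a full PC alongside every cache line (assumed parameters).
LINE_SIZE = 64                  # bytes per cache line
CACHE_CAPACITY = 2 * 1024 * 1024  # assumed 2 MiB mid-level cache
PC_BITS = 48                    # assumed virtual address width of a full PC

num_lines = CACHE_CAPACITY // LINE_SIZE          # lines that each need a PC
overhead_bytes = num_lines * PC_BITS // 8        # extra SRAM just for PCs
overhead_pct = 100 * overhead_bytes / CACHE_CAPACITY

print(f"{num_lines} lines -> {overhead_bytes // 1024} KiB of PC storage "
      f"({overhead_pct:.1f}% of data capacity)")
```

    Under these assumptions that's 192 KiB of extra SRAM, nearly a tenth of the data array, spent on bits that don't hold program data - which is why designers prefer small hashed signatures or no per-line PC at all.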

    On the other hand, a memory fetch request is what triggers prefetching as it travels to the first cache level where it can find the data. This also means it cannot trigger prefetches in levels it doesn't reach (a cache hit "filters" the request stream). That is usually a desired effect, since it reduces training stress and cleans up the history sequence for the prefetchers, but sometimes it can be detrimental. Since these fetch requests are only alive during the fetch and are stored only in temporary buffers, they do have room for additional attributes to help you manage them. Some of this information can help make better prefetching decisions, so most attributes are indeed sent to the hardware prefetcher, including the PC (some prefetchers, such as Intel's IPP, are PC-based, but that's also an L1 prefetcher). However, not all the data has to go all the way to the lower-level caches - you can pass only what you need, and you usually want to pass as little as possible since you pay for each wire. Crossing the core/SoC boundary is even more painful, since you also have to fit into limited standard protocol packets.
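    A PC-based prefetcher of this kind is typically a stride detector localized by PC. Here's a minimal sketch (table sizes, confidence thresholds, and the PC hashing are all my own assumptions, not Intel's actual IPP design) - note it trains on the request stream, using a truncated PC rather than storing full PCs:

```python
class PCStridePrefetcher:
    """Sketch of a PC-localized stride prefetcher trained on demand requests."""

    def __init__(self, table_size=256):
        self.table_size = table_size
        # hashed/truncated PC -> (last_addr, last_stride, confidence)
        self.table = {}

    def access(self, pc, addr):
        """Train on a (PC, address) request; return a prefetch address or None."""
        idx = pc % self.table_size      # truncated PC index, not the full PC
        entry = self.table.get(idx)
        prefetch = None
        if entry is None:
            self.table[idx] = (addr, 0, 0)
            return None
        last_addr, last_stride, conf = entry
        stride = addr - last_addr
        if stride == last_stride and stride != 0:
            conf = min(conf + 1, 3)     # saturating confidence counter
            if conf >= 2:               # only prefetch once the stride repeats
                prefetch = addr + stride
        else:
            conf = max(conf - 1, 0)
        self.table[idx] = (addr, stride, conf)
        return prefetch


# Usage: one load PC streaming through memory with a 64-byte stride.
pf = PCStridePrefetcher()
pc = 0x400123
results = [pf.access(pc, a) for a in (0, 64, 128, 192)]
print(results)  # warms up, then predicts 192 + 64 = 256
```

    The only per-request state it needs from the core is the (truncated) PC riding along with the address in the request buffer, which is exactly the attribute under discussion.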

    Some academic work on PC-localized cache replacement policies, like SHiP++, Hawkeye, Mockingjay and other ML schemes, proposes adding PC info on top of these requests so it gets to the L2/L3. We're still talking about request attributes - the PC can either be used immediately to affect lower-level prefetchers, or (in the case of the replacement policies above) reside in the temporary request buffers until the request receives its data and writes into the cache. Some of these schemes do require storing some attributes in the cache, but not the full-blown PC. Others use a dedicated array that is smaller and can serve as a lookup table indexed by the PC of the request. As a rule of thumb - the further you send something and the longer you store it, the more you have to justify it.
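    To make the "not the full-blown PC" point concrete, here's a SHiP-flavored sketch: each line carries only a short hashed signature of the PC, and a small side table of saturating counters (the SHCT in SHiP's terminology) is indexed by that signature. The hash function and counter widths below are illustrative assumptions, not the exact published design:

```python
SIG_BITS = 14  # store a short hashed signature per line, not the full PC

def make_signature(pc):
    """Fold a full PC down to SIG_BITS bits (illustrative hash)."""
    return (pc ^ (pc >> SIG_BITS)) & ((1 << SIG_BITS) - 1)

class ShipStylePredictor:
    """Signature-indexed reuse predictor (SHCT-style sketch)."""

    def __init__(self):
        # 2-bit saturating counters, initialized weakly "reused"
        self.shct = [1] * (1 << SIG_BITS)

    def on_hit(self, sig):
        """A line inserted by this signature was re-referenced."""
        self.shct[sig] = min(self.shct[sig] + 1, 3)

    def on_evict_dead(self, sig):
        """A line inserted by this signature was evicted without reuse."""
        self.shct[sig] = max(self.shct[sig] - 1, 0)

    def predicts_reuse(self, sig):
        """Insertion hint: protect the line, or insert at distant position?"""
        return self.shct[sig] > 0
```

    The full table is only 2^14 small counters regardless of cache size, which is the kind of area cost that is much easier to justify than a per-line PC.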