Does the hardware prefetcher operate on contiguous virtual addresses or on contiguous physical addresses? Consider a large array of bytes that spans multiple pages. In the virtual address space the bytes are contiguous, but the pages backing them may be scattered across physical memory. I would hope that the prefetcher translates addresses through the TLB before it starts bringing in cache lines that belong to the next page.
Is this so? I couldn't find information that confirmed this and was hoping someone could give more insight.
I'm asking mainly about x86, but any insight would be appreciated.
I can't answer this for AMD processors, but I can answer it for Intel ones.
As far as I know, the hardware prefetcher(s) should not prefetch cache lines across page boundaries on current Intel processors.
From Intel's Intel® 64 and IA-32 Architectures Optimization Reference Manual, section 7.5.2, Hardware Prefetch:
Automatic hardware prefetch can bring cache lines into the unified last-level cache based on prior data misses. It will attempt to prefetch two cache lines ahead of the prefetch stream. Characteristics of the hardware prefetcher are:
- [...]
- It will not prefetch across a 4-KByte page boundary. A program has to initiate demand loads for the new page before the hardware prefetcher starts prefetching from the new page.
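To make the 4-KByte limit concrete, here is a minimal sketch of the address arithmetic involved. It relies on the fact that with 4-KByte pages the low 12 bits of an address (the page offset) are the same in the virtual and the physical address, so you can tell from a virtual address alone how far a stream can run before it reaches the page boundary. The helper names and the 64-byte cache line size are my own assumptions, not something taken from the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K ((uintptr_t)4096)
#define CACHE_LINE   ((uintptr_t)64)   /* assumption: 64-byte cache lines */

/* Index of the 4-KByte page that contains addr. */
uintptr_t page_index(uintptr_t addr)
{
    return addr / PAGE_SIZE_4K;
}

/* Nonzero if both addresses fall inside the same 4-KByte page. */
int same_4k_page(uintptr_t a, uintptr_t b)
{
    return page_index(a) == page_index(b);
}

/* Number of cache lines that follow the line containing addr
 * before the end of its 4-KByte page. A prefetcher that stays
 * within the page can run at most this many lines ahead. */
size_t lines_left_in_page(uintptr_t addr)
{
    /* The mask below only works for a power-of-two page size. */
    assert((PAGE_SIZE_4K & (PAGE_SIZE_4K - 1)) == 0);
    uintptr_t offset = addr & (PAGE_SIZE_4K - 1);
    return (size_t)((PAGE_SIZE_4K - 1 - offset) / CACHE_LINE);
}
```

For example, `lines_left_in_page` returns 63 for an address at the start of a page and 0 for an address in the last cache line of a page. In the second case, going by the quoted text, the prefetcher would have nothing left to fetch until a demand load touches the next page.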
The paragraph above talks about the "unified last-level cache", but things are no better for the L1 data cache:
2.3.5.4, Data Prefetching
Data Prefetch to L1 Data Cache
Data prefetching is triggered by load operations when the following conditions are met:
[...]
The prefetched data is within the same 4K byte page as the load instruction that triggered it.
Or in L2:
The following two hardware prefetchers fetched data from memory to the L2 cache and last level cache:
Spatial Prefetcher: [...]
Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.
However, the processor may cache paging information on behalf of prefetches. From Intel's Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, section 4.10.2.3, Details of TLB Use:
The processor may cache translations required for prefetches and for accesses that are a result of speculative execution that would never actually occur in the executed code path.
Volume 3A, 4.10.3.1, Caches for Paging Structures:
The processor may create entries in paging-structure caches for translations required for prefetches and for accesses that are a result of speculative execution that would never actually occur in the executed code path.
I know you asked about hardware prefetching, but software prefetching should let you cross page boundaries for data (not instructions):
In older microarchitectures, PREFETCH causing a Data Translation Lookaside Buffer (DTLB) miss would be dropped. In processors based on Nehalem, Westmere, Sandy Bridge, and newer microarchitectures, Intel Core 2 processors, and Intel Atom processors, PREFETCH causing a DTLB miss can be fetched across a page boundary.
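As a rough illustration of software prefetching over a multi-page array, here is a minimal sketch. It uses `__builtin_prefetch`, a GCC/Clang builtin that typically compiles to a PREFETCH instruction on x86. The function name, the 512-byte prefetch distance, and the 64-byte line size are assumptions I picked for the example and would need tuning on real hardware. I have not measured whether this helps on any particular processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE         64   /* assumption: 64-byte cache lines */
#define PREFETCH_DISTANCE 512  /* assumption: bytes to run ahead, needs tuning */

/* Sum all bytes of buf, issuing one software prefetch per cache line.
 * The prefetch target is not restricted to the current 4-KByte page,
 * so near the end of a page it refers to an address in the next one. */
uint64_t sum_with_prefetch(const unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);

    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Only prefetch addresses inside the array, so the pointer
         * arithmetic stays within bounds. */
        if (i % LINE_SIZE == 0 && n - i > PREFETCH_DISTANCE) {
            /* rw = 0: prefetch for reading; locality = 3: keep in all cache levels. */
            __builtin_prefetch(buf + i + PREFETCH_DISTANCE, 0, 3);
        }
        sum += buf[i];
    }
    return sum;
}
```

Calling `sum_with_prefetch(buf, len)` on a buffer of several pages issues prefetches whose targets lie in the following page whenever `i` is within `PREFETCH_DISTANCE` bytes of a page boundary. Whether such a prefetch is dropped on a DTLB miss depends on the microarchitecture, as the quoted passage describes.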