Tags: cpu, hardware, prefetch, micro-architecture

How are instructions fetched in modern CPUs? (2023)


I am learning rocketchip these days, and I have noticed that the IFU (Instruction Fetch Unit) fetches instructions from ibuf instead of main memory. But I have not seen any code showing how instructions get from main memory into ibuf. I consulted some experts and heard terms like icache, dcache, and prefetch. I want to dig into the process.

Can anyone explain the instruction fetch process in modern CPUs? Or are there any books that provide a detailed explanation of how instruction fetching works in modern processors?

Thank you so much for your assistance!

I have found some information online, but I suspect it is piecemeal. Therefore, I would like to learn the entire process systematically.


Solution

  • The exact details of how a particular CPU fetches its instructions are probably behind an NDA, as each processor manufacturer has its own circuit design for the fetch unit, so I cannot comment on a particular CPU. However, at a very high level, the front-end (the stages responsible for instruction fetch and decode) of modern processors consists of pre-fetchers, instruction caches (I-cache), and branch predictors.

    Various CPUs may or may not have these three components, depending on the type of applications they are designed for. For example, a simple processor for a toy may not need these structures and may directly access memory to fetch its instructions. On the other hand, a processor built for high-performance computing tasks may have multiple pre-fetchers and branch predictors along with a potentially multi-level I-cache. So the exact architecture of the front-end depends on what the processor is designed for. For the rest of this answer, I'm assuming you are talking about a processor designed for high-performance or desktop computing. Moreover, please keep in mind that the following explanation may not hold for every processor; it is just a high-level view of things.

    Modern processors, on the outside, follow the Von Neumann architecture, which means they expect a program's data and instructions to be stored in a single memory. The RAM in your computer acts as this memory. The CPU asks the RAM for instructions/data by providing an address, and the RAM returns the binary values stored at that address. Note that the RAM does not distinguish between instructions and data; to the RAM, everything is just a bunch of binary values.

    Once these instructions/data reach the CPU, they first end up in the last-level cache (LLC), which serves as a small but fast storage for the CPU. Next, they are forwarded to the next level of the cache hierarchy, typically the level 2 (L2) cache. Up to the L2 cache, there is no distinction between data and instructions. The L2 cache then forwards them to the level 1 (L1) cache, which is divided into two parts: the data cache (D-cache) and the instruction cache (I-cache). From the L1 cache onwards, the processor follows the Harvard architecture. Once the data reaches the D-cache and the instructions reach the I-cache, the execution unit of the CPU can start accessing them.
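    To make this lookup order concrete, here is a minimal Python sketch of the hierarchy just described: a split L1, a unified L2 and LLC, and RAM at the bottom. The level names, latencies, and fill policy are illustrative assumptions, not parameters of any real CPU.

        # Toy model of the hierarchy: a fetch walks L1 -> L2 -> LLC -> RAM.
        LINE = 64  # assumed cache-line size in bytes

        class Level:
            def __init__(self, name, latency_cycles):
                self.name = name
                self.latency = latency_cycles
                self.lines = set()           # addresses of lines held here

            def holds(self, line_addr):
                return line_addr in self.lines

            def fill(self, line_addr):
                self.lines.add(line_addr)    # no eviction modeled, for brevity

        l1i = Level("L1-I", 4)               # instruction side of the split L1
        l1d = Level("L1-D", 4)               # data side of the split L1
        l2  = Level("L2", 14)                # unified: holds both kinds of lines
        llc = Level("LLC", 40)               # unified last-level cache
        RAM_LATENCY = 200                    # assumed DRAM round-trip, in cycles

        def fetch(addr, is_instruction):
            """Return the total cycles needed to obtain the line holding addr."""
            line_addr = addr // LINE
            l1 = l1i if is_instruction else l1d   # Harvard split starts at L1
            cycles = 0
            for level in (l1, l2, llc):
                cycles += level.latency
                if level.holds(line_addr):
                    break
            else:
                cycles += RAM_LATENCY        # missed everywhere: go to RAM
            for level in (llc, l2, l1):
                level.fill(line_addr)        # fill the line back down the levels
            return cycles

        print(fetch(0x1000, True))   # cold fetch: pays the full miss path (258)
        print(fetch(0x1004, True))   # same line again: L1-I hit (4)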

    The instructions are accessed by querying the I-cache. The I-cache takes the address of an instruction as input and returns the instruction that is supposed to be present at that address. However, even though the I-cache is fast relative to other kinds of memory in a system, it may still take tens of cycles to respond to the execution unit (due to something called cache misses, which are beyond the scope of this explanation). This means the CPU may only be able to execute an instruction every few tens of cycles.
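    For intuition on what "querying the I-cache with an address" means, here is a sketch of a direct-mapped I-cache in Python, showing how an address is split into offset, index, and tag bits. The geometry (64 sets of 64-byte lines, i.e. a 4 KiB cache) is an assumption chosen for illustration.

        NUM_SETS = 64                 # assumed: 64 sets
        LINE_BYTES = 64               # assumed: 64-byte lines -> 4 KiB total

        tags = [None] * NUM_SETS      # one tag per set
        data = [None] * NUM_SETS      # the cached line for each set

        def icache_read(addr, backing_memory):
            """Return (4 instruction bytes at addr, whether it was a hit)."""
            offset = addr % LINE_BYTES
            index = (addr // LINE_BYTES) % NUM_SETS
            tag = addr // (LINE_BYTES * NUM_SETS)
            hit = tags[index] == tag
            if not hit:               # miss: fetch the whole line from below
                start = addr - offset
                data[index] = backing_memory[start:start + LINE_BYTES]
                tags[index] = tag
            return data[index][offset:offset + 4], hit

        memory = bytes(1 << 16)                  # 64 KiB of zeroed "RAM"
        print(icache_read(0x1234, memory)[1])    # False: first access misses
        print(icache_read(0x1234, memory)[1])    # True: now it hits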

    Thus, to mitigate this issue, computer architects devised pre-fetchers. As the name suggests, a pre-fetcher fetches instructions into the I-cache before they are even required. This means that even though the execution unit has not yet accessed a particular address, the pre-fetcher will still make a request for that address to the I-cache. To put it simply, the pre-fetcher tries to predict which instructions will be executed next and get them into the I-cache ahead of time. However, due to their limitations, pre-fetchers are often very bad at predicting certain kinds of instructions.
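    As a sketch of the idea (real pre-fetchers are far more elaborate), here is the simplest scheme, a next-line pre-fetcher: whenever a line of instructions is fetched on demand, the sequentially next line is requested as well, so straight-line code usually hits in the I-cache. The model below works in units of whole cache lines.

        icache = set()                 # line addresses currently cached

        def issue_request(line_addr):
            print(f"requesting line {line_addr:#x}")
            icache.add(line_addr)      # model the fill as instantaneous

        def demand_fetch(line_addr):
            if line_addr not in icache:
                issue_request(line_addr)       # demand miss
            nxt = line_addr + 1
            if nxt not in icache:
                issue_request(nxt)             # prefetch before it is needed

        demand_fetch(0x40)   # misses line 0x40 and prefetches line 0x41
        demand_fetch(0x41)   # already prefetched: no memory traffic

    This works well for sequential code but fails exactly where the next paragraph picks up: when control flow jumps somewhere non-sequential.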

    One example is the instructions that follow a branch instruction. When the execution unit encounters a branch, it must first resolve the branch, i.e. execute the branch's condition, to figure out which direction the program flow will take before it can determine the address of the next instruction. For example, if you have an if condition in your code, until the condition has been computed you do not know which instruction will execute next. Due to the deeply pipelined nature of modern processors, resolving a branch can take tens of cycles, or even hundreds if the branch depends on a load that misses in the caches. This is called the branch penalty. During these cycles, the front-end of the processor is stalled: it cannot fetch any instruction, as it does not know where to fetch from. This makes performance much worse for programs with lots of branches, and branch instructions make up a significant fraction (often around 10-20%) of the instructions in most programs. Therefore, to handle this issue, computer architects designed branch predictors. As the name suggests, these structures try to predict the direction and target of branches before they are resolved. Modern branch predictors are more than 99% accurate for many applications, so modern processors only have to pay the full branch penalty for around 1% of the branch instructions in most programs.
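    As an illustration of the simplest such structure, here is a classic 2-bit saturating-counter predictor in Python (real predictors in modern CPUs, such as TAGE-style designs, are far more sophisticated). The table size is an illustrative assumption.

        TABLE_SIZE = 1024
        counters = [1] * TABLE_SIZE   # 2-bit counters, 0..3; start weakly not-taken

        def predict(pc):
            """Predict taken when the counter is in a 'taken' state (2 or 3)."""
            return counters[(pc >> 2) % TABLE_SIZE] >= 2

        def update(pc, taken):
            """Nudge the counter toward the actual outcome, saturating at 0 and 3."""
            i = (pc >> 2) % TABLE_SIZE
            counters[i] = min(counters[i] + 1, 3) if taken else max(counters[i] - 1, 0)

        # A loop branch taken 9 times and then not taken: after one warm-up
        # misprediction, the predictor is correct until the loop exit.
        pc = 0x400
        for i in range(10):
            outcome = i < 9                      # taken for 9 iterations
            print(predict(pc) == outcome)        # False, True x8, False
            update(pc, outcome)

    With a predicted direction (plus a predicted target, typically supplied by a branch target buffer), the front-end can keep fetching speculatively instead of stalling until the branch resolves.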

    Thus, with the help of branch predictors and pre-fetchers, modern processors are able to ensure that for most of the execution flow, the instructions are already in the I-cache. This, in turn, speeds up the instruction fetch stage and improves the overall performance of the processor.

    Note that I've skipped over a lot of very fascinating details in this explanation to keep it short. If you are interested in this sort of stuff, you may want to look at courses which teach computer architecture. A good book for this subject is Computer Architecture: A Quantitative Approach by John L. Hennessy and David A. Patterson.