I have been reading about pipeline optimizations recently and wanted to ask whether I understand correctly how a processor handles pipelining.
Here is the C++ code for a simple test program:
#include <algorithm>
#include <vector>

int main()
{
    std::vector<int> vec(10000u);
    std::fill(vec.begin(), vec.end(), 0);
    for (unsigned i = 0u; i < vec.size(); ++i)
    {
        vec[i] = 5;
    }
    return 0;
}
And here is the part of the assembly code produced for the for loop:
...
00007FF6A4521080 inc edx
{
vec[i] = 5;
00007FF6A4521082 mov dword ptr [rcx+rax*4],5
00007FF6A4521089 mov eax,edx
00007FF6A452108B cmp rax,r9
00007FF6A452108E jb main+80h (07FF6A4521080h)
}
...
In the program, the vector "vec" is allocated with a constant size and filled with zeros. The important "work" happens in the for loop, where every element of the vector is assigned 5 (just a random value).
I want to ask whether this assembly code causes any stalls in the pipeline, given that all the instructions are correlated and work on the same registers. For example, would the pipeline have to wait with the cmp rax, r9 instruction until mov eax, edx actually assigns a value to eax/rax?
Looping 10000 times is where branch prediction should come into play. The jb instruction jumps 10000 times and only falls through at the very end, which means the branch predictor should very easily predict that the jump is taken most of the time. However, from my perspective this optimization would be pointless if the code itself stalls inside the loop.
My target architecture is a Skylake i5-6400.
TL;DR:
Case 1: A buffer that fits in the L1D. The vector constructor or the call to std::fill will place the buffer fully in the L1D. In this case, the 1 store per cycle throughput of the pipeline and the L1D cache are the bottleneck.
Case 2: A buffer that fits in the L2. The vector constructor or the call to std::fill will place the buffer fully in the L2. However, the L1D has to write dirty lines back to the L2, and there is only one port between the L1D and the L2. In addition, the lines have to be fetched from the L2 into the L1D. The 64B/cycle bandwidth between the L1D and the L2 should be able to handle that easily, perhaps with occasional contention (see below for more details), so overall the bottleneck is the same as in Case 1. The particular buffer size you used, about 40KB, doesn't fit in the L1D of Intel and recent AMD processors, but does fit in the L2. With Simultaneous Multi-Threading (SMT), though, there might be some additional contention from the other logical core.
Case 3: A buffer that doesn't fit in the L2. The lines need to be fetched from the L3 or from memory. The L2 DPL prefetcher can track the stores and prefetch the buffer into the L2, thereby alleviating the long latency. The single L2 port, together with the L1 writeback and fill buffers, is the bottleneck. It's severe, especially when the buffer doesn't fit in the L3, where the interconnect can also be on the critical path. The 1 store per cycle throughput is too much for the cache subsystem to handle. The two most relevant performance counters are L1D_PEND_MISS.REQUEST_FB_FULL and RESOURCE_STALLS.SB. (A small sketch for reproducing the three cases follows.)
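If you want to see the three cases in practice, here is a minimal timing sketch, not a rigorous benchmark. The element counts are my own illustrative assumptions for a Skylake client part such as the i5-6400 (32 KB L1D, 256 KB L2, 6 MB L3); adjust them for your machine.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Times only the explicit store loop; the constructor's zeroing is excluded.
static void fill_and_time(std::size_t n)
{
    std::vector<int> vec(n);                      // zeroed by the constructor (memset)
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < vec.size(); ++i)  // the store loop under discussion
        vec[i] = 5;
    const auto stop = std::chrono::steady_clock::now();
    const double ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    // Reading vec.back() keeps the stores observable; a serious benchmark would
    // also pin the thread, warm up, and repeat the measurement many times.
    std::printf("%zu elements (%zu KB): %.3f ns/element (last = %d)\n",
                n, n * sizeof(int) / 1024, ns / n, vec.back());
}

int main()
{
    fill_and_time(4000);     // ~16 KB: Case 1, fits in a 32 KB L1D
    fill_and_time(10000);    // ~40 KB: Case 2, the size from the question; fits in the L2
    fill_and_time(4000000);  // ~16 MB: Case 3, spills past the L2 (and a ~6 MB L3)
    return 0;
}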
First, note that the constructor of vector (which will probably get inlined) itself initializes the elements to zero by calling memset internally. memset basically does the same thing as your loop, but it is highly optimized. In other words, in terms of big-O notation, both are linear in the number of elements, but memset has a smaller constant factor. In addition, std::fill also internally calls memset to set all elements to zero (again). std::fill will also probably get inlined (with proper optimizations enabled). Therefore, you really have three loops in that piece of code. It would be more efficient to initialize your vector using std::vector<int> vec(10000u, 5), as the sketch below illustrates.
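Here is a small sketch contrasting the two; the first three statements are the three passes from your program, and the single-pass version follows.

#include <algorithm>
#include <vector>

int main()
{
    // Three passes over the same 40 KB buffer:
    std::vector<int> a(10000u);              // pass 1: constructor zero-fills via memset
    std::fill(a.begin(), a.end(), 0);        // pass 2: std::fill calls memset again
    for (unsigned i = 0u; i < a.size(); ++i) // pass 3: the explicit store loop
        a[i] = 5;

    // One pass: every element is constructed directly with the value 5.
    std::vector<int> b(10000u, 5);

    return a.back() + b.back();              // use the data so it isn't optimized away
}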
Now let's get to the microarchitectural analysis of the loop. I'll only discuss what I expect to happen on modern Intel processors, specifically Haswell and Skylake(1).
Let's examine the code carefully:
00007FF6A4521080 inc edx
00007FF6A4521082 mov dword ptr [rcx+rax*4],5
00007FF6A4521089 mov eax,edx
00007FF6A452108B cmp rax,r9
00007FF6A452108E jb main+80h (07FF6A4521080h)
The first instruction will be decoded into a single uop. The second instruction will be decoded into two uops that are fused in the frontend. The third instruction is a register-to-register move and is a candidate for move elimination at the register renaming stage. It's hard to know for sure whether the move will get eliminated without running the code(3). But even if it does not get eliminated, the instructions will be dispatched as follows(2):
dispatch cycle                               | allocate cycle
cmp rax,r9 macro-fused                       | inc edx (iteration J+3)
jb main+80h (07FF6A4521080h) (iteration J)   | mov dword ptr [rcx+rax*4],5 (iteration J+3)
mov dword ptr [rcx+rax*4],5 (iteration J+1)  | mov eax,edx (iteration J+3)
mov eax,edx (iteration J+1)                  | cmp rax,r9 macro-fused
inc edx (iteration J+2)                      | jb main+80h (07FF6A4521080h) (iteration J+3)
---------------------------------------------|---------------------------------------------
cmp rax,r9 macro-fused                       | inc edx (iteration J+4)
jb main+80h (07FF6A4521080h) (iteration J+1) | mov dword ptr [rcx+rax*4],5 (iteration J+4)
mov dword ptr [rcx+rax*4],5 (iteration J+2)  | mov eax,edx (iteration J+4)
mov eax,edx (iteration J+2)                  | cmp rax,r9 macro-fused
inc edx (iteration J+3)                      | jb main+80h (07FF6A4521080h) (iteration J+4)
The cmp and jb instructions will get macro-fused into a single uop, so the total number of uops per iteration is 4 in the fused domain and 5 in the unfused domain. There is exactly one jump among them. Therefore, a single loop iteration can be issued per cycle.
Because of the dependency between inc and the mov-store, these two instructions cannot be dispatched in the same cycle. Nonetheless, the inc from a later iteration can be dispatched together with uops from the previous iteration, as the table above shows.
There are four ports (p0, p1, p5, p6) to which inc and mov can be dispatched. There is exactly one port, p6, for the predicted-taken cmp/jb. There are three ports (p2, p3, p7) for the STA uop of mov dword ptr [rcx+rax*4],5 and one port, p4, for the STD uop (although p7 cannot handle the specified addressing mode). Because there is only one port for each of these, the maximum execution throughput that can be achieved is 1 iteration per cycle.
Unfortunately, the throughput will be worse than that: many stores will miss in the L1D. The L1D prefetchers are not capable of prefetching lines in the Exclusive coherence state and don't track store requests. But fortunately, many stores will be combined. Successive stores in the loop target sequential locations in the virtual address space. Since a line is 64 bytes in size and each store is 4 bytes in size, every 16 consecutive stores go to the same cache line. These stores could be combined in the store buffer, but they won't be, because the stores retire as early as possible once they reach the top of the ROB. The loop body is pretty small, so it's very unlikely that more than a few of the 16 stores will get combined in the store buffer. However, when the combined store request is issued to the L1D, it will miss and an LFB will be allocated, which also supports combining stores. The L2 cache DPL prefetcher is capable of tracking RFO requests, so hopefully we will almost always hit in the L2. But it will take at least 10-15 cycles to get the line from the L2 to the L1D. The RFO might be sent early, though, before the store actually gets committed. At the same time, most likely a dirty line needs to be evicted from the L1D to make space for the incoming line to be written to. The evicted line will be written into a writeback buffer.
It's hard to predict what the overall effect will be without running the code. The two most relevant performance counters are L1D_PEND_MISS.REQUEST_FB_FULL and RESOURCE_STALLS.SB.
The L1D has only one store port, which is 16 bytes, 32 bytes, and 64 bytes wide on Ivy Bridge, Haswell, and Skylake, respectively. So stores will be committed at these granularities. But a single LFB can always hold a full 64-byte cache line.
The total number of fused store uops is equal to the number of elements (10,000 in this case). To get the number of LFBs required, divide by 16, which gives 625 LFBs; this is also the number of RFOs to the L2. It will take 16 cycles before another LFB is required because only one store can be dispatched per cycle. As long as the L2 can deliver the target line within 16 cycles, we will never block on the LFBs and the achieved throughput will be close to 1 iteration per cycle, or in terms of IPC, 5 instructions per cycle. This is only possible if we almost always hit in the L2 in a timely manner. Any consistent delay in the cache or memory will significantly reduce the throughput below that. It might go something like this: a burst of 16 iterations executes quickly, then the pipe stalls on the LFBs for some number of cycles. If this number is equal to the L3 latency (about 48 cycles), then the throughput would be around 1 iteration per 3 cycles (= 16/48).
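For reference, the arithmetic in that paragraph can be written down as a tiny back-of-envelope model; the 48-cycle figure is the rough L3 latency assumed above, not a measurement.

#include <cstdio>

int main()
{
    const long elements        = 10000;                      // vector size in the question
    const long store_bytes     = 4;                          // sizeof(int)
    const long line_bytes      = 64;                         // cache line size
    const long stores_per_line = line_bytes / store_bytes;   // 16
    const long lines           = elements / stores_per_line; // 625 = LFBs = RFOs to the L2

    std::printf("stores per line: %ld\n", stores_per_line);
    std::printf("lines touched (= LFBs = RFOs): %ld\n", lines);

    // Best case: each line arrives from the L2 within the 16 cycles it takes to
    // issue the next 16 stores, so the loop sustains about 1 iteration per cycle.
    // If instead every line costs an L3-like delay of ~48 cycles, throughput drops
    // to roughly stores_per_line / delay iterations per cycle (~1 per 3 cycles).
    const double delay_cycles = 48.0;                        // assumed, not measured
    std::printf("throughput with %.0f-cycle misses: ~%.2f iterations/cycle\n",
                delay_cycles, stores_per_line / delay_cycles);
    return 0;
}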
The L1D has a limited number (6?) of writeback buffers to hold evicted lines. In addition, the L2 has only a single 64-byte port that is used for all communication between the L1D and the L2, including writebacks and RFOs. The availability of writeback buffers may be on the critical path as well. In that case, the number of LFBs will also become a bottleneck, because an LFB will not be written into the cache until a writeback buffer is available. If one is not available, the LFBs will fill up quickly, especially if the L2 DPL prefetcher was able to deliver the lines in a timely manner. Obviously, streaming cacheable WB stores into the L1D like this is very inefficient.
If you do run the code, you also need to account for the two calls to memset.
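If you only want cycles for the store loop itself, one option is a sketch like the following, which uses the __rdtsc compiler intrinsic (available in MSVC's <intrin.h> and GCC/Clang's <x86intrin.h>) to time-stamp just the loop so that the two memset passes stay outside the measured region.

#include <algorithm>
#include <cstdio>
#include <vector>
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

int main()
{
    std::vector<int> vec(10000u);           // memset #1: the constructor zero-fills
    std::fill(vec.begin(), vec.end(), 0);   // memset #2: std::fill

    const unsigned long long t0 = __rdtsc();
    for (unsigned i = 0u; i < vec.size(); ++i)
        vec[i] = 5;
    const unsigned long long t1 = __rdtsc();

    // __rdtsc() counts reference cycles, not core cycles, and isn't serializing;
    // a careful measurement would fence around it and repeat many times.
    std::printf("~%.2f reference cycles per element (last = %d)\n",
                static_cast<double>(t1 - t0) / vec.size(), vec.back());
    return 0;
}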
(1) On Sandy Bridge and Ivy Bridge, the instruction mov dword ptr [rcx+rax*4],5 will get unlaminated, resulting in 5 uops per iteration in the fused domain. So the frontend may be on the critical path.
(2) Or something like that, depending on whether the first instruction of the first iteration of the loop gets the first slot of the allocator. If not, the iteration numbers shown need to be shifted accordingly.
(3) @PeterCordes found that move elimination does occur most of the time on Skylake. I can also confirm this on Haswell.