How do jumps and call-ret pairs affect the CPU front-end decoder in the best-case scenario, when there are few instructions, they are well cached, and branches are well predicted?
For example, I run a simple test program on an Intel N150 CPU:
// test.cpp
constexpr int N = 256'000'000;
alignas(64) unsigned int AData[N];
constexpr unsigned mask = 0b11011101011;

template <typename T>
T procItem(T item) {
    return item & mask;
}

int main() {
    for (unsigned long long i = 0; i < N; i++) {
        AData[i] = i;
    }

    constexpr int n_proc = 32; // or 64
    for (int i = 0; i < n_proc; i++) {
        asm("" ::: "memory");
        // prevent GCC -O3 from doing 32 and+add iterations
        // to the same element in a register (loop interchange),
        // which bizarrely also stops it from SIMD vectorizing.
        for (auto& item : AData) {
            item += procItem(item);
            //item += item & mask;
        }
    }
}
If I compile it with GCC and -Og optimization, procItem is not inlined in the assembly:
.L7:
movl (%rbx), %ebp
movl %ebp, %edi
call _Z8procItemIjET_S0_
addl %eax, %ebp
movl %ebp, (%rbx)
addq $4, %rbx
.L6:
leaq AData(%rip), %rax
leaq 1024000000(%rax), %rax
cmpq %rax, %rbx
jne .L7
...
_Z8procItemIjET_S0_:
endbr64
movl %edi, %eax
andl $1771, %eax
ret
It runs in about 14-17 seconds, pushing something like 1GB/s over the system memory bus. VTune microarchitecture analysis reports that it is mostly front-end bound (43% of the pipeline slots), and most of that is due to stalls in Decode.
A compilation with -O3 -march=gracemont produces inlined SIMD code that runs in 3.8s at about 20GB/s of memory bus bandwidth. (The bus maximum is 32GB/s.)
Then, if I just comment out the procItem call and uncomment the line that processes the array item directly, so there is no call in the assembly, the -Og compilation runs in 6.3s at 10GB/s. This time VTune reports that it is backend bound (27% of pipeline slots), mainly due to Full Reorder Buffer (17.4%) and non-memory scheduler (7.3%).
.L6:
movl (%rax), %edx
movl %edx, %ecx
andl $1771, %ecx
addl %ecx, %edx
movl %edx, (%rax)
addq $4, %rax
.L5:
leaq AData(%rip), %rdx
leaq 1024000000(%rdx), %rdx
cmpq %rdx, %rax
jne .L6
...
I guess call creates a problem for the front-end, because ret is an indirect jump: it jumps to the return address that is read from the stack through the stack pointer register. Is that correct? Also, call and ret perform a memory access to push and read back that return address. But, I guess, the front-end Decode stalls of about 40% of pipeline slots are due to the indirect jump, not the memory access?
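For reference, as far as the jump target is concerned, ret behaves roughly like an indirect jump whose target is first loaded from the stack. A conceptual sketch in assembly (not what the CPU literally executes, and ignoring the return-address predictor):
        # conceptual equivalent of `ret` (sketch only)
        popq    %r11        # read the return address from (%rsp), then rsp += 8
        jmp     *%r11       # indirect jump to that address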
To test this assumption, I tried to add a small loop with a jmp to a label in place of the call, like this:
# test_inlined_jmp4.s
.L6:
movl (%rax), %edx
mov $0, %r11
.LMYLABLE:
movl %edx, %ecx
andl $1771, %ecx
addl %ecx, %edx
add $1, %r11
cmp $4, %r11
jl .LMYLABLE
movl %edx, (%rax)
addq $4, %rax
.L5:
...
It indeed runs much better than call-ret, with far fewer front-end stalls:
perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode -- ./test_inlined_jmp4
Performance counter stats for './test_inlined_jmp4':
265 093 691 896 instructions # 4,68 insn per cycle (71,43%)
56 668 199 782 cycles (71,43%)
273 745 456 topdown-bad-spec (71,43%)
3 963 180 542 topdown-be-bound (71,43%)
13 350 579 533 topdown-fe-bound (71,43%)
265 615 457 686 topdown-retiring (71,43%)
11 689 808 946 topdown_fe_bound.decode (71,43%)
16,592255607 seconds time elapsed
It is slower simply because it executes more code. But the IPC is way better: most of the pipeline slots retire instructions. (This is a 5-issue CPU, so an IPC close to 5 is pretty good.)
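The instruction count also matches a rough estimate: per array element, the inner block executes about 32 instructions (the load, the r11 reset, four passes of the 6-instruction mini-loop, the store, the pointer increment, and the 4-instruction loop tail), and 32 instructions per element × 256M elements × 32 outer passes ≈ 262G, close to the measured 265G.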
Then I tried to unroll this loop of 4 jumps, keeping the same number of instructions (each jl is replaced with a nop, and the label is commented out):
mov $0, %r11
#.LMYLABLE:
# 0
movl %edx, %ecx
andl $1771, %ecx
.loc 1 21 12 view .LVU13
addl %ecx, %edx
add $1, %r11
cmp $4, %r11
nop
# 1
...same code
# 2
...
# 3
...
And it runs somewhat better, with even fewer front-end stalls:
perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode -- ./test_inlined_unroll
Performance counter stats for './test_inlined_unroll':
265 096 917 837 instructions # 4,87 insn per cycle (71,42%)
54 379 876 310 cycles (71,42%)
280 519 725 topdown-bad-spec (71,43%)
4 725 503 253 topdown-be-bound (71,44%)
1 162 691 302 topdown-fe-bound (71,43%)
265 599 486 048 topdown-retiring (71,43%)
215 563 581 topdown_fe_bound.decode (71,42%)
15,230292928 seconds time elapsed
So, is it correct to conclude the following?

1. Jumps through a register, like the ret of a call-ret pair, stall the front-end decoder? How do the decoder and the branch predictor work in this case? There is no predictor for register-based jumps, right? (Jumping to code via a register used as a pointer seems like an operation with non-trivial consequences.) I.e., following this point from the answer about small branches:

"When the instruction is being decoded to uops and fed into the front-end, register values aren't available; those are only available in the out-of-order execution back-end."

Then a jump via a register (as with ret) must make things much worse for the front-end. Does it stall only the decoder, or other components of the front-end as well?

2. A simple jmp is still slightly slower than unrolling, so it has some small overhead in the decoder? Is that mainly because the branch predictor has to do some work on every branch, even in the best-case scenario? By "best-case scenario" I mean that there are few instructions, they are well cached and correctly predicted, and the Branch Target Buffer is not saturated, as in the slow jmp question?

The second point is similar to the question about small branches, but I would like to clarify it further from this angle: there is no special elision of short-distance branches, short branches have to be speculated like any other branch; but in the best-case scenario, when branches are predicted well and the code is small enough to fit into the L1i cache (and whatever micro-op caches sit inside the front-end), is there still some remaining overhead for the front-end and its decoder in processing the branch instructions (cmp and jmp)? From the PMU measurements above, the overhead seems really negligible compared to unrolling. Can it add up and get worse in some situations? Many branches saturating the BTB (the slow jmp question) is a good example of that.
It looks like, just as the comments suggest, the branching itself is not the cause of most of the performance degradation. And there is indeed not much difference between an unconditional jmp and a call-ret pair, i.e. they are covered by the branch predictor equally well.
In such a small program, the bottleneck comes from how the fetcher works. In my case:

- The inlined code has .p2align 6 in front of the main loop, which sets the first instruction of the loop exactly on a 64-byte cache line boundary. But the not-inlined code, unexpectedly, does not have .p2align 6 in front of its main loop. Just adding .p2align 6 there improves the performance of the not-inlined code from 14s down to 10s. (I think Godbolt does not show this? There is no nopl in front of .L6 here. It probably has a tool to include the padding nops there?)
- On every jmp or call, the fetcher loads the corresponding L1i cache line, even if it is the same line as the currently loaded one.

I compile the code to assembly with GCC 14.2:
# Makefile
%.s: %.cpp
	g++ -std=c++23 -Og -g -S $< -o $@
%: %.s
	g++ -std=c++23 -Og -g $< -o $@
The non-inlined code is not aligned to a 64-byte boundary:
main:
...
movl $0, %r12d
jmp .L5
<no .p2align!>
# manually added alignment directive:
#.p2align 6
.L7:
movl (%rbx), %ebp
movl %ebp, %edi
call _Z8procItemIjET_S0_
...
The performance is pretty bad. The front end is stalled a lot, mostly in "decode" stalls. There are 8B iterations (32 passes over an array of 256M elements), but close to 40B L1i cache loads:
perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode,L1-icache-loads,mem_bound_stalls.ifetch -- ./test
Performance counter stats for './test':
117 654 640 228 instructions # 2,26 insn per cycle (55,55%)
52 001 267 137 cycles (55,55%)
274 424 423 topdown-bad-spec (55,55%)
12 346 611 654 topdown-be-bound (55,56%)
129 060 603 366 topdown-fe-bound (55,56%)
118 201 934 446 topdown-retiring (55,56%)
123 176 125 025 topdown_fe_bound.decode (55,56%)
41 531 138 121 L1-icache-loads (55,56%)
26 774 607 mem_bound_stalls.ifetch (55,55%)
14,566056453 seconds time elapsed
14,243623000 seconds user
0,314947000 seconds sys
If I add .p2align 6 manually, it gets much better:
perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode,L1-icache-loads,mem_bound_stalls.ifetch -- ./test
Performance counter stats for './test':
117 627 976 899 instructions # 3,26 insn per cycle (55,55%)
36 036 207 473 cycles (55,55%)
242 416 565 topdown-bad-spec (55,55%)
15 235 688 116 topdown-be-bound (55,55%)
46 427 105 158 topdown-fe-bound (55,56%)
118 128 368 138 topdown-retiring (55,56%)
43 348 565 514 topdown_fe_bound.decode (55,56%)
25 130 455 182 L1-icache-loads (55,56%)
22 139 479 mem_bound_stalls.ifetch (55,55%)
10,250459485 seconds time elapsed
9,926537000 seconds user
0,321920000 seconds sys
Then I tried to move the _Z8procItemIjET_S0_ code into the same cache line where the main loop sits in the test program. There is just enough space to fit it after an unconditional jmp:
gdb ./test
> disas /r main
...
0x000000000000117e <+62>: 66 90 xchg %ax,%ax # the p2align nop
...
0x0000000000001180 <+64>: 8b 2b mov (%rbx),%ebp
0x0000000000001182 <+66>: 89 ef mov %ebp,%edi
0x0000000000001184 <+68>: e8 77 00 00 00 call 0x1200 <_Z8procItemIjET_S0_>
0x0000000000001189 <+73>: 01 c5 add %eax,%ebp
0x000000000000118b <+75>: 89 2b mov %ebp,(%rbx)
0x000000000000118d <+77>: 48 83 c3 04 add $0x4,%rbx
0x0000000000001191 <+81>: 48 8d 05 e8 2e 00 00 lea 0x2ee8(%rip),%rax # 0x4080 <AData>
0x0000000000001198 <+88>: 48 8d 80 00 00 09 3d lea 0x3d090000(%rax),%rax
0x000000000000119f <+95>: 48 39 c3 cmp %rax,%rbx
0x00000000000011a2 <+98>: 75 dc jne 0x1180 <main()+64>
0x00000000000011a4 <+100>: 41 83 c4 01 add $0x1,%r12d
0x00000000000011a8 <+104>: 41 83 fc 1f cmp $0x1f,%r12d
0x00000000000011ac <+108>: 7f 09 jg 0x11b7 <main()+119>
0x00000000000011ae <+110>: 48 8d 1d cb 2e 00 00 lea 0x2ecb(%rip),%rbx # 0x4080 <AData>
0x00000000000011b5 <+117>: eb da jmp 0x1191 <main()+81>
...
I reduced the and constant (so its immediate encodes in a single byte) and removed endbr64 from the procItem code, and it did fit:
0x000000000000117e <+62>: 66 90 xchg %ax,%ax
0x0000000000001180 <+64>: 8b 2b mov (%rbx),%ebp
0x0000000000001182 <+66>: 89 ef mov %ebp,%edi
0x0000000000001184 <+68>: eb 2e jmp 0x11b4 <main()+116>
0x0000000000001186 <+70>: 01 c5 add %eax,%ebp
0x0000000000001188 <+72>: 89 2b mov %ebp,(%rbx)
0x000000000000118a <+74>: 48 83 c3 04 add $0x4,%rbx
0x000000000000118e <+78>: 48 8d 05 eb 2e 00 00 lea 0x2eeb(%rip),%rax # 0x4080 <AData>
0x0000000000001195 <+85>: 48 8d 80 00 00 09 3d lea 0x3d090000(%rax),%rax
0x000000000000119c <+92>: 48 39 c3 cmp %rax,%rbx
0x000000000000119f <+95>: 75 df jne 0x1180 <main()+64>
0x00000000000011a1 <+97>: 41 83 c4 01 add $0x1,%r12d
0x00000000000011a5 <+101>: 41 83 fc 1f cmp $0x1f,%r12d
0x00000000000011a9 <+105>: 7f 10 jg 0x11bb <main()+123>
0x00000000000011ab <+107>: 48 8d 1d ce 2e 00 00 lea 0x2ece(%rip),%rbx # 0x4080 <AData>
0x00000000000011b2 <+114>: eb da jmp 0x118e <main()+78>
0x00000000000011b4 <+116>: 89 f8 mov %edi,%eax
0x00000000000011b6 <+118>: 83 e0 01 and $0x1,%eax
0x00000000000011b9 <+121>: eb cb jmp 0x1186 <main()+70>
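In the .s source, the change amounts to replacing the call with a pair of unconditional jumps to a local copy of the procedure body placed right after the loop. A rough sketch, reconstructed from the disassembly above (the label names are mine):
.p2align 6
.L7:
        movl    (%rbx), %ebp
        movl    %ebp, %edi
        jmp     .Lproc_local        # was: call _Z8procItemIjET_S0_
.Lproc_ret:
        addl    %eax, %ebp
        movl    %ebp, (%rbx)
        addq    $4, %rbx
.L6:
        leaq    AData(%rip), %rax
        leaq    1024000000(%rax), %rax
        cmpq    %rax, %rbx
        jne     .L7
        addl    $1, %r12d
        cmpl    $31, %r12d
        jg      .Lexit
        leaq    AData(%rip), %rbx
        jmp     .L6
.Lproc_local:                       # procItem body, endbr64 removed
        movl    %edi, %eax
        andl    $1, %eax            # immediate reduced from 1771 to 1
        jmp     .Lproc_ret
.Lexit:
        ...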
But the L1i cache loads did not get smaller, and the performance did not improve.
Now, the inlined code performs only 8B L1i cache loads, instead of the roughly 24B of the not-inlined but aligned code:
perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode,L1-icache-loads,mem_bound_stalls.ifetch -- ./test_inlined
Performance counter stats for './test_inlined':
84 821 505 867 instructions # 4,00 insn per cycle (55,52%)
21 227 922 348 cycles (55,54%)
220 958 811 topdown-bad-spec (55,56%)
16 025 540 940 topdown-be-bound (55,58%)
4 446 478 843 topdown-fe-bound (55,58%)
85 287 580 988 topdown-retiring (55,58%)
3 011 559 579 topdown_fe_bound.decode (55,56%)
8 721 798 673 L1-icache-loads (55,55%)
41 841 088 mem_bound_stalls.ifetch (55,53%)
6,134054414 seconds time elapsed
5,821809000 seconds user
0,310936000 seconds sys
And the code is aligned:
$ less test_inlined.s
main:
...
movl $0, %esi
jmp .L4
.p2align 6
.L6:
movl (%rax), %edx
movl %edx, %ecx
andl $1771, %ecx
...
And in GDB:
(gdb) disas /r main
...
0x000000000000117d <+61>: 0f 1f 00 nopl (%rax)
0x0000000000001180 <+64>: 8b 10 mov (%rax),%edx
0x0000000000001182 <+66>: 89 d1 mov %edx,%ecx
0x0000000000001184 <+68>: 81 e1 eb 06 00 00 and $0x6eb,%ecx
...
If I comment out the .p2align in the inlined code (or change it to .p2align 2, which still does not put it on a 64-byte boundary), I get twice as many L1i cache loads and a lot of decode stalls:
perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode,L1-icache-loads,mem_bound_stalls.ifetch -- ./test_inlined
Performance counter stats for './test_inlined':
84 745 317 310 instructions # 3,01 insn per cycle (55,56%)
28 128 515 432 cycles (55,56%)
223 781 428 topdown-bad-spec (55,56%)
9 861 892 463 topdown-be-bound (55,56%)
45 167 677 793 topdown-fe-bound (55,56%)
85 318 251 057 topdown-retiring (55,55%)
42 963 051 680 topdown_fe_bound.decode (55,55%)
16 906 109 601 L1-icache-loads (55,55%)
23 840 977 mem_bound_stalls.ifetch (55,56%)
7,939475212 seconds time elapsed
7,639294000 seconds user
0,298972000 seconds sys
In this case, the first instruction of the main loop sits at the end of an L1i cache line, and the second instruction actually spans two cache lines:
0x000000000000117d <+61>: 8b 10 mov (%rax),%edx
0x000000000000117f <+63>: 89 d1 mov %edx,%ecx
0x0000000000001181 <+65>: 81 e1 eb 06 00 00 and $0x6eb,%ecx
...
0x00000000000011a0 <+96>: 75 db jne 0x117d <main()+61>
The fetcher loads a whole cache line of which only the last couple of bytes are needed. The decoder parses them but discards most of its work. Now I think it indeed makes sense to report this as "Decode stalls" in the PMU counters and VTune, and not as some kind of "Fetch stalls": the fetcher does not stall, it loads the bytes; the decoding just turns out inefficient, because the needed instructions straddle L1i cache line boundaries, so little of the decoder's work is actually used.
Then, to figure out why the L1i cache loads would not decrease when the procedure code was moved into the same L1i cache line as the main loop, I tried to add a dummy jmp to the very next instruction inside the inlined and aligned code:
.p2align 6
.L6:
movl (%rax), %edx
movl %edx, %ecx
andl $1771, %ecx
addl %ecx, %edx
jmp .LMYLABLE_dummy
.LMYLABLE_dummy:
movl %edx, (%rax)
addq $4, %rax
...
With the dummy jmp, the inlined and aligned code also gets +8B of L1i cache loads:
perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode,L1-icache-loads,mem_bound_stalls.ifetch -- ./test_inlined
Performance counter stats for './test_inlined':
93 021 827 918 instructions # 3,30 insn per cycle (55,55%)
28 154 529 729 cycles (55,56%)
234 639 930 topdown-bad-spec (55,56%)
9 211 576 323 topdown-be-bound (55,55%)
37 637 161 357 topdown-fe-bound (55,56%)
93 487 157 601 topdown-retiring (55,56%)
32 428 181 436 topdown_fe_bound.decode (55,56%)
16 911 692 541 L1-icache-loads (55,56%)
24 980 016 mem_bound_stalls.ifetch (55,56%)
7,939847494 seconds time elapsed
7,626760000 seconds user
0,310990000 seconds sys
So, in the case of the not-inlined but aligned code, one iteration of the main loop causes 3 L1i cache loads: 1) the initial load of the iteration's code, 2) a load on the call, 3) a load on the ret. Even though the three loads hit the same L1i cache line, they slow down the front-end decoder.
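As a rough cross-check against the counters above: 32 passes over 256M elements is about 8.2G iterations. Three L1i loads per iteration gives about 24.6G, close to the 25.1G L1-icache-loads measured for the aligned, not-inlined build; one load per iteration gives about 8.2G, close to the 8.7G of the inlined, aligned build; and the unaligned, not-inlined build's 41.5G works out to roughly 5 loads per iteration.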
This makes you wonder whether it is possible to somehow mark the .text code as read-only for the CPU, so that it does not re-fetch the same cache line on every iteration. Although it probably would not make much difference in a real-world case: the inlined and aligned code already has a negligible count of FE-related stalls, so reducing the fetcher's work would not make it faster, and probably would not save much energy either.
I would also like to add a couple of summary points on the frontend pipeline. As far as I understand it now, there are 3 main areas in the FE pipeline, with their corresponding resources:

- The fetcher: L1-icache-loads counts its activity. It loads a line on every type of jump, so fewer jumps and good alignment of the code make the fetching efficient.
- … std::expected for the programmer.
- topdown_be_bound.register ("marble" stalls): "the physical register file unable to accept an entry". I also had a case when the backend Re-Order Buffer (ROB) could not accept new micro-op entries.