Tags: performance, assembly, x86, cpu-architecture, branch-prediction

What is the overhead of jumps and call-rets for the CPU front-end decoder?


How do jumps and call-ret pairs affect the CPU front-end decoder in the best case, when there are few instructions, they are well cached, and branches are well predicted?

For example, I run a simple test program on an Intel N150 CPU:

// test.cpp
constexpr int N = 256'000'000;
alignas(64) unsigned int AData[N];
constexpr unsigned mask = 0b11011101011;

template <typename T>
T procItem(T item) {
  return item & mask;
}

int main() {
  for (unsigned long long i=0; i<N; i++) {
    AData[i] = i;
  }

  constexpr int n_proc = 32; // or 64
  for (int i=0; i<n_proc; i++) {
    asm("" ::: "memory");
    // prevent GCC -O3 from doing 32 and+add iterations
    // to the same element in a register (loop interchange),
    // which bizarrely also stops it from SIMD vectorizing.

    for (auto& item : AData) {
      item += procItem(item);
      //item += item & mask;
    }
  }
}

If I compile it with GCC and -Og optimization, procItem is not inlined in the assembly:

.L7:
  movl  (%rbx), %ebp
  movl  %ebp, %edi
  call  _Z8procItemIjET_S0_
  addl  %eax, %ebp
  movl  %ebp, (%rbx)
  addq  $4, %rbx

.L6:
  leaq  AData(%rip), %rax
  leaq  1024000000(%rax), %rax
  cmpq  %rax, %rbx
  jne .L7
...

_Z8procItemIjET_S0_:
  endbr64
  movl  %edi, %eax
  andl  $1771, %eax
  ret

It runs in about 14-17 seconds, pushing something like 1GB/s on the system memory bus. VTune microarchitecture analysis reports that it is mostly front-end bound (43% of the pipeline slots), and most of that is due to stalls in Decode.

A compilation with -O3 -march=gracemont produces inlined SIMD code that runs 3.8s at about 20GB/s of memory bus bandwidth. (The bus maximum is 32GB/s.)
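
For reference, the vectorized inner loop looks roughly like this (an illustrative AVX2 sketch, not GCC's exact output; the register choices, the .Lvec label, and the assumption that %ymm2 holds the broadcast mask and %rdx the end pointer are mine):

.Lvec:
  vmovdqa (%rax), %ymm1          # load 8 unsigned ints
  vpand   %ymm2, %ymm1, %ymm0    # item & mask
  vpaddd  %ymm1, %ymm0, %ymm0    # item + (item & mask)
  vmovdqa %ymm0, (%rax)          # store back
  addq    $32, %rax
  cmpq    %rdx, %rax
  jne     .Lvec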

Then, if I just comment out the call to procItem and uncomment the line that processes the array item directly, so that there is no call in the assembly, the -Og compilation runs in 6.3s at 10GB/s. This time VTune reports that it is Backend bound (27% of pipeline slots), mainly due to a Full Reorder Buffer (17.4%) and the non-memory scheduler (7.3%).

.L6:
  movl  (%rax), %edx
  movl  %edx, %ecx
  andl  $1771, %ecx
  addl  %ecx, %edx
  movl  %edx, (%rax)
  addq  $4, %rax

.L5:
  leaq  AData(%rip), %rdx
  leaq  1024000000(%rdx), %rdx 
  cmpq  %rdx, %rax
  jne .L6
...

I guess call creates a problem for the front-end, because ret performs an indirect jump: it jumps to the return address that is read back from the stack through the stack pointer. Is that correct? Also, call and ret perform a memory access to push and pop that return address. But, I guess, the front-end Decode stalls of about 40% of pipeline slots are due to the indirect jump, not the memory access?
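
Written out, the pair is architecturally roughly equivalent to this illustrative expansion (not what the CPU literally executes; the .Lafter label and the %r10 scratch register are mine):

# call _Z8procItemIjET_S0_ behaves roughly like:
  leaq  .Lafter(%rip), %r10   # return address = address of the next instruction
  pushq %r10                  # store it on the stack
  jmp   _Z8procItemIjET_S0_
.Lafter:

# ret behaves roughly like:
  popq  %r10                  # read the return address back from the stack
  jmp   *%r10                 # indirect jump through a register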

To test this assumption, I tried to add a small jmp loop in place of the call, like this:

# test_inlined_jmp4.s
.L6:
  movl  (%rax), %edx

  mov $0, %r11
.LMYLABLE:
  movl  %edx, %ecx
  andl  $1771, %ecx
  addl  %ecx, %edx

  add $1, %r11
  cmp $4, %r11
  jl .LMYLABLE

  movl  %edx, (%rax)
  addq  $4, %rax

.L5:
...

It indeed runs much better than the call-ret version, with far fewer front-end stalls:

perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode -- ./test_inlined_jmp4

 Performance counter stats for './test_inlined_jmp4':

   265 093 691 896      instructions                     #    4,68  insn per cycle              (71,43%)
    56 668 199 782      cycles                                                                  (71,43%)
       273 745 456      topdown-bad-spec                                                        (71,43%)
     3 963 180 542      topdown-be-bound                                                        (71,43%)
    13 350 579 533      topdown-fe-bound                                                        (71,43%)
   265 615 457 686      topdown-retiring                                                        (71,43%)
    11 689 808 946      topdown_fe_bound.decode                                                 (71,43%)

      16,592255607 seconds time elapsed

It is slower simply because it executes more code. But the IPC is way better: most of the pipeline slots retire instructions. (This is a 5-issue CPU, so an IPC close to 5 is pretty good.)

Then, I tried to unroll this loop of 4 jumps, keeping the same number of instructions:

  mov $0, %r11
#.LMYLABLE:
  # 0
  movl  %edx, %ecx
  andl  $1771, %ecx
  addl  %ecx, %edx

  add $1, %r11
  cmp $4, %r11
  nop

  # 1
  ...same code
  # 2
  ...
  # 3
  ...
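
For completeness, each of the four copies presumably looks like the block below, with the nop standing in for the removed jl so the instruction count stays the same (a reconstructed sketch, since the repeated copies are elided above):

  movl  %edx, %ecx
  andl  $1771, %ecx
  addl  %ecx, %edx

  add $1, %r11
  cmp $4, %r11
  nop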

And it runs somewhat better, with even fewer front-end stalls:

perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode -- ./test_inlined_unroll

 Performance counter stats for './test_inlined_unroll':

   265 096 917 837      instructions                     #    4,87  insn per cycle              (71,42%)
    54 379 876 310      cycles                                                                  (71,42%)
       280 519 725      topdown-bad-spec                                                        (71,43%)
     4 725 503 253      topdown-be-bound                                                        (71,44%)
     1 162 691 302      topdown-fe-bound                                                        (71,43%)
   265 599 486 048      topdown-retiring                                                        (71,43%)
       215 563 581      topdown_fe_bound.decode                                                 (71,42%)

      15,230292928 seconds time elapsed

So, is it correct to conclude the following:

  1. Do jumps via registers, as in a call-ret pair, stall the front-end decoder? How do the decoder and the branch predictor work in this case? There is no predictor for register-based jumps, right? (Jumping to code through a register used as a pointer seems like an operation with non-trivial consequences.)

    I.e. following this point in the answer about small branches:

    When the instruction is being decoded to uops and fed into the front-end, register values aren't available; those are only available in the out-of-order execution back-end.

    Then a jump via a register (as with ret) must make things much worse for the front-end. Does it stall only the decoder, or other components of the front-end as well?

  2. A simple jmp is still slightly slower than unrolling: does it have some small overhead in the decoder? Is it mainly because the branch predictor has to do some work on every branch, even in the best case scenario? By "best case scenario" I mean that there are few instructions, they are well cached and correctly predicted, and the Branch Target Buffer is not saturated like in the slow jmp question.

The second point is similar to the question about small branches, but I would like to clarify it further from this angle: there is no special elision of short-distance branches; short branches have to be speculated like any other branch. So, in the best case scenario, when branches are predicted well and the code is small enough to fit into the L1i cache (and whatever micro-op caches the front-end has), is there still some remaining overhead for the front-end and its decoder in processing the branch instructions (cmp and jmp)? From the PMU measurements above, the overhead looks negligible compared to unrolling. Can it add up and get worse in some situations, for example when there are many branches and the BTB gets saturated, as in the slow jmp question?


Solution

  • It looks like, just as the comments suggest, the branching itself is not the cause of most of the performance degradation. And there is indeed not much difference between an unconditional jmp and a call-ret pair, i.e. they are both covered by the branch predictor equally well.

    In such a small program, the bottleneck comes from how the fetcher works. In my case:

    I compile the code to assembly with GCC 14.2:

    # Makefile
    %.s: %.cpp
      g++ -std=c++23 -Og -g -S $< -o $@
    
    %: %.s
      g++ -std=c++23 -Og -g $< -o $@
    

    The non-inlined code is not aligned to a 64-byte boundary:

    main:
    ...
      movl $0, %r12d
      jmp .L5
      <no .p2align!>
      # manually added alignment directive:
      #.p2align 6
    
    .L7:
      movl  (%rbx), %ebp
      movl  %ebp, %edi
      call  _Z8procItemIjET_S0_
    ...
    

    The performance is pretty bad. The front-end is stalled a lot, mostly on "decode" stalls. There are 8B iterations (32 repetitions over an array of 256M elements), but close to 40B L1i cache loads, i.e. roughly 5 per iteration:

    perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode,L1-icache-loads,mem_bound_stalls.ifetch -- ./test
    
     Performance counter stats for './test':
    
       117 654 640 228      instructions                     #    2,26  insn per cycle              (55,55%)
        52 001 267 137      cycles                                                                  (55,55%)
           274 424 423      topdown-bad-spec                                                        (55,55%)
        12 346 611 654      topdown-be-bound                                                        (55,56%)
       129 060 603 366      topdown-fe-bound                                                        (55,56%)
       118 201 934 446      topdown-retiring                                                        (55,56%)
       123 176 125 025      topdown_fe_bound.decode                                                 (55,56%)
        41 531 138 121      L1-icache-loads                                                         (55,56%)
            26 774 607      mem_bound_stalls.ifetch                                                 (55,55%)
    
          14,566056453 seconds time elapsed
    
          14,243623000 seconds user
           0,314947000 seconds sys
    

    If I add .p2align 6 manually, it gets much better:

    perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode,L1-icache-loads,mem_bound_stalls.ifetch -- ./test
    
     Performance counter stats for './test':
    
       117 627 976 899      instructions                     #    3,26  insn per cycle              (55,55%)
        36 036 207 473      cycles                                                                  (55,55%)
           242 416 565      topdown-bad-spec                                                        (55,55%)
        15 235 688 116      topdown-be-bound                                                        (55,55%)
        46 427 105 158      topdown-fe-bound                                                        (55,56%)
       118 128 368 138      topdown-retiring                                                        (55,56%)
        43 348 565 514      topdown_fe_bound.decode                                                 (55,56%)
        25 130 455 182      L1-icache-loads                                                         (55,56%)
            22 139 479      mem_bound_stalls.ifetch                                                 (55,55%)
    
          10,250459485 seconds time elapsed
    
           9,926537000 seconds user
           0,321920000 seconds sys
    

    Then I tried to move the _Z8procItemIjET_S0_ code into the same cache line where the main loop sits in the test program. There is just enough space to fit it after an unconditional jmp:

    gdb ./test
    > disas /r main
    ...
   0x000000000000117e <+62>:    66 90               xchg   %ax,%ax   # the p2align nop
    ...
       0x0000000000001180 <+64>:    8b 2b               mov    (%rbx),%ebp
       0x0000000000001182 <+66>:    89 ef               mov    %ebp,%edi
       0x0000000000001184 <+68>:    e8 77 00 00 00      call   0x1200 <_Z8procItemIjET_S0_>
       0x0000000000001189 <+73>:    01 c5               add    %eax,%ebp
       0x000000000000118b <+75>:    89 2b               mov    %ebp,(%rbx)
       0x000000000000118d <+77>:    48 83 c3 04         add    $0x4,%rbx
       0x0000000000001191 <+81>:    48 8d 05 e8 2e 00 00    lea    0x2ee8(%rip),%rax        # 0x4080 <AData>
       0x0000000000001198 <+88>:    48 8d 80 00 00 09 3d    lea    0x3d090000(%rax),%rax
       0x000000000000119f <+95>:    48 39 c3            cmp    %rax,%rbx
       0x00000000000011a2 <+98>:    75 dc               jne    0x1180 <main()+64>
       0x00000000000011a4 <+100>:   41 83 c4 01         add    $0x1,%r12d
       0x00000000000011a8 <+104>:   41 83 fc 1f         cmp    $0x1f,%r12d
       0x00000000000011ac <+108>:   7f 09               jg     0x11b7 <main()+119>
       0x00000000000011ae <+110>:   48 8d 1d cb 2e 00 00    lea    0x2ecb(%rip),%rbx        # 0x4080 <AData>
       0x00000000000011b5 <+117>:   eb da               jmp    0x1191 <main()+81>
    ...
    

    I shrank the and immediate (so it encodes as an imm8) and removed the endbr64 from the procItem code, and it did fit:

       0x000000000000117e <+62>:  66 90               xchg   %ax,%ax
       0x0000000000001180 <+64>:  8b 2b               mov    (%rbx),%ebp
       0x0000000000001182 <+66>:  89 ef               mov    %ebp,%edi
       0x0000000000001184 <+68>:  eb 2e               jmp    0x11b4 <main()+116>
       0x0000000000001186 <+70>:  01 c5               add    %eax,%ebp
       0x0000000000001188 <+72>:  89 2b               mov    %ebp,(%rbx)
       0x000000000000118a <+74>:  48 83 c3 04         add    $0x4,%rbx
       0x000000000000118e <+78>:  48 8d 05 eb 2e 00 00  lea    0x2eeb(%rip),%rax        # 0x4080 <AData>
       0x0000000000001195 <+85>:  48 8d 80 00 00 09 3d  lea    0x3d090000(%rax),%rax 
       0x000000000000119c <+92>:  48 39 c3            cmp    %rax,%rbx
       0x000000000000119f <+95>:  75 df               jne    0x1180 <main()+64>
       0x00000000000011a1 <+97>:  41 83 c4 01         add    $0x1,%r12d
       0x00000000000011a5 <+101>: 41 83 fc 1f         cmp    $0x1f,%r12d
       0x00000000000011a9 <+105>: 7f 10               jg     0x11bb <main()+123>
       0x00000000000011ab <+107>: 48 8d 1d ce 2e 00 00  lea    0x2ece(%rip),%rbx        # 0x4080 <AData>
       0x00000000000011b2 <+114>: eb da               jmp    0x118e <main()+78>
       0x00000000000011b4 <+116>: 89 f8               mov    %edi,%eax
       0x00000000000011b6 <+118>: 83 e0 01            and    $0x1,%eax
       0x00000000000011b9 <+121>: eb cb               jmp    0x1186 <main()+70>
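
    In assembly-source terms, that hand edit looked roughly like this (a sketch reconstructed from the disassembly above; the .Lproc and .Lret label names are mine):

    .L7:
      movl  (%rbx), %ebp
      movl  %ebp, %edi
      jmp   .Lproc                 # replaces "call _Z8procItemIjET_S0_"
    .Lret:
      addl  %eax, %ebp
      movl  %ebp, (%rbx)
      addq  $4, %rbx
    .L6:
      leaq  AData(%rip), %rax
      leaq  1024000000(%rax), %rax
      cmpq  %rax, %rbx
      jne   .L7
      ...
    .Lproc:                        # procItem body, now in the same 64-byte line
      movl  %edi, %eax
      andl  $1, %eax               # immediate shrunk to an imm8 so everything fits
      jmp   .Lret                  # replaces "ret"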
    

    But the L1i cache loads did not get smaller, and the performance did not improve.

    Now, the inlined code does only 8B L1i cache loads, instead of the 24B of the not-inlined but aligned code:

    perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode,L1-icache-loads,mem_bound_stalls.ifetch -- ./test_inlined
    
     Performance counter stats for './test_inlined':
    
        84 821 505 867      instructions                     #    4,00  insn per cycle              (55,52%)
        21 227 922 348      cycles                                                                  (55,54%)
           220 958 811      topdown-bad-spec                                                        (55,56%)
        16 025 540 940      topdown-be-bound                                                        (55,58%)
         4 446 478 843      topdown-fe-bound                                                        (55,58%)
        85 287 580 988      topdown-retiring                                                        (55,58%)
         3 011 559 579      topdown_fe_bound.decode                                                 (55,56%)
         8 721 798 673      L1-icache-loads                                                         (55,55%)
            41 841 088      mem_bound_stalls.ifetch                                                 (55,53%)
    
           6,134054414 seconds time elapsed
    
           5,821809000 seconds user
           0,310936000 seconds sys
    

    And the code is aligned:

    $ less test_inlined.s
    main:
      ...
      movl  $0, %esi
      jmp .L4
      .p2align 6
    .L6:
      movl  (%rax), %edx
      movl  %edx, %ecx
      andl  $1771, %ecx
    ...
    

    And in GDB:

    (gdb) disas /r main
    ...
       0x000000000000117d <+61>:  0f 1f 00            nopl   (%rax)
    
       0x0000000000001180 <+64>:  8b 10               mov    (%rax),%edx
       0x0000000000001182 <+66>:  89 d1               mov    %edx,%ecx
       0x0000000000001184 <+68>:  81 e1 eb 06 00 00   and    $0x6eb,%ecx
    ...
    

    If I comment out the .p2align from the inlined code (or change it to .p2align 2, which aligns only to 2^2 = 4 bytes, not to a 64-byte boundary), I get twice as many L1i cache loads and a lot of decode stalls:

    perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode,L1-icache-loads,mem_bound_stalls.ifetch -- ./test_inlined
    
     Performance counter stats for './test_inlined':
    
        84 745 317 310      instructions                     #    3,01  insn per cycle              (55,56%)
        28 128 515 432      cycles                                                                  (55,56%)
           223 781 428      topdown-bad-spec                                                        (55,56%)
         9 861 892 463      topdown-be-bound                                                        (55,56%)
        45 167 677 793      topdown-fe-bound                                                        (55,56%)
        85 318 251 057      topdown-retiring                                                        (55,55%)
        42 963 051 680      topdown_fe_bound.decode                                                 (55,55%)
        16 906 109 601      L1-icache-loads                                                         (55,55%)
            23 840 977      mem_bound_stalls.ifetch                                                 (55,56%)
    
           7,939475212 seconds time elapsed
    
           7,639294000 seconds user
           0,298972000 seconds sys
    

    In this case, the first instruction of the main loop sits at the end of an L1i cache line, and the second instruction actually spans two cache lines:

       0x000000000000117d <+61>:    8b 10               mov    (%rax),%edx
       0x000000000000117f <+63>:    89 d1               mov    %edx,%ecx
       0x0000000000001181 <+65>:    81 e1 eb 06 00 00   and    $0x6eb,%ecx
    ...
       0x00000000000011a0 <+96>:    75 db               jne    0x117d <main()+61>
    

    The fetcher loads a whole cache line of which only the last couple of bytes are needed. The decoder parses them but discards most of its work. Now I think it indeed makes sense to report this as "Decode stalls" in the PMU counters and VTune, and not as some kind of "Fetch stalls": the fetcher does not stall, it loads the bytes; it is the decoding that turns out inefficient, because the needed instructions straddle L1i cache-line boundaries, so little of the decoder's work is actually used.

    Then, to figure out why the L1i cache loads would not decrease when the procedure code was moved into the same L1i cache line as the main loop, I tried to add a dummy jmp (to the very next instruction) inside the inlined and aligned code:

      .p2align 6
    
    .L6:
      movl  (%rax), %edx
      movl  %edx, %ecx
      andl  $1771, %ecx
      addl  %ecx, %edx
    
      jmp .LMYLABLE_dummy
    .LMYLABLE_dummy:
    
      movl  %edx, (%rax)
      addq  $4, %rax
    ...
    

    With the dummy jmp, the inlined and aligned code also picks up an extra ~8B of L1i cache loads:

    perf stat -e instructions,cycles,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring,topdown_fe_bound.decode,L1-icache-loads,mem_bound_stalls.ifetch -- ./test_inlined
    
     Performance counter stats for './test_inlined':
    
        93 021 827 918      instructions                     #    3,30  insn per cycle              (55,55%)
        28 154 529 729      cycles                                                                  (55,56%)
           234 639 930      topdown-bad-spec                                                        (55,56%)
         9 211 576 323      topdown-be-bound                                                        (55,55%)
        37 637 161 357      topdown-fe-bound                                                        (55,56%)
        93 487 157 601      topdown-retiring                                                        (55,56%)
        32 428 181 436      topdown_fe_bound.decode                                                 (55,56%)
        16 911 692 541      L1-icache-loads                                                         (55,56%)
            24 980 016      mem_bound_stalls.ifetch                                                 (55,56%)
    
           7,939847494 seconds time elapsed
    
           7,626760000 seconds user
           0,310990000 seconds sys
    

    So, in the case of the not-inlined but aligned code, one iteration of the main loop causes 3 L1i cache loads: 1) the initial fetch of the iteration code, 2) a fetch on the call, 3) a fetch on the ret. Even when the three loads hit the same L1i cache line, they slow down the front-end decoder. The counters match this: roughly 8.2B iterations × 3 ≈ 25B L1i loads for the aligned call-ret version, × 2 ≈ 17B with the dummy jmp, and × 1 ≈ 9B for the straight-line inlined loop.

    This makes me wonder whether it is possible to somehow tell the CPU that the .text code will not change, so that it does not re-fetch the same cache line on every taken branch. Although it probably would not make much difference in a real-world case: the inlined and aligned code already has a negligible count of FE-related stalls, so a reduction in the fetcher work would not make it faster, and it probably would not save a lot of energy either.

    I would also like to add a couple of summary points on the front-end pipeline. As far as I understand it now, there are 3 main areas in the FE pipeline, each with its corresponding resources: