I checked the uops.info instruction table (https://uops.info/table.html) and found that the throughput (TP) listed for jmp rel8 is far higher than for jmp rel32. Does this mean that jmp rel8 is slower than jmp rel32?
jmp rel32, with unroll_count=500 and no inner loop:
Code:
0: e9 00 00 00 00 jmp 0x5
Results:
Instructions retired: 1.0
Core cycles: 2.75
Reference cycles: 2.05
jmp rel8, with unroll_count=500 and no inner loop:
Code:
0: eb 00 jmp 0x2
Results:
Instructions retired: 1.0
Core cycles: 5.84
Reference cycles: 4.61
That's not a very representative measurement. One-per-2-cycles throughput is normal for taken branches, or 1/clock for loop branches in tiny loops. But branch prediction can do worse with more branches per 16-byte block of code, depending on the microarchitecture, so packing jmp next_instruction (jmp rel8 with a displacement of 0) instructions back to back is bad. (Especially when you put 500 of them in a row, as in Slow jmp-instruction.)
That 5.84 number looks like Alder Lake P-cores. uops.info came up with different numbers for other uarches; for something this low-level, it matters a lot which microarchitecture you look at.
Anyway, I think the key point here is that https://uops.info/ doesn't benchmark taken jumps very well; they use the same test harness as for other instructions (unrolling the instruction many times), which produces numbers that don't really characterize taken-branch throughput.
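For concreteness, here's a sketch (my reconstruction, not uops.info's actual generator) of what such an unrolled test body looks like for jmp rel8: 500 copies of the 2-byte encoding from the question, each a taken branch to the next instruction:

```python
# Reconstruct the unrolled jmp-rel8 test body: 500 copies of
# "eb 00" (jmp to the next instruction), executed straight through.
JMP_REL8_NEXT = bytes([0xEB, 0x00])  # jmp rel8, displacement 0
UNROLL = 500

body = JMP_REL8_NEXT * UNROLL
print(len(body))        # 1000 bytes of code, every instruction a taken branch
print(len(body) // 16)  # spread across 62 full 16-byte blocks
```

So instead of the same few branches executing repeatedly (the common real-world case), the front end and branch predictor see a long run of distinct, densely packed taken branches, which is exactly the pattern that measures badly.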
Agner Fog's instruction tables (https://agner.org/optimize/) report different numbers, e.g. 1-2 cycle throughput for a relative jmp on Skylake and Ice Lake, the same as most earlier Intel. That's realistic when the jumps are inside a loop, so it's the same few jump instructions executing in sequence.
But uops.info measured 2.12c or 4.80c for Skylake, way higher, something you hopefully only run into with artificial microbenchmarks.