Sometimes we purposefully leave NOPs in a function for later runtime patching. Instead of:
.nops 16
Why not:
jmp 0f
.nops 14
0:
Or, if the amount you need to patch in varies up to a maximum:
.rept 8
jmp 0f
.endr
0:
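For concreteness, here's roughly what the two 16-byte versions look like at the byte level. The long-NOP split in the first variant is the assembler's choice, so treat those encodings as illustrative; the jmp encodings are exact.
# Variant 1: one short jmp over a single block of padding.
jmp 0f              # eb 0e: 2-byte jmp rel8, skips the next 14 bytes
.nops 14            # e.g. an 11-byte + 3-byte 0F 1F long NOP (split varies)
0:
# Variant 2: a ladder of 2-byte jumps, each landing on the end label.
.rept 8
jmp 0f              # eb 0e, eb 0c, ..., eb 00: displacements shrink to 0
.endr
0:
Either way you can overwrite the front of the block with patched code at runtime; in variant 2 the idea is that whatever remains unpatched is still a jump straight to the end.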
The advantage of using jumps like this is that the CPU should only spend time on the first instruction and then jump over the rest. Is there a reason this isn't more widely used? One possible reason is that unconditional jumps still take up branch-predictor slots. Another is that multi-byte NOPs can pack more padding bytes per instruction. But I assume once you're above some number of bytes, there is a performance advantage?
It can indeed be worth jumping if the block of NOPs is long enough, but 40+ bytes is only a few NOPs and probably a borderline case, not a big win either way.
You never want multiple 1-byte NOPs if they're going to actually execute; that would be horrible, filling up the uop cache and maybe even making this 32-byte block have too many total uops to be cacheable. It also wastes a lot of space in the ReOrder Buffer (ROB), limiting the ability of out-of-order exec to see past it.
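To put rough numbers on that, here's an illustrative GAS comparison of the same 16 bytes of padding; the uop counts assume one uop per NOP, and the exact long-NOP split .nops picks can vary:
# Bad: 16 one-byte NOPs = 16 uops through the decoders, uop cache, and ROB.
.rept 16
nop                 # 0x90
.endr
# Normal: the same 16 bytes as (typically) two 0F 1F long NOPs = 2 uops.
.nops 16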
The upside of a jump is that it only takes 1 ROB entry. And, if the front-end handles it well enough, only 1 slot in the front-end, entering the back-end along with multiple other uops of useful work, with no lost cycles of alloc/rename. (Issue in Intel terminology, dispatch in everyone else's.) But that's potentially a big if; a taken jump can easily create a bubble in the front-end. Buffering between stages can absorb that bubble, especially if surrounding code has plenty of straight-line blocks.
Code fetch usually happens in 16- or 32-byte chunks on modern x86. But running from the uop cache, Zen 4 / Zen 5 can run 12 NOPs per cycle (with 4-byte NOPs), handling them in pairs; https://chipsandcheese.com/p/amds-zen-4-part-1-frontend-and-execution-engine has a benchmark. Most other CPUs can't pair them, so NOP throughput = pipeline width. (Or worse on Nehalem, where NOPs even needed an execution unit.)
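A minimal sketch of that kind of NOP-throughput test (my own construction, not the chipsandcheese benchmark itself; the iteration count is arbitrary, and you'd measure with something like perf stat and divide NOPs retired by core cycles):
# Standalone Linux x86-64 loop: a block of 4-byte NOPs small enough to run
# from the uop cache, repeated enough times to dominate the measurement.
.globl _start
_start:
    mov     $100000000, %ecx        # arbitrary iteration count
.p2align 6
1:
    .rept 128
    .byte   0x0f, 0x1f, 0x40, 0x00  # canonical 4-byte long NOP (nopl 0(%rax))
    .endr
    dec     %ecx
    jnz     1b

    mov     $60, %eax               # __NR_exit
    xor     %edi, %edi
    syscall
Assemble and link with as + ld (no libc needed), then perf stat it; 128 NOPs plus the dec/jnz per iteration keeps the loop overhead small.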
Another downside to jmp is that it ends a uop-cache line on Intel Sandybridge-family CPUs, so despite being only 1 uop, it can in the worst case take up as much space in the uop cache as 6 NOPs. But of course usually less, unless it's the first insn in a 32B block (https://agner.org/optimize/microarchitecture.pdf#page=125).
I don't know for sure if that's still the case in Intel's latest P-cores, or in Zen's uop cache. Intel E-cores don't use a uop cache; they use clustered decode, where, if I understand correctly, different clusters alternate in handling groups of instructions.
Some assemblers, like GAS, and NASM with %use smartalign / ALIGNMODE p6 (footnote 1), have a size threshold for when it's worth jumping over padding vs. letting NOPs execute. Obviously that threshold would be much smaller if you only used single-byte NOPs. The default for directives like .p2align is long NOPs of up to at least 8 to 11 bytes, which all CPUs can handle efficiently, or even 15-byte NOPs, which most but not all can.
GAS for x86-64 chooses to jump when padding by 126 bytes (which would have been 11 or 12 NOPs), but not when padding by 62 bytes with 6 NOPs. That seems somewhat reasonable.
I tested with xor %eax,%eax; .p2align 6 or 7. (Godbolt with GAS, Clang, and NASM; clang's built-in assembler never jumps.)
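Roughly what I assembled, as one file (the padding amounts in the comments are what the instruction sizes work out to; the jump-vs-NOPs behaviour is GAS's, as described above):
xor %eax,%eax       # 2 bytes
.p2align 7          # 126 bytes of padding needed: GAS jumps over the long NOPs
xor %eax,%eax       # 2 bytes past a 128-byte boundary
.p2align 6          # 62 bytes of padding: GAS emits long NOPs, no jmp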
Footnote 1: ALIGNMODE p6 allows 0f 1f long-NOPs, an opcode that was new in Pentium Pro (P6) and is baseline for x86-64. https://www.nasm.us/xdoc/2.16.03/html/nasmdoc6.html#section-6.2
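For reference, these are the standard 0F 1F long-NOP encodings (the forms Intel's manuals recommend; assemblers build their padding out of these, and 10- to 11-byte NOPs just stack extra 66 prefixes on the 9-byte one):
.byte 0x0f, 0x1f, 0x00                                      # 3 bytes: nopl (%rax)
.byte 0x0f, 0x1f, 0x40, 0x00                                # 4 bytes: nopl 0(%rax)
.byte 0x0f, 0x1f, 0x44, 0x00, 0x00                          # 5 bytes: nopl 0(%rax,%rax,1)
.byte 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00                    # 6 bytes: nopw 0(%rax,%rax,1)
.byte 0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00              # 7 bytes: nopl with disp32
.byte 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00        # 8 bytes: nopl 0(%rax,%rax,1), disp32
.byte 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00  # 9 bytes: nopw 0(%rax,%rax,1), disp32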
Even with that, the smartalign macro package does a bad job, using at most 8-byte long NOPs, not 11-byte ones, which I think all (modern?) CPUs can handle efficiently.
Its default jump threshold is 16 bytes, which seems unreasonably small even with it limiting itself to 8-byte NOPs. It's clearly not very well tuned for modern CPUs.
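If you do use smartalign, you can at least raise the threshold yourself. A minimal NASM sketch, assuming I'm reading the docs right that ALIGNMODE's optional second operand overrides the jump threshold (the 64 is just my guess at a saner value):
%use smartalign
ALIGNMODE p6, 64    ; 0f 1f long NOPs; only jump over padding larger than this

xor eax, eax
ALIGN 64            ; 62 bytes of padding: under the threshold, so this should be NOPs, not a jmp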