assembly, optimization, x86, alignment, no-op

How many 1-byte NOPs can Skylake execute in one cycle?


I'm aligning branch targets with NOPs, and sometimes the CPU executes these NOPs, up to 15 of them. How many 1-byte NOPs can Skylake execute in one cycle? What about other Intel-compatible processors, like AMD? I'm interested not only in Skylake but in other microarchitectures as well. How many cycles might it take to execute a sequence of 15 NOPs? I want to know whether the extra code size and extra execution time of adding these NOPs is worth the price. I'm not adding these NOPs myself; the assembler inserts them automatically whenever I write an align directive.

Update: I have managed to get the assembler to insert multi-byte NOPs automatically.


Solution

  • Skylake can generally execute four single-byte nops in one cycle. This has been true at least back to the Sandy Bridge (hereafter SnB) micro-architecture.

    Skylake, and others back to SnB, can generally also execute four longer-than-one-byte nops in one cycle, unless they are so long that they run into front-end limitations.


    The existing answers are much more complete and explain why you might not want to use such single-byte nop instructions, so I won't add more. I just think it's nice to have one answer that clearly answers the headline question.
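
As a rough worked example of the arithmetic, the sketch below applies the four-nops-per-cycle figure to the 15 bytes of padding from the question. The function name `padding_cycles` and the maximum nop length of 8 bytes are hypothetical, chosen only for illustration. The result is a best-case lower bound that assumes the nops are the only instructions competing for those slots and ignores the front-end limitations mentioned above.

```python
import math

# Figure taken from the answer above: four nops per cycle (SnB through Skylake).
NOPS_PER_CYCLE = 4


def padding_cycles(pad_bytes: int, max_nop_len: int) -> int:
    """Best-case cycles to execute `pad_bytes` of padding.

    Assumes the padding is filled with as few nops as possible, each at
    most `max_nop_len` bytes long, and that four nops execute per cycle.
    Front-end effects (fetch, decode, uop cache) are deliberately ignored.
    """
    if pad_bytes < 0 or max_nop_len < 1:
        raise ValueError("pad_bytes must be >= 0 and max_nop_len >= 1")
    nop_count = math.ceil(pad_bytes / max_nop_len)
    return math.ceil(nop_count / NOPS_PER_CYCLE)


# 15 single-byte nops: 15 instructions, so four cycles at best.
print(padding_cycles(15, max_nop_len=1))  # 4

# Same padding with multi-byte nops of up to 8 bytes (an assumed limit):
# 2 instructions, which fit in a single cycle.
print(padding_cycles(15, max_nop_len=8))  # 1
```

Under these assumptions, switching from single-byte to multi-byte nops cuts the worst-case padding cost from about four cycles to one. The code size stays the same either way.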