assembly x86-64 masm memory-alignment micro-optimization

Using the operand-size override prefix 0x66 for instruction alignment

Recently I came across the legacy 0x66 operand-size override prefix.
Could it be used to align instructions without explicitly writing a single or multi-byte NOP instruction?

For example, adding the align 16 directive:

int   3
mov   rax,1        
align 16
add   rcx,rax

generates this disassembly:

...1000 cc               int     3
...1001 48c7c001000000   mov     rax,1
...1008 0f1f840000000000 nop     dword ptr [rax+rax]  ; <--- multi-byte NOP instruction
...1010 4803c8           add     rcx,rax              ; <--- 16-byte aligned

Removing align 16 and instead prefixing mov rax, 1 with eight 0x66 bytes that should be ignored:

int   3
db    8 DUP (66h)
mov   rax,1  
add   rcx,rax

generates this disassembly:

...1000 cc                             int     3
...1001 666666666666666648c7c001000000 mov     rax,1
...1010 4803c8                         add     rcx,rax  ; <--- 16-byte aligned

Is the 0x66 alignment technique valid and faster than using align 16?

UPDATE

As suggested, it works using the 0x2E CS segment override prefix. Tested with NASM:

nop
CSbuf: times 8 db 2Eh
mov rax,strict dword 1
add rcx,rax

and the add rcx,rax was 16-byte aligned:

00007ff7`f78d1010 4801c1          add     rcx,rax

Built using these commands:

nasm -fwin64 test.asm
link.exe /subsystem:console /machine:x64 /defaultlib:kernel32.lib /defaultlib:user32.lib /defaultlib:libcmt.lib /entry:main test.obj

Solution

The basic idea is a good one: padding earlier instructions with prefixes is a cheaper way to align than using even one multi-byte NOP. (A NOP takes a slot in the decoders, in the uop cache, and in the issue/rename stage, which is the narrowest part of the pipeline. It also needs a ROB entry to track it until retirement.)

That's why Intel recommends lengthening other instructions in inner loops when working around the performance pothole introduced by their microcode update for the JCC erratum.

Assemblers should always have been doing this for align directives (or offering some other mechanism to specify which instructions to pad), but nobody was motivated enough until Intel's JCC erratum came with an official recommendation to do it this way. (Because unlike aligning tops of loops, the padding may have to be inside an inner loop, where it would cost front-end bandwidth every iteration if it was a NOP instead of part of other instructions.) Unfortunately this JCC-erratum mitigation has mostly been added as a separate feature by assemblers (e.g. GAS's -mbranches-within-32B-boundaries) or left to compilers, without providing a general way to avoid wasting instructions on NOPs when aligning other points.

Your specific choices aren't ideal

See "What methods can be used to efficiently extend instruction length on modern x86?": more than 3 prefixes on one instruction can create big slowdowns on some CPUs, including Silvermont-family cores like the E-cores in Alder Lake. So spread the padding over multiple instructions in the basic block before the position you want aligned. Prefer padding later instructions so the front-end can get more instructions decoded earlier, unless you have a lot of very short instructions (like 1 or 2 bytes each), short enough that a 16-byte fetch block could still include more than 5 or 6 of them. (In anticipation of future CPUs with wide legacy decode, if there aren't any already.)
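As a rough sketch of spreading the padding out (NASM syntax, using CS prefixes as in the question's update): the lea is a hypothetical extra instruction, added only so there is more than one instruction to pad, and the nop stands in for the question's int 3 so the function can actually run. The byte counts in the comments are what I'd expect from these encodings; check them with a disassembler rather than trusting them.

```nasm
bits 64
section .text
global main
main:                            ; assumed to start 16-byte aligned
    nop                          ; 90                    1 byte,  offset 0
    db    0x2E                   ; one CS prefix on the earlier instruction
    mov   rax, strict dword 1    ; 48 C7 C0 01 00 00 00  7 bytes, 8 with prefix
    db    0x2E, 0x2E, 0x2E       ; three CS prefixes on the later instruction
    lea   rdx, [rcx + 8]         ; 48 8D 51 08           4 bytes, 7 with prefixes
    add   rcx, rax               ; offset 1 + 8 + 7 = 16 from main
    ret
```

No instruction carries more than 3 prefixes, and most of the padding sits on the later instruction.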

Using 7-byte mov rax, 1 in an assembler that doesn't optimize it to 5-byte mov eax, 1 is already one way to fill some bytes; 10-byte mov rax, strict qword 1 (NASM syntax; IDK if MASM can force an imm64) is another way to use more bytes. On Sandybridge-family, a 64-bit immediate fits efficiently in the uop cache (1 entry without needing an extra cycle to read it) when the 64-bit value isn't huge, i.e. is just the sign-extension of a 32-bit value. (https://agner.org/optimize/microarchitecture.pdf - Sandybridge chapter)
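For reference, a small NASM sketch of the three encodings mentioned above, all of which leave RAX = 1. The byte listings in the comments are the standard encodings as I understand them, so verify them against your assembler's listing output.

```nasm
bits 64
section .text
    mov   eax, 1                 ; B8 01 00 00 00                  5 bytes, zero-extends into RAX
    mov   rax, strict dword 1    ; 48 C7 C0 01 00 00 00            7 bytes, sign-extended imm32
    mov   rax, strict qword 1    ; 48 B8 01 00 00 00 00 00 00 00  10 bytes, full imm64
```

Picking a longer encoding of an instruction you need anyway fills bytes without adding any prefixes at all.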

ds or cs prefixes are a good choice, as they have meaning for most opcodes so it's unlikely that cs mov eax, 1 would be repurposed as the encoding for some different instruction. (e.g. the way rep bsr is the encoding for lzcnt, which does something different.) It's not impossible, especially for the no-modrm mov-to-register opcodes (mov r32,imm32 or mov r64,imm64, unlike the mov r/m64, sign_extended_imm32 you're using. https://www.felixcloutier.com/x86/mov)

I wouldn't recommend using prefixes like 66h that could potentially cause LCP pre-decode stalls on Intel CPUs, and can even change the meaning of many instructions. (Without REX.W setting the operand-size to 64-bit, it would change the meaning of mov eax, 0x00000001 to mov ax, 0x0001 with a 00 00 left over, which decodes as add [rax], al.)
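To make that failure mode concrete, here is a NASM sketch that only illustrates the byte split described above. It is meant to be assembled and disassembled, not executed.

```nasm
bits 64
section .text
    ; Do not execute this: it only shows how the bytes get re-split.
    db    0x66                   ; operand-size prefix in front of ...
    mov   eax, 1                 ; ... B8 01 00 00 00, which has no REX.W
    ; A disassembler should show the 6 bytes 66 B8 01 00 00 00 as:
    ;   66 B8 01 00    mov ax, 1
    ;   00 00          add [rax], al
```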

I'm not 100% sure it's well defined on paper what's supposed to happen with both a 66h and a REX.W prefix. (@fuz questioned this in comments.) A 67h address-size prefix or a segment override prefix is generally fine in 64-bit mode.

In practice on Skylake, the REX.W prefix wins and the 66h is ignored, and doesn't even cause false LCP stalls. But I wouldn't count on that on P6-family, and if it's not documented on paper what should happen with both 66h and REX.W, I'd worry about other vendors, or especially emulators and dynamic-translation software for that corner case.

Fun fact: Sandybridge-family doesn't LCP-stall on mov in general, but does on other instructions when a 66h prefix changes an opcode from having an imm32 to an imm16.

I just tried this, with NASM times 6 db 0x66 / mov rax, strict dword 1 (to match your encoding; NASM normally optimizes it to the architecturally equivalent mov eax,1). I put that inside a %rep 8000 block (to defeat the uop cache). It ran at 1.2 IPC on Skylake, with no counts for ild_stall.lcp in perf stat.
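A test along those lines could look roughly like the following. This is a reconstruction, not the exact file I ran: the outer loop, its repeat count, the file name, and the build/run commands in the comments are assumptions for a Linux static executable.

```nasm
; Assumed build/run commands (Linux):
;   nasm -felf64 lcp.asm && ld -o lcp lcp.o
;   perf stat -e cycles,instructions,ild_stall.lcp ./lcp
bits 64
global _start
section .text
_start:
    mov   ebp, 100000            ; outer repeat count: arbitrary assumption
.outer:
%rep 8000                        ; unrolled so the block is too big for the uop cache
    times 6 db 0x66              ; six operand-size prefixes ...
    mov   rax, strict dword 1    ; ... in front of 48 C7 C0 imm32 (REX.W form)
%endrep
    dec   ebp
    jnz   near .outer            ; body is about 104,000 bytes, so rel32 is needed
    mov   eax, 231               ; exit_group(0)
    xor   edi, edi
    syscall
```

Each repetition is 13 bytes (6 prefixes plus the 7-byte mov), so the unrolled body is far larger than the uop cache and has to go through the legacy decoders, which is where an LCP stall would show up.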

Even with add rax, strict qword 1 (to force the add rax, sign_extended_imm32 no-modrm encoding), Skylake doesn't LCP stall, running at 1.0 IPC (bottleneck on latency). Same for add rcx, strict qword 1 for the imm32 encoding with a ModRM.

I wouldn't recommend a 66h prefix, but it happens not to break correctness (with a REX.W) or performance on Skylake. I didn't test on any other CPUs or emulators, and I'm not claiming this use of 66h is safe anywhere else. (Although it probably is on earlier Intel CPUs, at least for correctness if not performance.)