assembly optimization x86 stack cpu-registers

Why do we use sub esp, 4 instead of push a register in assembly?


If we use

push ecx

we use only one byte for the opcode, but if we use

sub esp, 4 

I think it takes 2 bytes? I've tried to read the documentation but didn't understand much. Is the reason the same as for using

xor eax, eax 

instead of

mov eax, 0

Solution

  • TL;DR: Clang already uses push; GCC doesn't, except at -Os. I haven't benchmarked, but push looks very reasonable.


    Code-size isn't everything. A dummy push is still a real store that takes up a store-buffer entry until it commits to cache. In fact, code-size is usually the last thing to worry about: it only matters when all else is equal (number of front-end uops, avoiding back-end bottlenecks, avoiding any performance pitfalls).

    Historically (16-bit x86 before CPUs had caches), push cx would probably not have been faster than sub sp, 2 (3 bytes) or dec sp / dec sp (2 bytes) on those old CPUs where memory bandwidth was the main factor in performance (including for code-fetch). Optimizing for speed on 8088 especially is often the same as optimizing for code size, but not always when the smaller instructions involve extra memory accesses.

    The reason xor eax,eax is still preferred is that later CPUs were able to make it still at least as fast even apart from the code-size advantage. (What is the best way to set a register to zero in x86 assembly: xor, mov or and?)


    On later CPUs like PPro, push decoded to multiple uops (to adjust ESP and separately to store). So on those CPUs, despite the smaller code size, it costs more in the front-end. Or on P5 Pentium (which didn't decode complex instructions into multiple uops), push stalled the pipeline temporarily and was often avoided by compilers even when the store-to-memory side effect was desired.

    But finally, around Pentium-M, CPUs got a "stack engine" that handles the ESP-update part of stack operations outside of the out-of-order back-end, making it single-uop and zero latency (for the dep chain through ESP). As you can see from that link, the stack-sync uops that the stack engine sometimes has to insert do make sub esp,4 cost more than push, if you weren't already going to reference esp directly in the back-end before the next stack op (like call).

    IDK if it really would have been a good idea to start using dummy push ecx on CPUs that old, or if limited store-buffer sizes meant that it wasn't a good idea to use up execution resources on doing dummy stores, even to cache lines that were almost certainly hot (the top of the stack).

    But anyway, modern compilers do use this peephole optimization, especially in 64-bit mode where needing to adjust the stack by only one push is common. Modern CPUs have large store buffers.

    void foo();
    
    int bar() {
        foo();
        return 0;
    }
    

    Clang has been doing this for several years, e.g. with current clang 10.0 -O3 (optimize for speed over size) on Godbolt:

    bar():
            push    rax
            call    foo()
            xor     eax, eax
            pop     rcx
            ret
    

    GCC does this at -Os, but not at -O3 (I tried with -march=skylake; it still chooses sub.)
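    For comparison, GCC's -O3 output for the same function looks something like this (quoted from memory of Godbolt output, so treat it as illustrative rather than exact):

    bar():
            sub     rsp, 8
            call    foo()
            xor     eax, eax
            add     rsp, 8
            ret
    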

    It's harder to construct a case where sub esp,4 would be useful, but this works:

    int bar() {
        volatile int arr[1] = {0};
        return 0;
    }
    

    clang 10.0 -m32 -O3 -mtune=skylake

    bar():                                # @bar()
            push    eax
            mov     dword ptr [esp], 0     # missed optimization for push 0
            xor     eax, eax
            pop     ecx
            ret
    

    Unfortunately compilers don't spot the fact that push 0 could have both initialized and reserved space for the volatile int object, replacing both push eax and mov dword [esp], 0 - What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?
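    A hand-written version of that idea might look like the following sketch (my own asm, not compiler output): push 0 is only 2 bytes (6A 00) and both reserves the 4 bytes and stores the zero in one instruction.

    bar():
            push    0          # reserve 4 bytes and initialize the volatile int to 0
            xor     eax, eax   # return value
            pop     ecx        # deallocate; the value loaded into ecx is a don't-care
            ret
    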