If we use `push ecx`, the opcode is one byte; if we use `sub esp, 4`, I think it takes 3 bytes? I've tried to read the documentation but I didn't understand much. The reasoning would be the same as using `xor eax, eax` instead of `mov eax, 0`.

TL;DR: Clang already uses `push`. GCC doesn't, except at `-Os`. I haven't benchmarked, but `push` looks very reasonable.
Code-size isn't everything. A dummy push is still a real store that takes up a store-buffer entry until it commits to cache. In fact, code size is usually the last thing to worry about, mattering only when all else is equal (number of front-end uops, avoiding back-end bottlenecks, avoiding any performance pitfalls).
Historically (16-bit x86, before CPUs had caches), `push cx` would probably not have been faster than `sub sp, 2` (3 bytes) or `dec sp` / `dec sp` (2 bytes) on those old CPUs, where memory bandwidth was the main factor in performance (including for code fetch). Optimizing for speed on the 8088 especially is often the same as optimizing for code size, but not always when the smaller instructions involve extra memory accesses.
The reason `xor eax, eax` is still preferred is that later CPUs were able to make it at least as fast even apart from the code-size advantage. (See: What is the best way to set a register to zero in x86 assembly: xor, mov or and?)
On later CPUs like PPro, `push` decoded to multiple uops (one to adjust ESP, a separate one to store). So on those CPUs, despite the smaller code size, it costs more in the front-end. And on P5 Pentium (which didn't decode complex instructions into multiple uops), `push` temporarily stalled the pipeline and was often avoided by compilers even when the store-to-memory side effect was actually desired.
But finally, around Pentium-M, CPUs got a "stack engine" that handles the ESP-update part of stack operations outside of the out-of-order back-end, making `push` single-uop and zero latency (for the dep chain through ESP). As you can see from that link, the stack-sync uops that the stack engine sometimes has to insert do make `sub esp, 4` cost more than `push`, if you weren't already going to reference `esp` directly in the back-end before the next stack op (like `call`).
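As a sketch of where such a sync uop shows up (my own annotation of the mechanism described above, not a trace from any real tool):

```asm
push  eax            ; stack engine tracks the ESP offset; no back-end ESP uop
push  ecx            ; still handled entirely by the stack engine
mov   edx, [esp+4]   ; explicit ESP reference: a stack-sync uop runs first
                     ; to materialize the true ESP value in the back-end
call  foo            ; implicit ESP use; handled by the stack engine again
```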
IDK if it really would have been a good idea to start using dummy `push ecx` on CPUs that old, or if limited store-buffer sizes meant that it wasn't a good idea to use up execution resources on doing dummy stores, even to cache lines that were almost certainly hot (the top of the stack).
But anyway, modern compilers do use this peephole optimization, especially in 64-bit mode where needing to adjust the stack by only one push is common. Modern CPUs have large store buffers.
```c
void foo();

int bar() {
    foo();
    return 0;
}
```
Clang has been doing this for several years, e.g. with current clang 10.0 at `-O3` (optimize for speed over size) on Godbolt:
```asm
bar():
        push    rax
        call    foo()
        xor     eax, eax
        pop     rcx
        ret
```
GCC does this at `-Os`, but not at `-O3`. (I tried with `-march=skylake`; it still chooses to use `sub`.)
It's less easy to construct a case where `sub esp, 4` would be useful, but this works:
```c
int bar() {
    volatile int arr[1] = {0};
    return 0;
}
```
clang 10.0 `-m32 -O3 -mtune=skylake`:
```asm
bar():                              # @bar()
        push    eax
        mov     dword ptr [esp], 0  # missed optimization for push 0
        xor     eax, eax
        pop     ecx
        ret
```
Unfortunately compilers don't spot the fact that `push 0` could have both reserved space for and initialized the `volatile int` object, replacing both `push eax` and `mov dword [esp], 0`. See also: What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?
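For reference, the code this paragraph is wishing for would look something like the following (hand-written sketch; neither compiler actually emits it):

```asm
bar():
        push    0              ; reserve 4 bytes *and* zero-init the volatile int
        xor     eax, eax       ; return value
        pop     ecx            ; dummy pop just to restore ESP
        ret
```

`push 0` is only 2 bytes (`6A 00`), replacing the 1-byte `push eax` plus the 7-byte `mov dword ptr [esp], 0` with a single store.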