I'm making a stack based language as a fun personal project. So, I have some signed/unsigned 32-bit values on the stack and my goal is to write some assembly macros that operate on this stack. Ideally these will be small since they'll be used a lot. Since I'm new to x86 assembly I was wondering if you guys had any tips or improvements you could think of. I'd greatly appreciate your time, thanks!
Note: An optimizer is used after the macros are expanded to avoid cases like pop eax; push eax, so don't worry about that!
<SIGNED/UNSIGNED-COMPARISONS>
pop eax
cmp dword [esp], eax
setl byte [esp] ;setle, setg, setge, setb, setbe, seta, setae, sete, setne for other comparisons
and dword [esp], 1
<NEGATE-STACK>
neg dword [esp]
<NOT-STACK>
not dword [esp]
<DEREF-STACK>
mov eax, dword [esp] ;eax = address
mov eax, dword [eax] ;eax = *address
mov dword [esp], eax ;stack = eax
<SHIFT>
pop ecx
shl dword [esp], cl ;sar, shr for other shifts
<ADD-STACK>
pop eax
add dword [esp], eax ;sub, and, or, xor
<AND-STACK-LOGICAL>
cmp dword [esp], 0
setne byte [esp]
and dword [esp], 1
cmp dword [esp+4], 0
setne byte [esp+4]
and dword [esp+4], 1
pop eax
and dword [esp], eax ;or
<NOT-STACK-LOGICAL>
cmp dword [esp], 0
sete byte [esp]
and dword [esp], 1
<MULTIPLY-STACK>
pop eax
imul eax, dword [esp] ;imul works for signed and unsigned
mov dword [esp], eax
<DIVIDE-UNSIGNED-STACK>
pop ebx ;ebx = divisor (top of stack)
pop eax ;eax = dividend
mov edx, 0 ;zero-extend dividend into edx:eax
div ebx ;eax = quotient, edx = remainder
push eax ;edx for modulo
<DIVIDE-SIGNED-STACK>
pop ebx ;ebx = divisor (top of stack)
pop eax ;eax = dividend
cdq ;sign-extend eax into edx:eax
idiv ebx
push eax ;edx for modulo
Note: An optimizer is used after the macros are expanded to avoid cases like pop eax; push eax so don't worry about that!
In that case you should try to end with a result in a register that you push, allowing the optimizer to chain register operations without store/reload when you do multiple things to the top of the stack.
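For instance (an illustrative sketch, assuming register-ending variants of the macros rather than the memory-destination ones above), two back-to-back macros leave a push/pop pair the optimizer can delete, keeping the value live in EAX:
; <ADD-STACK> immediately followed by <NEGATE-STACK>, register-ending style
pop ecx           ; second operand
pop eax           ; first operand
add eax, ecx
push eax          ; result pushed from a register...
pop eax           ; ...so this push/pop pair can be optimized away
neg eax
push eax
; after the optimizer: pop ecx / pop eax / add eax, ecx / neg eax / push eax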
Push and pop are 1-byte instructions that decode to 1 uop each, and the stack engine in modern x86 CPUs handles the update of ESP, avoiding a data dependency chain through ESP or an extra uop to add/sub ESP. (Using [esp] explicitly in an addressing mode forces a stack-sync uop on Intel CPUs, though, so mixing push/pop and direct access to the stack can be worse. OTOH, add [esp], value can micro-fuse the load+add operations, where separate pop/add/push couldn't.)
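To make that concrete, here are the two styles for something like <ADD-STACK> (a sketch; which one wins depends on whether the surrounding push/pop pairs can be eliminated, as discussed further down):
; memory-destination style: load+add micro-fuse, so 2 fused-domain uops with the store,
; but the explicit [esp] reference costs a stack-sync uop after earlier push/pop
pop eax
add dword [esp], eax

; pop/operate/push style: more uops on its own and no micro-fusion,
; but no [esp] reference, and the trailing push may cancel with the next macro's leading pop
pop eax
pop ecx
add ecx, eax
push ecx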
Also note that minimizing code-size in bytes is usually secondary to keeping critical-path latency low (easily a problem for a stack architecture that's already going to be storing / reloading with naive JIT transliteration into machine code, not really optimizing between stack ops like a JVM would), and secondary to minimizing front-end uop counts. All else equal, smaller code size is usually best, though.
Not all instructions decode to a single uop; e.g. add [mem], reg decodes to two, and inc [mem] decodes to 3 because the load+inc can't micro-fuse together. Presumably there won't be enough instruction-level parallelism for back-end uops to matter, so all discussion of uop counts is in what Intel calls the "fused domain" for the front-end issue stage that feeds uops into the out-of-order back-end.
Just to be clear, this naive JIT of canned sequences with only minor optimization between them might work for a toy project, but real JIT compilers for stack-based bytecode like Java optimize much more, doing register allocation on a per-function basis. If you want actually good performance, this approach is a dead end. If you just want a hobby project to learn some asm and some compiler-construction, this looks like it might be fine. You'll probably want other ops, including swapping the top two elements (pop/pop / push/push).
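A minimal sketch of that swap, for instance:
<SWAP-STACK>
pop eax           ; old top
pop ecx           ; old second element
push eax          ; old top goes underneath...
push ecx          ; ...old second element becomes the new top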
Your compare code is nasty, creating a store-forwarding stall when the final and does a dword load that overlaps with the byte store from the recent setcc (https://agner.org/optimize/). Use a tmp register like ECX or EDX.
<SIGNED/UNSIGNED-COMPARISONS> version 1
pop ecx ; 1 byte, 1 uop, maybe optimized out
xor eax,eax ; 2 bytes, 1 uop
cmp dword [esp], ecx ; 3 bytes (ESP addressing mode needs a SIB), 1 uop (micro-fused load+cmp)
setl al ; 3 bytes (2-byte 0F opcode), 1 uop ;setle, setg, setge, setb, setbe, seta, setae, sete, setne for other comparisons
mov [esp], eax ; 3 bytes (ESP addr mode), 1 uop (micro-fused store-address + store-data)
xor-zeroing avoids a possible partial-register penalty for reading EAX after writing AL. Notice that we only write memory once, not store then reload+store again with a memory-destination and. The same optimization applies for <NOT-STACK-LOGICAL> (sketched a bit further down), and somewhat in <AND-STACK-LOGICAL>.
Total of 11 or 12 bytes (11 if the leading pop optimizes out), and 4 or 5 uops plus 1 stack-sync uop, so 5 or 6. But if we optimize for push/pop, giving up on micro-fusion of cmp+load in favour of hopefully optimizing away a push/pop pair with the next macro:
<SIGNED/UNSIGNED-COMPARISONS> version 2
pop ecx ; 1 byte, 1 uop, hopefully optimized out
xor eax,eax ; EAX=0 ; 2 bytes, 1 uop
pop edx ; 1 byte, 1 uop
cmp edx, ecx ; 2 bytes, 1 uop
setl al ; 3 bytes, 1 uop ;setle, setg, setge, setb, setbe, seta, setae, sete, setne for other comparisons
push eax ; 1 byte, 1 uop, hopefully optimized out
;; no references to [esp] needing a stack-sync uop
4 to 6 uops if the leading pop and/or trailing push optimize out. 8 to 10 bytes. Being able to optimize out the trailing push also saves an instruction in the next block if it happens. But even more importantly, avoids store/reload latency on the critical path dependency chain.
It's barely worse in the case where it can't optimize away the trailing push, and better if it can, so this is probably good.
If you can select between pop/operate/push versions vs. op [esp] versions based on whether optimizing away the push/pop is possible, that would let you choose the best of both worlds.
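Applying the same idea to <NOT-STACK-LOGICAL>, for example (a sketch in the pop/setcc/push style; the [esp] style from version 1 works just as well):
<NOT-STACK-LOGICAL>
pop ecx
xor eax, eax      ; zero EAX so reading it after writing AL is safe
test ecx, ecx
setz al           ; AL = (value == 0), i.e. the logical NOT
push eax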
<DIVIDE-UNSIGNED-STACK>
There's no reason to use EBX here; you can use ECX, which you're already clobbering for shifts, letting you potentially keep something useful in EBX. You can also use xor-zeroing for EDX. For some reason you decided to pop both operands here ahead of div, instead of div dword [esp]. That's fine; as discussed above, the push/pop can hopefully optimize away.
<DIVIDE-UNSIGNED-STACK>
pop ecx ;divisor
pop eax ;dividend
xor edx, edx ;zero-extend dividend into edx:eax
div ecx
push eax ;edx for modulo
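The same register choice carries over to the signed divide; a sketch, with cdq sign-extending instead of zeroing EDX:
<DIVIDE-SIGNED-STACK>
pop ecx ;divisor
pop eax ;dividend
cdq ;sign-extend EAX into EDX:EAX
idiv ecx
push eax ;edx for modulo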
Your AND-STACK-LOGICAL should compute in registers, not memory. It's annoying to get a full dword-width result and also avoid false dependencies / partial-register shenanigans. For example, even GCC chooses to write DL without first xor-zeroing EDX for return a && b;: https://godbolt.org/z/hS9RrA. But fortunately we can choose wisely and mitigate the false dependency for modern CPUs by writing a register that was already part of the dependency chain leading to this instruction. (and eax,edx after writing dl would still lead to a partial-register stall on Nehalem / Core2 or earlier, but those are obsolete.)
<AND-STACK-LOGICAL>
pop ecx
pop edx
xor eax, eax
test ecx, ecx ; peephole optimization for cmp reg,0 when you have a value in a reg
setnz al ; same instruction (opcode) as setne, but the semantic meaning for humans is better for NZ than NE
test edx, edx
setnz dl
and eax, edx ; high garbage in EDX above DL ANDs away to zero.
push eax
; net effect: one stack slot popped (2 pops + 1 push)
This is going to be smaller as well as much better than your AND-STACK-LOGICAL, avoiding multiple store-forwarding stalls.
<DEREF-STACK> could also use some work, as Nate points out:
<DEREF-STACK>
pop eax ;eax = address
push dword [eax] ;push *address
(or to optimize for push/pop elimination, mov eax, [eax] / push eax.)