Tags: assembly, x86-64, nasm, micro-optimization

Is it "too clever" for using LEA to load constant to register?


I'm studying x86-64 assembly with NASM, and here is my current situation:

At first, I wrote straightforward, easy-to-read code. Then I found some "clever" ways to initialize the registers that reduce instruction length.

I want to know whether these clever tricks bring a real reward, or do more harm than good.

This is the first version, written the straightforward way:

.loop:
    mov rax, -1
    mov rdx, 1 ; **
    mov rsi, 2 ; **

    ; ... loop body

    dec rcx
    jnz .loop

(**: The assembler actually emitted these lines as mov edx, 1 and mov esi, 2. I later found that the assembler optimized them for me, because writing EDX/ESI zeroes out the upper 32 bits of RDX/RSI anyway.)

That is 17 bytes at the start of the loop and 5 bytes at the end.
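(Byte counts like these can be double-checked by asking NASM for a listing file, which prints the encoded bytes next to each instruction. A minimal sketch; demo.asm and demo.lst are just placeholder names, and the exact bytes in the comments assume NASM's default optimization level:)

; demo.asm: assemble with   nasm -f elf64 -l demo.lst demo.asm
; demo.lst then shows the offset and machine-code bytes of every instruction,
; for example:
;   mov rax, -1   ->  48 C7 C0 FF FF FF FF   (7 bytes)
;   mov edx, 1    ->  BA 01 00 00 00         (5 bytes)
;   mov esi, 2    ->  BE 02 00 00 00         (5 bytes)
;   dec rcx       ->  48 FF C9               (3 bytes)
;   jnz .loop     ->  75 rel8                (2 bytes, short jump)
section .text
demo:
.loop:
    mov rax, -1
    mov edx, 1
    mov esi, 2
    dec rcx
    jnz .loop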

This is the second version, written the clever way:

.loop:
    xor eax, eax
    dec rax
    lea edx, [rax+2] ; ***
    lea esi, [rdx+1] ; ***

    ; ... loop body

    loop .loop

(***: I tried various combinations of 32-bit / 64-bit registers, and these combinations gave the shortest encodings.)

That is 11 bytes at the start of the loop and 2 bytes at the end.
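(For reference, the size differences between the register combinations come down to instruction prefixes. A short sketch of the variants, with byte counts taken from the standard x86-64 encoding rules rather than from an assembler listing:)

; 32-bit destination with a 64-bit base register: no prefix needed
    lea edx, [rax+2]    ; 8D 50 02       (3 bytes)
; a 64-bit destination needs a REX.W prefix
    lea rdx, [rax+2]    ; 48 8D 50 02    (4 bytes)
; a 32-bit base register needs a 67h address-size prefix
    lea edx, [eax+2]    ; 67 8D 50 02    (4 bytes)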


Solution

  • Whether it's a good idea to do this or not depends on your objective. Usually, it is not a good idea.

    If your objective is ease of understanding, you should avoid these tricks as they make your code harder to understand.

    If your objective is code size reduction, it might indeed be a good idea to make use of such tricks. You can do even better than you already did, though: for example, or rax, -1 sets rax to -1 in only 4 bytes, and push -1 followed by pop rax does it in only 3 bytes (both are sketched at the end of this answer).

    However, usually the objective is performance. Now when you optimise for performance, some tricks help, but others are detrimental. In particular, all the tricks you showed us in your question are detrimental to performance: each instruction in the clever sequence depends on the result of the previous one, so the initialisations can no longer execute in parallel, and the loop instruction is markedly slower than dec rcx / jnz on Intel CPUs (see the annotated sketch at the end of this answer).

    Note that when optimising for performance, occasionally it might still be a good idea to optimise for size. This is because longer code sequences take more space in the instruction cache, blocking other code from being cached. In big programs whose hot code paths do not entirely fit into L1 instruction cache, performance can benefit from code size optimisations, especially in cold paths that are rarely executed. However, this is a tricky thing to evaluate and strategies must be adapted to the case at hand. Let benchmarks guide your decisions in any case.
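For completeness, here is a small sketch of the size-oriented alternatives mentioned above. The byte counts come from the standard encodings; the two alternatives are independent, so pick one:

; alternative 1: 4 bytes, but it reads the old value of rax and writes the flags
    or rax, -1          ; 48 83 C8 FF
; alternative 2: 3 bytes in total, at the cost of a store and a load through the stack
    push -1             ; 6A FF    (2 bytes)
    pop rax             ; 58       (1 byte)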
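And to make the performance argument concrete, here is a sketch of both initialisation sequences with their data dependencies annotated:

; clever version: every instruction needs the result of the previous one,
; so the four instructions have to execute one after another
    xor eax, eax        ; rax = 0   (no inputs; recognised zeroing idiom)
    dec rax             ; rax = -1  (waits for the xor)
    lea edx, [rax+2]    ; rdx = 1   (waits for the dec)
    lea esi, [rdx+1]    ; rsi = 2   (waits for the first lea)

; straightforward version: three independent mov-immediates, which an
; out-of-order CPU can execute in parallel
    mov rax, -1
    mov edx, 1
    mov esi, 2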