Tags: assembly, optimization, x86-64, att, micro-optimization

test $x,%dil vs. test $x,%edi


I want to test a bit in a register, namely the second-lowest one in %rdi. Naively I would write test $2, %edi (or and $2, %edi; I don't know whether and would be better, since the rest of the register is irrelevant at this point).

I checked what clang and gcc generate (for a dummy void TEST(long X){ if(X&2) abort(); }). They seem similarly split on test vs. and, but both surprised me by addressing the register as %dil rather than %edi.
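For reference, here is a self-contained variant of that dummy function that can be fed to a compiler. The bit1_set helper is hypothetical (it is not part of my original test case); I only split it out so the predicate can be checked without triggering abort().

```c
#include <stdlib.h>

/* Hypothetical helper (not in the original test case): returns 1 when
   the second-lowest bit (mask 2) of x is set, 0 otherwise. */
int bit1_set(long x)
{
    return (x & 2) != 0;
}

/* The dummy function from the question: abort when bit 1 of X is set. */
void TEST(long X)
{
    if (bit1_set(X))
        abort();
}
```

Compiling this with gcc -O2 -S or clang -O2 -S and reading the generated assembly should show which instruction and which register name the compiler picked for the bit test.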

What might be the reason for this?


Solution

  • Both ways perform equally; the only difference is code size. Reading a low-8 partial register such as %dil never has a penalty. Reading a high-8 register (test $2, %bh for example) has extra latency on Haswell and later, but it still saves code size and doesn't hurt front-end throughput.

    There is no test $sign_extended_imm8, r/m32 encoding, so the 32-bit form has to carry a full imm32. Using 8-bit operand-size therefore saves code size, even though it requires a REX prefix to encode DIL. (https://www.felixcloutier.com/x86/test)

    Since the value of X isn't needed after the test, you actually could use and $imm8, %edi (3 bytes) to save code size. But and/jnz can't macro-fuse on AMD CPUs or on Intel before Sandybridge, so compilers prefer the instruction that only writes FLAGS. I suspect nobody has implemented the peephole optimization of using and instead of test with -mtune=sandybridge when the register isn't needed later.

       0:   f7 c7 02 00 00 00       test   edi,0x2    # imm32 = 2
       6:   40 f6 c7 02             test   dil,0x2    # REX prefix with no bits set
       a:   f6 c7 02                test   bh,0x2     # same byte without REX
    
       d:   83 e7 02                and    edi,0x2
      10:   40 80 e7 02             and    dil,0x2
      14:   80 e7 02                and    bh,0x2
    
      17:   0f ba e7 01             bt     edi,0x1    # imm8 is a bit index (bit 1 = mask 2); sets CF; can't macro-fuse with JCC
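As a rough cross-check of the code-size comparison, the sketch below stores three of the encodings from the listing above as byte arrays and lets the compiler count them. It also compares a full-width mask against a low-byte-only mask for the value 2, which is the reason the narrower operand-size gives the same answer here. All array and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Machine-code bytes copied from the disassembly listing above. */
static const unsigned char test_edi_imm32[] = { 0xf7, 0xc7, 0x02, 0x00, 0x00, 0x00 };
static const unsigned char test_dil_imm8[]  = { 0x40, 0xf6, 0xc7, 0x02 };
static const unsigned char and_edi_imm8[]   = { 0x83, 0xe7, 0x02 };

/* Encoding lengths in bytes, computed by the compiler. */
size_t len_test_edi(void) { return sizeof test_edi_imm32; }
size_t len_test_dil(void) { return sizeof test_dil_imm8; }
size_t len_and_edi(void)  { return sizeof and_edi_imm8; }

/* Bit test on the full value, analogous to test $2, %edi. */
int bit1_full(long x) { return (x & 2) != 0; }

/* Bit test on the low byte only, analogous to test $2, %dil.
   The conversion to unsigned char keeps just the low 8 bits. */
int bit1_low8(long x) { return ((unsigned char)x & 2) != 0; }

/* Compile-time check that the 8-bit form is the shorter test encoding. */
static_assert(sizeof test_dil_imm8 < sizeof test_edi_imm32,
              "8-bit test should be shorter than the imm32 form");
```

Because mask 2 lies entirely within the low byte, bit1_full and bit1_low8 agree for every input; that would not hold for a mask with bits above bit 7.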