visual-c++assemblyx86visual-c++-2010

Why does VC++ 2010 often use ebx as a "zero register"?


Yesterday I was looking at some 32 bit code generated by VC++ 2010 (most probably; don't know about the specific options, sorry) and I was intrigued by a curious recurring detail: in many functions, it zeroed out ebx in the prologue, and it always used it like a "zero register" (think $zero on MIPS). In particular, it often:

Most importantly, registers on "classic" x86 are a scarce resource, if you start having to spill registers you waste a lot of time for no good reason; why waste one through all the function just to keep a zero in it? (still, thinking about it, I don't remember seeing much register spillage in functions that used this "zero-register" pattern).

So: what am I missing? Is it a compiler blooper or some incredibly smart optimization that was particularly interesting in 2010?

Here's an excerpt:

    ; standard prologue: ebp/esp, SEH, overflow protection, ... then:
    xor     ebx, ebx
    mov     [ebp+4], ebx        ; zero out some locals
    mov     [ebp], ebx
    call    function_1
    xor     ecx, ecx            ; ebx _not_ used to zero registers
    cmp     eax, ebx            ; ... but used for compares?! why not test eax,eax?
    setnz   cl                  ; what? it goes through cl to check if eax is not zero?
    cmp     ecx, ebx            ; still, why not test ecx,ecx?
    jnz     function_body
    push    123456
    call    throw_something
function_body:
    mov     edx, [eax]
    mov     ecx, eax            ; it's not like it was interested in ecx anyway...
    mov     eax, [edx+0Ch]
    call    eax                 ; virtual method call; ebx is preserved but possibly pushed/popped
    lea     esi, [eax+10h]
    mov     [ebp+0Ch], esi
    mov     eax, [ebp+10h]
    mov     ecx, [eax-0Ch]
    xor     edi, edi            ; ugain, registers are zeroed as usual
    mov     byte ptr [ebp+4], 1
    mov     [ebp+8], ecx
    cmp     ecx, ebx            ; why not test ecx,ecx?
    jg      somewhere

label1:
    lea     eax, [esi-10h]
    mov     byte ptr [ebp+4], bl    ; ok, uses bl to write a zero to memory
    lea     ecx, [eax+0Ch]
    or      edx, 0FFFFFFFFh
    lock xadd [ecx], edx
    dec     edx
    test    edx, edx            ; now it's using the regular test reg,reg!
    jg      somewhere_else

Notice: an earlier version of this question said that it used mov reg,ebx instead of xor ebx,ebx; this was just me not remembering stuff correctly. Sorry if anybody put too much thought trying to understand that.


Solution

  • Everything you commented on as odd looks sub-optimal to me. test eax,eax sets all flags (except AF) the same as cmp against zero, and is preferred for performance and code-size.

    On P6 (PPro through Nehalem), reading long-dead registers is bad because it can lead to register-read stalls. P6 cores can only read 2 or 3 not-recently-modified architectural registers from the permanent register file per clock (to fetch operands for the issue stage: the ROB holds operands for uops, unlike on SnB-family where it only holds references to the physical register file).

    Since this is from VS2010, Sandybridge wasn't released yet, so it should have put a lot of weight on tuning for Pentium II/III, Pentium-M, Core2, and Nehalem where reading "cold" registers is a possible bottleneck.

    IDK if anything like this ever made sense for integer regs, but I don't know much about optimizing for CPUs older than P6.


    The cmp / setz / cmp / jnz sequence looks particularly braindead. Maybe it came from a compiler-internal canned sequence for producing a boolean value from something, and it failed to optimize a test of the boolean back into just using the flags directly? That still doesn't explain the use of ebx as a zero-register, which is also completely useless there.

    Is it possible that some of that was from inline-asm that returned a boolean integer (using a silly that wanted a zero in a register)?

    Or maybe the source code was comparing two unknown values, and it was only after inlining and constant-propagation that it turned into a compare against zero? Which MSVC failed to optimize fully, so it still kept 0 as a constant in a register instead of using test?


    (the rest of this was written before the question included code).

    Sounds weird, or like a case of CSE / constant-hoisting run amok. i.e. treating 0 like any other constant that you might want to load once and then reg-reg copy throughout the function.

    Your analysis of the data-dependency behaviour is correct: moving from a register that was zeroed a while ago essentially starts a new dependency chain.


    When gcc wants two zeroed registers, it often xor-zeroes one and then uses a mov or movdqa to copy to the other.

    This is sub-optimal on Sandybridge where xor-zeroing doesn't need an execution port, but a possible win on Bulldozer-family where mov can run on the AGU or ALU, but xor-zeroing still needs an ALU port.

    For vector moves, it's a clear win on Bulldozer: handled in register rename with no execution unit. But xor-zeroing of an XMM or YMM register still needs an execution port on Bulldozer-family (or two for ymm, so always use xmm with implicit zero-extension).

    Still, I don't think that justifies tying up a register for the duration of a whole function, especially not if it costs extra saves/restores. And not for P6-family CPUs where register-read stalls are a thing.