assembly x86 move x86-64

How to move two 32-bit registers into one 64-bit register?


Let's say that I want to put two 32-bit registers (EAX as low 32-bit dword and EDX as high 32-bit dword) into RAX. I have found one way:

shl   rdx, 32
or    rax, rdx

This method works only if we are sure that bits 32 to 63 of RAX are 0. If we are not sure about that, then we must first clear the high 32-bit dword, like:

mov   eax, eax      ; writing EAX zero-extends into RAX, clearing bits 32 to 63
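
To illustrate why the zero-extension step matters, here is a minimal C sketch that models the registers as 64-bit integers. The function names combine_no_clear and combine_with_clear are hypothetical, chosen only for this illustration; the comments map each statement to the instruction it stands for.

```c
#include <stdint.h>

/* Models `shl rdx, 32` / `or rax, rdx` with no prior zero-extension.
   rax and rdx stand for the full 64-bit register contents. */
uint64_t combine_no_clear(uint64_t rax, uint64_t rdx) {
    rdx <<= 32;           /* shl rdx, 32: the old high dword of RDX is shifted out */
    return rax | rdx;     /* or rax, rdx: set bits in RAX[63:32] leak into the result */
}

/* Models `mov eax, eax` first, then the same two instructions. */
uint64_t combine_with_clear(uint64_t rax, uint64_t rdx) {
    rax = (uint32_t)rax;  /* mov eax, eax: zero-extends EAX into RAX */
    rdx <<= 32;           /* shl rdx, 32 */
    return rax | rdx;     /* or rax, rdx */
}
```

Note that garbage in the high dword of RDX is harmless either way, because the shift discards it; only the high dword of RAX needs to be cleared.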

Solution

  • Perhaps this is a tad better:

    shl     rax,32
    shrd    rax,rdx,32
    

    Does not assume that high dwords are zero.


    Note that shrd is a bit slow on AMD, e.g. 4 uops on Zen 4 (see https://uops.info/). On Intel P-cores, it's 1 uop but has 3 cycle latency, and it can only run on port 1. The 64-bit version is much slower on Intel E-cores: 14 uops.

    This is one option if you don't know that RAX = zero_extend(EAX). That condition does hold after any instruction that writes EAX, including things like RDTSC, but the mainstream calling conventions do not guarantee it for functions that return a 32-bit value in EAX. For example, int foo() { return long_long_func(); } will return with garbage in the high half of RAX after the compiler optimizes the function call into a tailcall.

    The shl / shrd sequence is slower for both throughput and latency on many modern CPUs (AMD, and Intel E-cores) than mov ecx, eax / shl rdx, 32 / or rcx, rdx. On Intel P-cores, the mov / shl / or sequence has a worse front-end throughput cost (3 uops instead of 2) but better latency. (It also avoids needing a port-1 uop for back-end throughput, in case you have a lot of those.)

    (Producing the result in a register other than RAX allows mov-elimination to work. But it's actually fine to do mov eax,eax with 1 cycle latency if both inputs are ready at the same time, since the zero-extension mov runs in parallel with the shl, both creating inputs for or.)
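
As a rough check that the shl / shrd sequence from this answer really is independent of the high dwords, the following C sketch models both instructions on 64-bit integers. The helper names shrd64 and combine_shrd are hypothetical, and shrd64 only covers shift counts strictly between 0 and 64, which is all this sequence needs.

```c
#include <stdint.h>

/* Models SHRD dst, src, count for 64-bit operands, assuming 0 < count < 64:
   dst is shifted right and the vacated high bits are filled from the low bits of src. */
uint64_t shrd64(uint64_t dst, uint64_t src, unsigned count) {
    return (dst >> count) | (src << (64 - count));
}

/* Models `shl rax, 32` followed by `shrd rax, rdx, 32`. */
uint64_t combine_shrd(uint64_t rax, uint64_t rdx) {
    rax <<= 32;                   /* shl rax, 32: EAX moves to the high dword, old high dword is discarded */
    return shrd64(rax, rdx, 32);  /* shrd rax, rdx, 32: EAX returns to the low dword, EDX fills the high dword */
}
```

For example, combine_shrd(0xFFFFFFFF11111111, 0xAAAAAAAA22222222) yields 0x2222222211111111: the stale high dwords of both inputs are shifted out and never reach the result.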