[SOLVED] In x86_64, does a 32-bit cmov clear the top bits if the condition is false?

In 64-bit mode on x86, most 32-bit arithmetic operations clear the top 32 bits of the destination register. What if the operation is a "cmov" instruction and the condition is false? (This case does not seem to be covered in the reference manuals I've looked at.)

Solution

Yes. A 32-bit cmov always zero-extends into the full 64-bit destination register, even when the condition is false, like other instructions that write a 32-bit register.

Think of CMOV as always writing its destination: it's an ALU select operation (3 inputs: 2 integer operands and flags, 1 output).

It's not like ARM 32-bit mode predicated instructions that truly act like a NOP when the condition is false.
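The "select, then always write" view can be sketched as a small C model. This is only an illustration of the semantics described above, not a test on real hardware; the names cmov32_model and cmov32_model_examples are hypothetical helpers made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of `cmovcc r32, r32` operating on full 64-bit register values.
 * Hypothetical helper: dst64 is the old value of the 64-bit destination
 * register, src64 the value of the 64-bit source register, and cond the
 * result of evaluating the condition code against FLAGS. */
uint64_t cmov32_model(uint64_t dst64, uint64_t src64, bool cond)
{
    /* The select picks between the low 32 bits of the two operands... */
    uint32_t selected = cond ? (uint32_t)src64 : (uint32_t)dst64;

    /* ...and the result is written back either way, zero-extended to 64 bits. */
    return (uint64_t)selected;
}

/* Usage example for the model. */
void cmov32_model_examples(void)
{
    uint64_t rax = 0xFFFFFFFF12345678ULL;
    uint64_t rcx = 0xAAAAAAAADEADBEEFULL;

    /* Condition true: low half comes from the source, upper half is zeroed. */
    assert(cmov32_model(rax, rcx, true) == 0x00000000DEADBEEFULL);

    /* Condition false: low half is kept, but the upper half is still zeroed. */
    assert(cmov32_model(rax, rcx, false) == 0x0000000012345678ULL);
}
```

In this model the only thing the condition changes is which 32-bit value gets selected; the write to the destination, and therefore the zero-extension, happens in both cases.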

(For the same reason, cmovcc reg, [mem] always loads the memory operand, even if the condition is false, and doesn't do fault-suppression on a bad address. Again, it's not the move itself that's conditional; the instruction unconditionally moves the result of a conditional-select operation. AArch64 picked a better name for its equivalent instruction: csel.)

There is one case where a 32-bit destination may not be zero-extended: bsr and bsf r32, r/m32 leave the destination unmodified when the source is zero. (This is only documented by AMD: "If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register." Intel implements it as well.) In practice, on Intel CPUs at least, this includes leaving the upper bits unmodified after an instruction like bsf eax, ecx. I haven't tested AMD.

(This is why the so-called "false" dependency of BSF and BSR on the destination isn't really false: implementing this behaviour branchlessly requires a true dependency on the old value. It's only a genuinely false output dependency for LZCNT/TZCNT/POPCNT, which on Intel run on the same execution unit but always overwrite the destination.)
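As a rough sketch of the bsf r32, r/m32 behaviour described above, here is a C model. It assumes the source=0 case leaves all 64 bits of the destination untouched, as observed on the Intel CPUs mentioned in this answer; that is not an architectural guarantee, and the names bsf32_model and bsf32_model_examples are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of `bsf r32, r/m32` operating on the full 64-bit destination.
 * Assumption: with a zero source, the whole 64-bit register is left
 * unmodified (documented by AMD for the destination, observed on Intel). */
uint64_t bsf32_model(uint64_t dst64, uint32_t src32, bool *zf)
{
    if (src32 == 0) {
        *zf = true;
        return dst64;   /* old value passes through: a true input dependency */
    }

    *zf = false;
    uint32_t index = 0;
    while (((src32 >> index) & 1u) == 0)  /* terminates because src32 != 0 */
        index++;

    return (uint64_t)index;  /* normal case: 32-bit result, zero-extended */
}

/* Usage example for the model. */
void bsf32_model_examples(void)
{
    uint64_t rax = 0xFFFFFFFF12345678ULL;
    bool zf = false;

    /* Source is zero: ZF is set and the upper bits are NOT cleared. */
    assert(bsf32_model(rax, 0, &zf) == rax && zf);

    /* Source is 0x50 (lowest set bit is bit 4): result is zero-extended. */
    assert(bsf32_model(rax, 0x50, &zf) == 4 && !zf);
}
```

The model's return value depends on dst64 whenever the source is zero, which is the same reason the hardware needs the old destination as an input.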

Wikipedia claims there's some kind of difference between Intel and AMD in the upper bits after bsf r32, r/m32. It seems to be saying that Intel (or maybe AMD; the phrasing is somewhat ambiguous) leaves the upper bits undefined in the source=0 case, instead of unmodified.

In my testing on Sandybridge-family and Core 2, the upper bits always seem to be left unmodified, but I don't have access to a P4 Nocona / Prescott, which was the first-gen IA-32e microarchitecture.

The Wikipedia editor who wrote that may just be misinterpreting Intel's documentation, which says the whole destination register is "undefined" in this case. (But it's normal for Intel's silicon to go beyond what they guarantee on paper, so that existing software they care about, e.g. Windows, keeps working.) I don't know of another source for that claim, so I guess [citation-needed] would truly be appropriate here.