assembly · x86 · cpu-architecture · instructions · conditional-move

Why does x86 only have 1 form of conditional move, not immediate or 8-bit?


I've noticed that the Conditional Move instruction is less extensible than the normal mov. For example, it doesn't support immediates and doesn't support the low-byte of a register.

Out of curiosity, why is the cmovcc instruction so much more restrictive than the general mov instruction? Why, for example, wouldn't both allow something like:

mov    $2, %rbx    # allowed
cmovcc $1, %rbx    # I suppose setcc %bl could be used for the '1' immediate case

As a side note, I've noticed when using Compiler Explorer that cmovcc is used much less than jcc and setcc. Is this normally the case, and if so, why is it used less frequently than the other conditionals?


Solution

  • Being conditional, it already needs 16 different opcodes just for the cmov r, r/m form, one for each different cc condition, just like jcc and setcc (synonyms share an opcode, of course).

    So even if there were "room" for another 16 0F xx opcodes, it probably wouldn't have been worth spending all that coding space when Intel added cmov for Pentium Pro. Well, maybe for a sign-extended-imm8 form. That would have taken away room for other new opcodes, like the MMX and SSE instructions which Intel had probably already started to design, or at least think about, for Pentium-MMX and Pentium III while the ISA extensions for P6 were being finalized.

    An imm8 form would be useful most of the time when you want a cmov at all (often to conditionally zero something), but it's not necessary. The RISC philosophy (which Intel was leaning into with P6; see footnote 1) would favour only providing one way, and letting code use a mov-immediate to create a constant in another register if desired.

    Out-of-order exec can often hide the cost of mov-immediate to put a constant in another register. Such an instruction is independent of everything else and can execute as soon as there's a spare cycle on the execution port it's scheduled to. (However, the front-end is often a real bottleneck, and static code-size does matter, so it's unfortunately not free.)

    Footnote 1: RISC ideas were a big thing for the P6 microarchitecture, most notably the revolutionary idea of decoding x86 instructions to 1 or more uops for its RISC-like back-end, allowing out-of-order exec of the different parts of one memory-destination instruction (load / ALU / store), for example.

    But also in smaller decisions: for example, P6 doesn't have hardware support for maintaining TLB coherence across the uops of one instruction. That's why adc %reg, (mem) needs more uops than you'd expect on Intel CPUs. Andy Glew (an Intel architect who worked on P6) explained that in Stack Overflow comments (which I quoted in this answer), including saying 'I was a RISC proponent when I joined P6, and my attitude was "let SW (microcode) do it".'

    It's easy to see how this attitude could extend to x86 ISA design, and only providing the bare minimum form of cmov. (8-bit is hardly necessary; you can always move the whole register, and you often want to avoid partial registers in high-performance code anyway because of possible stalls. Which were even more costly on PPro than on later P6 like Core 2. Sandybridge-family made partial-register merging even cheaper.)

    But this is pure speculation on my part about what factors may have influenced that design decision.


    The cost (in power and die area, and achievable clock speed) of adding transistors to decode an imm8, imm32, and/or r/m8 encoding of cmov would have to be weighed against the expected real-world speedup from code being able to use it. As well as against the future cost of using up more opcode coding space.

    Other than the future cost of coding-space (not spending it on cmov left room for MMX and SSE1 instructions to have only 2-byte opcodes), Intel might have guessed wrong on this by omitting cmov $sign_extended_imm8, %reg, which would actually be useful fairly often.


    It's used less because it's only useful when it's cheap to compute the result of both sides of a condition and select one, instead of just branching and only doing one. It's useful as an optimization, especially when a compiler expects that a branch would predict poorly. Purpose of cmove instruction in x86 assembly?

    More general cpu-architecture background about control dependencies (branching) vs. data dependencies (cmov): difference between conditional instructions (cmov) and jump instructions

    See Conditional move (cmov) in GCC compiler re: when GCC does if-conversion into branchless asm.

    Using cmov can even hurt if you do it wrong (gcc optimization flag -O3 makes code slower than -O2), for cases where branch prediction would have predicted pretty accurately (e.g. on the special case of sorted input data).

    On older CPUs with shorter / narrower pipelines and smaller out-of-order execution resources (so the cost of a mispredict was lower), CMOV was useful in even fewer cases. Especially on Intel before Broadwell, where it takes 2 uops instead of 1. Linus Torvalds explained why it sucks for a lot of common cases, with some tests on a Core 2 CPU back in 2007: https://yarchive.net/comp/linux/cmov.html

    It's certainly not rare to see compilers generate it, though, if you write code that selects from a couple values based on a condition. Clang's heuristics tend to favour using more cmov than GCC, i.e. more aggressive if-conversion to branchless.


    Note that setcc doesn't get used a lot either, unless you frequently look at non-inlined versions of functions that return a boolean.

    I disassembled libperl.so on my Arch Linux desktop (just picked a random large binary), compiled by GCC 10.1.0. Out of 377835 total instructions (objdump -d | egrep ' +[0-9a-f]+:'| wc -l):