On Intel AVX, there is a possibility of branchless code. Instead of branching for case0 or case1, you can compute both cases, and blend the results based on a condition.
AVX does this 8 way for float using the vblendps instruction.
You can also do this in a scalar way, without a vector, using the x86 instruction CMOVcc which performs a move operation, conditionally.
NOTE: ARM has CSEL and NEON has VBSL.
Can RISCV64 do a scalar move like this, so that you do not have to branch for
a = c ? x : y;
As I understand, RISCV implementations are in-order, so it would benefit even more than x86 when not having to branch. (The latter can at least shuffle around some instructions, and even branch speculatively to hide latency.)
The closest I can find w.r.t branchless operation for riscv is SLT (Set Less Than) but that sets to 1 or 0, and then would need multiplications? Wouldn't it be more useful to have SLT set to -1 or 0 instead, so that we can AND that?
When doing:
int foo(int a, int b, int x, int y)
{
return a < b ? x : y;
}
I tried a poor-man's version of branchless using SLT. I am not sure if I did it completely right, by using bitmask as 0 - condition(0|1), I came up with:
branchless:
SLT t0,a0,a1
SUB t0,zero,t0
NOT t1,t0
AND t0,a2,t0
AND t1,a3,t1
OR a0,t0,t1
RET
.size branchless, .-branchless
as the branchless version of:
branched:
BGE a0,a1,.L2
MV a3,a2
.L2:
MV a0,a3
RET
.size branched, .-branched
I wonder if I used too many instructions for this, but I measured the branching version to be slightly faster than the non-branching one on random data, but not by much.
Update: see sh1's answer for the current situation: there's a conditional-zero instruction, like cmov
from x0
. The full cmov
was dropped from the planned discussions before extension B made it to v1.0 (and extension B was split into some separate parts). An article has some details and links on the situation as of mid 2023.
Current compilers no longer support b
as a single-letter extension name either.
The proposed RISC-V extension B includes cmov
(with 4 operands: 3 inputs and a separate destination!). (Version 0.93 was current when the rest of this answer was written.)
I think David Patterson (one of the lead architects behind MIPS and RISC-V) really dislikes cmov
(along with short-vector SIMD like SSE/AVX) and thinks CPUs should specially handle "hammock" branches (that jump forward over a single instruction like a move) if they want to do that. Something like that. So this seems to be a case of philosophical purity getting in the way of including useful instructions. (AArch64 is a much more pragmatic design, still being RISC in the ways that matter for a high-performance implementation.)
And/or perhaps a desire to limit instructions to at most 2 inputs, if there aren't any other 3-input instructions. That means a scalar pipeline only needs 2 register read ports, not 3, if it strictly follows this restriction. (That also means no add-with-carry, making extended-precision math quite a pain for numbers wider than 2 registers, when you have to deal with carry-in and carry-out to the same add operation.)
You can emulate cmov
as you say with a mask for AND/ANDnot/OR, but that would take quite a few instructions and is usually not worth it except possibly on wide and deep out-of-order machines, where the amount of work discarded by a branch miss is a lot bigger. (mask = (c == 0) - 1;
which you can do with sltiu
/ add reg,reg, -1
to turn 0 into -1 and 1 into 0.)
You kind of have it backwards in terms of which kind of microarchitecture benefits more from CMOV, although there are potential benefits either way. And an in-order machine already kind of has to wait at a conditional branch for the condition to resolve, vs. an out-of-order machine treating control dependencies very differently from data dependencies. As discussed in gcc optimization flag -O3 makes code slower than -O2, data dependencies through cmov
can create a loop-carried dependency chain that's a bigger bottleneck that highly predictable branches.
There are some out-of-order exec RISC-V designs, maybe even some that are open-source. For example, Erik Eidt linked The Berkeley Out-of-Order Machine (BOOM).
The RISC-V extension B proposal has a conditional move, along with scalar min/max, popcount, leading/trailing zero count, bitfield insert/extract, two-register shifts, and a bunch of more esoteric stuff. https://five-embeddev.com/riscv-bitmanip/draft/bext.html#conditional-move-cmov
Looking at the list of proposed instructions, it's amazing what got left out of baseline RISC-V, like sign-extension of narrow integers (currently requires slli/srai) if it's not already guaranteed by the calling convention or a load instruction, and standard stuff like popcount and leading/trailing zero count that most ISAs have.
Godbolt shows clang 12.0 using cmov
, min
, and sext.b
. In that clang version, -O3 -Wall -menable-experimental-extensions -march=rv32gcb0p93
was the magic incantation to do that. Extension B 0.93 is enabled by the b0p93
part of the string. (Extension B isn't finalized, and I don't know what version clang 14.0 was looking for; its error message wasn't helpful, and just plain -march=rv32gcb
didn't get the compiler to actually use cmov
.)
// -march=rv32gcb0p93 includes extension b 0.93 (0p93)
int sel(int x, int y, int c){
return c ? x : y;
}
# extension B clang
cmov a0, a2, a0, a1
ret
# baseline gcc11.3 (clang and GCC12 waste several mv instructions)
bne a2,zero,.L2
mv a0,a1
.L2:
ret
int min(int x, int y, int c){
return (x<y) ? x : y;
}
# extension B clang
min a0, a0, a1
ret
# baseline gcc
ble a0,a1,.L5
mv a0,a1
.L5:
ret
int sext(int c){
return (signed char)c;
}
# extension B clang
sext.b a0, a0
ret
# baseline gcc
slli a0,a0,24
srai a0,a0,24
ret