assemblycpu-architectureriscvbranchlessconditional-move

RISCV branchless coding


On Intel AVX, there is a possibility of branchless code. Instead of branching for case0 or case1, you can compute both cases, and blend the results based on a condition.

AVX does this 8 way for float using the vblendps instruction.

You can also do this in a scalar way, without a vector, using the x86 instruction CMOVcc which performs a move operation, conditionally.

NOTE: ARM has CSEL and NEON has VBSL.

Can RISCV64 do a scalar move like this, so that you do not have to branch for

a = c ? x : y;

As I understand, RISCV implementations are in-order, so it would benefit even more than x86 when not having to branch. (The latter can at least shuffle around some instructions, and even branch speculatively to hide latency.)

The closest I can find w.r.t branchless operation for riscv is SLT (Set Less Than) but that sets to 1 or 0, and then would need multiplications? Wouldn't it be more useful to have SLT set to -1 or 0 instead, so that we can AND that?

UPDATE

When doing:

int foo(int a, int b, int x, int y)
{
    return a < b ? x : y;
}

I tried a poor-man's version of branchless using SLT. I am not sure if I did it completely right, by using bitmask as 0 - condition(0|1), I came up with:

branchless:
    SLT t0,a0,a1
    SUB t0,zero,t0
    NOT t1,t0
    AND t0,a2,t0
    AND t1,a3,t1
    OR  a0,t0,t1
    RET
    .size   branchless, .-branchless

as the branchless version of:

branched:
    BGE a0,a1,.L2
    MV  a3,a2
.L2:
    MV  a0,a3
    RET
    .size   branched, .-branched

I wonder if I used too many instructions for this, but I measured the branching version to be slightly faster than the non-branching one on random data, but not by much.


Solution

  • Update: see sh1's answer for the current situation: there's a conditional-zero instruction, like cmov from x0. The full cmov was dropped from the planned discussions before extension B made it to v1.0 (and extension B was split into some separate parts). An article has some details and links on the situation as of mid 2023.

    Current compilers no longer support b as a single-letter extension name either.


    The proposed RISC-V extension B includes cmov (with 4 operands: 3 inputs and a separate destination!). (Version 0.93 was current when the rest of this answer was written.)

    I think David Patterson (one of the lead architects behind MIPS and RISC-V) really dislikes cmov (along with short-vector SIMD like SSE/AVX) and thinks CPUs should specially handle "hammock" branches (that jump forward over a single instruction like a move) if they want to do that. Something like that. So this seems to be a case of philosophical purity getting in the way of including useful instructions. (AArch64 is a much more pragmatic design, still being RISC in the ways that matter for a high-performance implementation.)

    And/or perhaps a desire to limit instructions to at most 2 inputs, if there aren't any other 3-input instructions. That means a scalar pipeline only needs 2 register read ports, not 3, if it strictly follows this restriction. (That also means no add-with-carry, making extended-precision math quite a pain for numbers wider than 2 registers, when you have to deal with carry-in and carry-out to the same add operation.)

    You can emulate cmov as you say with a mask for AND/ANDnot/OR, but that would take quite a few instructions and is usually not worth it except possibly on wide and deep out-of-order machines, where the amount of work discarded by a branch miss is a lot bigger. (mask = (c == 0) - 1; which you can do with sltiu / add reg,reg, -1 to turn 0 into -1 and 1 into 0.)

    You kind of have it backwards in terms of which kind of microarchitecture benefits more from CMOV, although there are potential benefits either way. And an in-order machine already kind of has to wait at a conditional branch for the condition to resolve, vs. an out-of-order machine treating control dependencies very differently from data dependencies. As discussed in gcc optimization flag -O3 makes code slower than -O2, data dependencies through cmov can create a loop-carried dependency chain that's a bigger bottleneck that highly predictable branches.

    There are some out-of-order exec RISC-V designs, maybe even some that are open-source. For example, Erik Eidt linked The Berkeley Out-of-Order Machine (BOOM).


    Extension B: where they put all the fun instructions they left out

    The RISC-V extension B proposal has a conditional move, along with scalar min/max, popcount, leading/trailing zero count, bitfield insert/extract, two-register shifts, and a bunch of more esoteric stuff. https://five-embeddev.com/riscv-bitmanip/draft/bext.html#conditional-move-cmov

    Looking at the list of proposed instructions, it's amazing what got left out of baseline RISC-V, like sign-extension of narrow integers (currently requires slli/srai) if it's not already guaranteed by the calling convention or a load instruction, and standard stuff like popcount and leading/trailing zero count that most ISAs have.

    Godbolt shows clang 12.0 using cmov, min, and sext.b. In that clang version, -O3 -Wall -menable-experimental-extensions -march=rv32gcb0p93 was the magic incantation to do that. Extension B 0.93 is enabled by the b0p93 part of the string. (Extension B isn't finalized, and I don't know what version clang 14.0 was looking for; its error message wasn't helpful, and just plain -march=rv32gcb didn't get the compiler to actually use cmov.)

    //  -march=rv32gcb0p93 includes extension b 0.93 (0p93)
    
    int sel(int x, int y, int c){
        return c ? x : y;
    }
    # extension B  clang
            cmov    a0, a2, a0, a1
            ret
    
    # baseline gcc11.3  (clang and GCC12 waste several mv instructions)
            bne     a2,zero,.L2
            mv      a0,a1
    .L2:
            ret
    
    int min(int x, int y, int c){
        return (x<y) ? x : y;
    }
    # extension B  clang
            min     a0, a0, a1
            ret
    
    # baseline gcc
            ble     a0,a1,.L5
            mv      a0,a1
    .L5:
            ret
    
    int sext(int c){
        return (signed char)c;
    }
    # extension B  clang
            sext.b  a0, a0
            ret
    
    # baseline gcc
            slli    a0,a0,24
            srai    a0,a0,24
            ret