Tags: gcc, x86, clang, compiler-optimization, bmi

Compiler flags of GCC/CLANG to generate "BEXTR" instruction (of IA32's BMI1)


I am looking for compiler flags for GCC/Clang that generate the BEXTR instruction.

template <auto uSTART, auto uLENGTH, typename Tunsigned>
constexpr Tunsigned bit_extract(Tunsigned uInput)
{
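    // shift the wanted field down to bit 0, then mask off everything above the low uLENGTH bits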
    return (uInput >> uSTART) & ~(~Tunsigned(0) << uLENGTH);
}

template unsigned bit_extract<2, 3>(unsigned);
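
For reference, a quick sanity check of the intended behavior (the values here are just illustrative):

static_assert(bit_extract<4, 8>(0x12345678u) == 0x67u);  // bits [4..11] of the input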

The compiler flags -std=c++17 -O2 -mbmi do not generate a BEXTR instruction. (my example in compiler explorer)

The compiler flags -std=c++17 -O2 -march=native may generate a BEXTR instruction. (my example in compiler explorer)

What compiler flags should I use to generate the BEXTR instruction?


Solution

  • The relevant difference between clang -mbmi and -march=native was -mtune=znver3.
    Godbolt runs on AWS instances that can be Zen 3, Ice Lake, and maybe sometimes earlier Intel like SKX or Cascade Lake.

    bextr is a single uop (with 1 cycle latency) on the AMD Zen family, and thus sometimes worth using along with a mov-immediate to set up the control constant. The critical-path latency from data in to data out is only 1 cycle. https://uops.info/
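
    For reference, BEXTR's control operand packs the start position in bits [7:0] and the length in bits [15:8]. A minimal sketch with the BMI1 intrinsic from immintrin.h (compile with -mbmi; the constants match the question's bit_extract<2, 3> and are just illustrative):

    #include <immintrin.h>

    unsigned extract_3_at_2(unsigned x)
    {
        // start=2, len=3: the compiler can materialize the packed control
        // word 0x0302 (len<<8 | start) with a single mov-immediate
        return _bextr_u32(x, 2, 3);
    }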

    Apparently clang knows this and will use BEXTR when it's useful for -march=znver* (which implies -mtune), but not for -march=gracemont where it's also 1 uop (Alder Lake E-cores, and some upcoming low-power chips). This is a tuning bug that you could report to https://github.com/llvm/llvm-project/issues/ if it's not already in the works.

    GCC probably doesn't look for BEXTR patterns at all, at least not with constant shift counts. Using your expression with runtime-variable start/length, it uses BMI2 SHRX+BZHI, or a much worse sequence with -mno-bmi2.
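
    That SHRX+BZHI pattern corresponds to this intrinsic form (a sketch assuming BMI2 and _bzhi_u64 from immintrin.h, compiled with -mbmi2; not verbatim GCC output):

    #include <immintrin.h>

    unsigned long long extract_shrx_bzhi(unsigned long long x, unsigned start, unsigned len)
    {
        // x >> start becomes SHRX (variable shift that doesn't write FLAGS);
        // BZHI zeros the bits from position len upward, replacing the AND+mask
        return _bzhi_u64(x >> start, len);
    }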

    On Intel P-cores, BEXTR is 2 uops (and 2c latency) and thus no better than immediate SHR and AND. (Or shl / shr, which GCC uses for bit-ranges wider than 32 bits, where the mask wouldn't fit in an immediate for AND.) The mov register copy can benefit from mov-elimination, or a smarter compiler could have used rorx to copy-and-shift before masking, to optimize for Ice Lake where mov-elimination is disabled for integer regs. rorx costs more code-size than mov+shr for 32-bit operand-size, but is about break-even for 64-bit, where both instructions need a REX prefix.
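
    That rorx idea as a sketch (BMI2's _rorx_u64 takes an immediate rotate count, so it only helps when the start position is a compile-time constant; the constants here are illustrative):

    #include <immintrin.h>

    unsigned long long extract_rorx(unsigned long long x)
    {
        // rotating right by start=5 copies-and-shifts in one uop without relying
        // on mov-elimination; the wrapped-around low bits land above bit 7,
        // so the len=8 mask discards them (fine whenever start+len <= 64)
        return _rorx_u64(x, 5) & 0xFF;
    }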

    BEXTR's 2 uops run on port 0/6 for one and port 1/5 for the other, except on Alder Lake P-cores where the second uop can only run on port 1. It's probably the same internal uops as BMI2 shrx (p0/p6) and BMI2 bzhi, where Alder Lake made the same change (see footnote 1 below): bzhi is port 1 only, vs. p1/5 on earlier Intel since Haswell.

    BMI1 was new in 2012 with AMD's Piledriver (https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set). When Intel added support, they implemented bextr internally as a shift plus a new bzhi operation, which they also exposed directly as a BMI2 instruction. For Intel, BMI1 and BMI2 were both new in Haswell.

    AMD's TBM (trailing bit manipulation) extension included an immediate version of bextr, with the start and length fields packed into a 16-bit immediate.


    The ideal use-case for BMI1 bextr is a loop-invariant but runtime-variable bit-range. It's probably cheaper to pack the control fields than to generate a shift count in a register for shrx and a mask for and or andn, plus it's only one register to dedicate to holding that.
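
    As a sketch of that packing (using __bextr_u64, the two-operand GCC/clang BMI1 intrinsic that takes the pre-packed control word; compile with -mbmi):

    #include <immintrin.h>

    unsigned long long extract_packed(unsigned long long x, unsigned start, unsigned len)
    {
        unsigned long long ctrl = (len << 8) | start;  // loop-invariant if start/len are
        return __bextr_u64(x, ctrl);                   // one register holds both fields
    }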

    Clang manages that bextr selection on its own (no intrinsics) for a non-looping single use with a runtime-variable bitfield location, but ironically not for a loop.

    // dummy arg gives it the option of mov dh, cl which would be faster on Zen 3 (no merging uop needed)
    unsigned long bext_nonconst(int unused, unsigned long uInput, unsigned uSTART, unsigned uLENGTH)
    {
         return (uInput >> uSTART) & ~(~0ULL << uLENGTH);
    }
    

    Godbolt compiler output:

    # clang -O3 -march=znver3
            shrx    rax, rsi, rdx
            bzhi    rax, rax, rcx
            ret
    
    # clang -O3 -march=znver3  -mno-bmi2
            shl     ecx, 8
            movzx   eax, dl        # correctly choosing a separate destination for zero elimination
            or      eax, ecx
            bextr   rax, rsi, rax
            ret
        # could have been  mov dh, cl   to make RDX the control reg
        # but compilers only sometimes use partial registers
    

    In a loop, the perfect use-case for bextr, clang avoids it. /facepalm. So does GCC (still), even though we just saw clang use bextr for a single extract outside a loop with the same tuning options.

    void bext_array(unsigned long *dst, const unsigned long *src, unsigned bitstart, unsigned bitlen){
        for(int i=0 ; i<10240 ; i++){
            dst[i] = bext_nonconst(1, src[i], bitstart, bitlen);
        }
    }
    
    # clang -O3 -march=znver3  -mno-bmi2
    bext_array(unsigned long*, unsigned long const*, unsigned int, unsigned int):
            mov     eax, edx    # copy bitstart for no reason
            mov     rdx, -1
            xor     r8d, r8d
            shl     rdx, cl     # -1ULL<<bitlen   - count was already in ECX
            not     rdx         # mask = ~(~0ULL << bitlen)
    .LBB4_1:                                # =>This Inner Loop Header: Depth=1
            mov     r9, qword ptr [rsi + 8*r8]
            mov     ecx, eax     # could have been done outside the loop, missed optimization
            shr     r9, cl
            and     r9, rdx      # (src[i] >> bitstart) & mask
            mov     qword ptr [rdi + 8*r8], r9
            inc     r8
            cmp     r8, 10240
            jne     .LBB4_1
            ret
    

    The loop body could have been bextr rax, [rsi + rcx*8], rdx / mov [rdi + rcx*8], rax plus inc ecx or whatever. (Apparently bextr can't micro-fuse a memory source operand, so it's always an extra uop, even on AMD, and even with a one-register addressing mode on Intel.)
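
    If you need bextr inside the loop, the two-operand intrinsic from the sketch above hand-holds the compiler into it (again a sketch, not checked against any particular clang/GCC version):

    #include <immintrin.h>

    void bext_array_intrin(unsigned long *dst, const unsigned long *src,
                           unsigned bitstart, unsigned bitlen)
    {
        unsigned long long ctrl = (bitlen << 8) | bitstart;  // hoisted packed control
        for (int i = 0; i < 10240; i++)
            dst[i] = __bextr_u64(src[i], ctrl);              // one bextr per element
    }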


    Footnote 1: https://uops.info/ has wrong latency measurements on Alder Lake

    bextr is two single-cycle uops on Alder Lake, same as earlier Intel. And shrx and bzhi are still single-cycle latency, not 3. Given another measurement which got this sane result, the inconsistencies in the uops.info measurements, and the presumed sanity of Intel's CPU architects, it's almost certainly a microbenchmarking error.

    https://uops.info/ measured Alder Lake P-cores as having 3 cycle latency per uop in shrx, bzhi, and bextr, but that would be insane from a CPU-design PoV (especially for shrx/shlx, while shl reg, cl is still 1 cycle latency for the register result!). And uops.info measured 4c latency from one input but 6c latency from the other, yet Intel generally doesn't support late forwarding (where a uop could start with only some of its inputs ready). 3c could have been plausible if one uop was just pre-processing an input that wasn't on the critical path, but not 4c.

    They also measured memory-source shrx with 1c latency from shift-count to result, which is inconsistent with reg-source being 3 cycles, but consistent with 1c latency shifts.

    InstLatx64 measured i9-12900K Alder Lake's Golden Cove P-cores as having single-cycle latency for bzhi and shrx, and 2 cycle latency for bextr, using simpler measurement sequences with just the instruction repeated. InstLat also confirms the throughput changes (to 1/clock for bzhi and bextr). These make sense, matching what I'd expected.
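
    A sketch of that simpler direct-chain style of measurement (GNU inline asm in AT&T syntax; TSC-vs-core-clock calibration is hand-waved here, so treat the absolute number as rough):

    #include <x86intrin.h>   // __rdtsc
    #include <stdint.h>
    #include <stdio.h>

    int main()
    {
        uint64_t x = 0x123456789abcdef0, ctrl = (8u << 8) | 4;  // len=8, start=4
        const uint64_t iters = 100000000;
        uint64_t t0 = __rdtsc();
        for (uint64_t i = 0; i < iters; i++)
            // destination = source: each bextr has to wait for the previous result
            asm volatile("bextr %1, %0, %0" : "+r"(x) : "r"(ctrl));
        uint64_t t1 = __rdtsc();
        printf("~%.2f ref cycles per bextr (x=%llu)\n",
               (double)(t1 - t0) / iters, (unsigned long long)x);
    }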

    So I think something must be weird on Alder Lake with the instruction sequence uops.info uses to measure latency. They use movsxd to make a dependency chain between instructions under test (to distinguish different inputs for multi-uop instructions, and in case of stuff like shl reg,cl where the latency chain of basically false dependencies through FLAGS is slower than the data shift itself.)

    uops.info shows the actual loop bodies for every experiment, plus the raw perf-counter numbers, which is fantastic, much better than any other instruction table. It allows digging deeper to see what was actually measured in surprising cases, which has previously been enough to figure out what was going on and report a bug to Andreas Abel (who then fixed the test sequence for the affected instructions). I've emailed him about this one, and will update once we figure out what's going on.

    movsxd r64, r32 measurements on Alder Lake seem normal, 1c latency for any of 3 ports (1/5/B), measuring MOVSXD R9, R8D with movsxd r8,r9d then a chain of 4x movsxd r8,r8d.

    (InstLatx64 reports MOVSXD r1_64, r2_32 as having 0.3c latency, but should really report it as "no true dep" like for cmp, since their sequence doesn't create a dep chain feeding the result back into the input. Unlike their XCHG r1_32, r2_32, where separate regs let us see the 1.5c average latency, since it's 2c one way and 1c the other. All this tells us is that there's no (false) output dependency, like lz/tzcnt and popcnt had on Intel for a while. Alder Lake hasn't started doing mov-elimination for movsx[d], or for movzx with 16-bit sources.)

    The numbers measured are accurate for the sequences they actually tested, and these are latency tests, so front-end effects probably weren't the problem. Take bextr r32,r32,r32 on ADL as an example: for testing "Latency operand 2 → 1: 6 cycles", the loop body was

       0:   c4 42 28 f7 c8          bextr  r9d,r8d,r10d
       5:   4d 63 c1                movsxd r8,r9d    # feed output back to operand 2
       8:   4d 63 c0                movsxd r8,r8d    # and add some latency
       b:   4d 63 c0                movsxd r8,r8d
       e:   4d 63 c0                movsxd r8,r8d
      11:   4d 63 c0                movsxd r8,r8d
    

    They expected a "chain latency" of 5 cycles for the 5 movsxd instructions. They measured 11.0 core clock cycles (11 total minus the assumed 5-cycle chain is where the 6-cycle figure for bextr comes from), and UOPS_EXECUTED.THREAD: 7.0, which rules out any mov-elimination of movsxd (that would have been surprising): 2 uops from bextr plus 5 movsxd. (And any lower latency for movsxd would have made the calculated latency even higher for bextr, since the calculation assumed 1c each.)

    I don't know what about movsxd would be special, but testing with lea r8d, [r9+1] would also work as a 1c latency instruction with 1 input and a write-only destination. (On earlier CPUs, LEA ran on fewer ports so movsxd was a good choice.)

    lea with a 64-bit destination and a small [reg+disp8] addressing mode can be eliminated on Alder Lake's Golden Cove P cores, zero latency, but not for 32-bit destinations. (Still, that might change, so lea is actually not a good future-proof choice). IDK how they write a different register value without an ALU uop; maybe the RAT (register allocation table) supports some kind of offset which can be added later during register read? Like the stack engine, but able to be read by back-end uops? It also works for add r64, imm8 but not add r32, imm8. https://www.anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid-complexity/5

    On the Gracemont E-cores, that Anandtech article says movsx is supposed to have mov-elimination, but uops.info measures 1c latency for all reg-source cases. And the high latency measurements were on the P-cores anyway.

    LEA_B_IS_D8 (R32), e.g. lea r8d,[r14+r13*2+0x8], measured at 2c latency, might be an example of this effect. The Anandtech changelog says "LEA with scale latency 2->3 clocks" for Golden Cove vs. Cypress Cove, so that measurement is actually a cycle faster than it apparently should be.

    Other affected instructions include the FLAGS result of simple integer instructions like add, sub, and and, as well as andn, where uops.info reports the latency as [1:2], meaning 1c for some combinations of inputs and outputs (in this case input regs to output reg) and 2c for others (input regs to FLAGS). So it's not just instructions with a separate write-only destination that are weird. (But FLAGS is technically a separate output, not one of the inputs.)