I am looking for GCC/Clang compiler flags that generate the BEXTR instruction.
template <auto uSTART, auto uLENGTH, typename Tunsigned>
constexpr Tunsigned bit_extract(Tunsigned uInput)
{
return (uInput >> uSTART) & ~(~Tunsigned(0) << uLENGTH);
}
template unsigned bit_extract<2, 3>(unsigned);
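For concreteness, the instantiation above extracts the 3 bits starting at bit 2, i.e. bits [4:2] (a quick sanity check I've added, not part of the original question):
// bits [4:2] of 0b10110100 are 0b101 = 5
static_assert(bit_extract<2, 3>(0b10110100u) == 0b101u);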
The compiler flags -std=c++17 -O2 -mbmi do not generate the BEXTR instruction (my example in Compiler Explorer).
The compiler flags -std=c++17 -O2 -march=native may generate the BEXTR instruction (my example in Compiler Explorer).
What compiler flags should I use to generate the BEXTR instruction?
The relevant difference between clang -mbmi and -march=native was -mtune=znver3.
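If you want BEXTR regardless of what the tuning heuristics decide, the BMI1 intrinsic from immintrin.h is a more direct route than hoping the pattern-matcher fires. A minimal sketch (my addition, assuming you can require -mbmi; bextr_direct is a hypothetical name):
#include <immintrin.h>
// Compiles to BEXTR with -mbmi instead of relying on pattern-matching.
// start/len are runtime values here; for the constant <2,3> case the
// compiler can materialize the packed control word as an immediate.
unsigned bextr_direct(unsigned x, unsigned start, unsigned len)
{
    return _bextr_u32(x, start, len);
}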
Godbolt runs on AWS instances that can be Zen 3, Ice Lake, and maybe sometimes earlier Intel like SKX or Cascade Lake.
bextr is a single uop (with 1 cycle latency) on the AMD Zen family and thus sometimes worth using with a mov-immediate to set up a control constant. The critical-path latency from data in to data out is only 1 cycle. https://uops.info/
Apparently clang knows this and will use BEXTR when it's useful for -march=znver* (which implies -mtune), but not for -march=gracemont where it's also 1 uop (Alder Lake E-cores, and some upcoming low-power chips). This is a tuning bug that you could report to https://github.com/llvm/llvm-project/issues/ if it's not already in the works.
GCC probably doesn't look for BEXTR patterns at all, at least not with constant shift counts. Using your expression for runtime-variable start/length, it uses BMI2 SHRX+BZHI, or a much worse sequence with -mno-bmi2.
On Intel P-cores BEXTR is 2 uops (and 2c latency) and thus not better than immediate SHR and AND. (Or shl / shr, which GCC uses for bit-ranges wider than 32 bits, where the mask wouldn't fit in an immediate for AND.) The mov register copy can benefit from mov-elimination, or a smarter compiler could have used rorx to copy-and-shift before masking, optimizing for Ice Lake where mov-elimination is disabled for integer regs. rorx would cost more code-size than mov+shr for 32-bit operand-size, but about break-even for 64-bit where both instructions need a REX.
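For the question's constant <2,3> case, the rorx-then-mask idea looks something like this sketch with the BMI2 intrinsic (my example, not from the answer; the rotate count must be a compile-time constant, and this is only equivalent while uLENGTH <= 64 - uSTART, so the wrapped-around bits can't land inside the mask):
#include <immintrin.h>
// Rotate right by 2 copies-and-shifts in one instruction (no mov needed);
// the AND keeps only the low 3 bits, discarding everything above them,
// including the two bits that wrapped around to the top.
unsigned long long extract_rorx(unsigned long long x)
{
    return _rorx_u64(x, 2) & 0b111u;
}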
BEXTR's 2 uops run on port 0/6 for one and port 1/5 for the other, except on Alder Lake P-cores where the second uop can only run on port 1. It's probably the same internal uops as BMI2 shrx (p0/p6) and BMI2 bzhi, where Alder Lake made the same change: bzhi is port 1 only, vs. p1/p5 in earlier Intel since Haswell.
BMI1 was new in 2012 with AMD's Piledriver (https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set). When Intel decided to support it, they implemented bextr in terms of a shift and a new bzhi operation, which they exposed directly as a BMI2 instruction. For Intel, BMI1 and BMI2 were both new in Haswell.
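In source terms, that decomposition is just this (my sketch with the BMI2 intrinsic; bzhi zeroes every bit at position len and above):
#include <immintrin.h>
// bextr(x, start, len) == bzhi(x >> start, len):
// shift the field down, then zero everything above the low len bits.
// With runtime start, the shift compiles to SHRX under -mbmi2.
unsigned long long bextr_via_bzhi(unsigned long long x, unsigned start, unsigned len)
{
    return _bzhi_u64(x >> start, len);
}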
AMD's TBM (trailing bit manipulation) included an immediate version of bextr, with a 16-bit immediate holding the packed start/length control.
The ideal use-case for BMI1 bextr is a loop-invariant but runtime-variable bit-range. It's probably cheaper to pack the control fields than to generate a shift count in a register for shrx and a mask for and or andn, plus it's only one register to dedicate to holding that.
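BEXTR's register control operand packs the start position into bits [7:0] and the length into bits [15:8], so setting it up is just this (a sketch; it's the same shl ecx,8 / or pattern clang emits below):
// Pack BEXTR's control word: start in bits [7:0], length in bits [15:8].
unsigned bextr_control(unsigned start, unsigned len)
{
    return (len << 8) | start;
}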
Clang does that for a non-looping single use case with runtime-variable bitfield location, but ironically not for a loop.
// dummy arg gives it the option of mov dh, cl which would be faster on Zen 3 (no merging uop needed)
unsigned long bext_nonconst(int unused, unsigned long uInput, unsigned uSTART, unsigned uLENGTH)
{
return (uInput >> uSTART) & ~(~0ULL << uLENGTH);
}
Godbolt compiler output:
# clang -O3 -march=znver3
shrx rax, rsi, rdx
bzhi rax, rax, rcx
ret
# clang -O3 -march=znver3 -mno-bmi2
shl ecx, 8
movzx eax, dl # correctly choosing a separate destination for zero elimination
or eax, ecx
bextr rax, rsi, rax
ret
# could have been mov dh, cl to make RDX the control reg
# but compilers only sometimes use partial registers
In a loop, the perfect use-case for bextr, clang avoids it. /facepalm. So does GCC, still, but we just saw clang use bextr for one extract not in a loop with the same tuning options.
void bext_array(unsigned long *dst, const unsigned long *src, unsigned bitstart, unsigned bitlen){
for(int i=0 ; i<10240 ; i++){
dst[i] = bext_nonconst(1, src[i], bitstart, bitlen);
}
}
# clang -O3 -march=znver3 -mno-bmi2
bext_array(unsigned long*, unsigned long const*, unsigned int, unsigned int):
mov eax, edx # copy bitstart for no reason
mov rdx, -1
xor r8d, r8d
shl rdx, cl # -1ULL<<bitlen - count was already in ECX
not rdx # mask = ~(~0ULL << bitlen)
.LBB4_1: # =>This Inner Loop Header: Depth=1
mov r9, qword ptr [rsi + 8*r8]
mov ecx, eax # could have been done outside the loop, missed optimization
shr r9, cl
and r9, rdx # (src[i] >> bitstart) & mask
mov qword ptr [rdi + 8*r8], r9
inc r8
cmp r8, 10240
jne .LBB4_1
ret
The loop body could have been bextr rax, [rsi + rcx*8], rdx / mov [rdi + rcx*8], rax + inc ecx or whatever. (Apparently bextr can't micro-fuse a memory source operand so it's always an extra uop, even on AMD, and even with a one-register addressing mode on Intel.)
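A hedged sketch of the loop written with the BMI1 intrinsic (my rewrite, not from the answer), which at least gives the compiler the option of hoisting the control-word packing and dedicating just one register to it:
#include <immintrin.h>
void bext_array_intrin(unsigned long long *dst, const unsigned long long *src,
                       unsigned start, unsigned len)
{
    for (int i = 0; i < 10240; i++)
        // _bextr_u64 takes start/len separately; a smart compiler packs
        // them into one control register outside the loop.
        dst[i] = _bextr_u64(src[i], start, len);
}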
bextr is two single-cycle uops on Alder Lake, same as earlier Intel. And shrx and bzhi are still single-cycle latency, not 3. Given another measurement which got this sane result, inconsistencies in the uops.info measurements, and the presumed sanity of Intel's CPU architects, it's almost certainly a microbenchmarking error.
https://uops.info/ measured Alder Lake P-cores as having 3-cycle latency per uop for shrx, bzhi, and bextr, but that would be insane from a CPU-design PoV (especially for shrx/shlx, while shl reg, cl is still 1 cycle latency for the reg result!). And uops.info measured 4c latency from one input but 6c latency from the other, yet Intel generally doesn't support late forwarding (where a uop could start with only some of its inputs ready). 3c could have been plausible if one uop was just pre-processing an input that wasn't on the critical path, but not 4c.
They also measured memory-source shrx with 1c latency from shift-count to result, which is inconsistent with reg-source being 3 cycles, but consistent with 1c-latency shifts.
InstLatx64 measured i9-12900K Alder Lake's Golden Cove P-cores as having single-cycle latency for bzhi and shrx, and 2-cycle latency for bextr, using simpler measurement sequences with just the instruction repeated. InstLat also confirms the throughput changes (to 1/clock for bzhi and bextr). These make sense, matching what I'd expected.
So I think something must be weird on Alder Lake with the instruction sequence uops.info uses to measure latency. They use movsxd to make a dependency chain between instructions under test (to distinguish different inputs for multi-uop instructions, and in case of stuff like shl reg,cl where the latency chain of basically-false dependencies through FLAGS is slower than the data shift itself).
uops.info shows the actual loop bodies for every experiment, and the raw perf-counter numbers, which is fantastic, much better than any other instruction table. It allows digging deeper to see what was actually measured in surprising cases, which in the past has been enough to see what was going on and report a bug to Andreas Abel (who has then fixed the test sequence for the affected instructions). I emailed him about this one; I'll update once we figure out what's going on.
movsxd r64, r32 measurements on Alder Lake seem normal: 1c latency on any of 3 ports (1/5/B), measuring MOVSXD R9, R8D with movsxd r8,r9d then a chain of 4x movsxd r8,r8d.
(InstLatx64 reports MOVSXD r1_64, r2_32 as having 0.3c latency, but should actually be reporting that as "no true dep" like for cmp, since they don't create a dep chain that feeds the result back into the input. Unlike their XCHG r1_32, r2_32, where using separate regs lets us see the 1.5c average latency, since it's 2c one way and 1c the other. All this tells us is that there's no (false) output dependency, like lzcnt/tzcnt and popcnt had on Intel for a while. Alder Lake hasn't started doing mov-elimination for movsx[d], or for movzx with 16-bit sources.)
The numbers measured are accurate for the sequences they did actually test, and these are latency tests, so front-end effects probably weren't the problem. E.g. for bextr r32,r32,r32 on ADL, testing "Latency operand 2 → 1: 6 cycles", the loop body was
0: c4 42 28 f7 c8 bextr r9d,r8d,r10d
5: 4d 63 c1 movsxd r8,r9d # feed output back to operand 2
8: 4d 63 c0 movsxd r8,r8d # and add some latency
b: 4d 63 c0 movsxd r8,r8d
e: 4d 63 c0 movsxd r8,r8d
11: 4d 63 c0 movsxd r8,r8d
They expected a "chain latency" of 5 cycles for the 5 movsxd instructions. They measured 11.0 core clock cycles, and UOPS_EXECUTED.THREAD: 7.0, which rules out any mov-elimination of movsxd (which would have been surprising): 2 uops from bextr, plus 5 movsxd. (And any lower latency for movsxd would have made the calculated latency even higher for bextr, since the calculation assumed 1c each.)
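Spelled out, the arithmetic behind the reported number:
11.0c measured loop latency - 5 x 1c assumed movsxd chain = 6c attributed to bextr operand 2 -> 1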
I don't know what would be special about movsxd, but testing with lea r8d, [r9+1] would also work as a 1c-latency instruction with 1 input and a write-only destination. (On earlier CPUs, LEA ran on fewer ports, so movsxd was a good choice.)
lea with a 64-bit destination and a small [reg+disp8] addressing mode can be eliminated on Alder Lake's Golden Cove P-cores, zero latency, but not for 32-bit destinations. (Still, that might change, so lea is actually not a good future-proof choice.) IDK how they write a different register value without an ALU uop; maybe the RAT (register alias table) supports some kind of offset which can be added later during register read? Like the stack engine, but readable by back-end uops? It also works for add r64, imm8 but not add r32, imm8. https://www.anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid-complexity/5
On the Gracemont E-cores, that AnandTech article says movsx is supposed to have mov-elimination, but uops.info measures 1c latency for all reg-source cases. And the high-latency measurements were on the P-cores anyway.
LEA_B_IS_D8 (R32), e.g. lea r8d,[r14+r13*2+0x8], measured at 2c latency might be an example of this effect. The AnandTech changelog says "LEA with scale latency 2->3 clocks" for Golden Cove vs. Cypress Cove, so that measurement is actually a cycle faster than it apparently should be.
Other affected instructions include the FLAGs result of simple integer instructions like add
, sub
, and and
, as well as andn
, where uops.info reports their latency as [1:2], which means 1c for some combinations of inputs and outputs (in this case regs to output reg), 2c for others (input regs to FLAGS). So it's not just instructions with a separate write-only destination that are weird. (But FLAGS is technically a separate output, not one of the inputs.)