I am writing an LLVM pass module to instrument every single memory operation in a program, and part of my logic needs to do some very hot binary logic on pointers.
How can I implement "bit ? u64_value : zero" in as few cycles as possible, preferably without using an explicit branch? I have a bit in the least significant bit of one register, and a value (assume u64) in another. If the bit is set, I want the value preserved. If the bit is zero, I want to zero out the register.
I can use x86 BMI instructions.
On AMD, and on Intel Broadwell and later, CMOV is only 1 uop with 1 cycle of latency; on Haswell and earlier it's 2 uops / 2 cycles. It's your best bet for conditionally zeroing a register.
xor r10d, r10d # r10=0. hoist out of loops if possible
test al, 1 # test the low bit of RAX, setting ZF
cmovz rax, r10 # zero RAX if the low bit was zero, otherwise unmodified
(A test r64, imm8 encoding doesn't exist, only test r64, imm32, so you want to use the low-8 register for a shorter instruction if you're testing a mask that's all zero outside the low 8 bits.)
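For reference, the semantics of the test / cmovz sequence can be written in C. This is only a sketch: the helper names are hypothetical, and whether a compiler actually emits test + cmov for the ternary form depends on the compiler and its options, so check the generated asm. The second helper is a different technique (an all-ones / all-zero mask with AND), included only for comparison with the CMOV approach.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models "test al, 1 / cmovz rax, r10" with r10 = 0.
   Returns value if the low bit of flag is set, otherwise 0. */
static inline uint64_t keep_if_low_bit(uint64_t flag, uint64_t value)
{
    return (flag & 1) ? value : 0;
}

/* Alternative technique, not the CMOV sequence: turn the low bit into
   an all-ones or all-zero mask and AND it with the value. */
static inline uint64_t keep_if_low_bit_mask(uint64_t flag, uint64_t value)
{
    uint64_t mask = 0 - (flag & 1);   /* 1 -> 0xFFFFFFFFFFFFFFFF, 0 -> 0 */
    return value & mask;
}

/* Small consistency check: both helpers must agree. */
static inline void check_keep_if_low_bit(void)
{
    assert(keep_if_low_bit(1, 42) == keep_if_low_bit_mask(1, 42));
    assert(keep_if_low_bit(2, 42) == keep_if_low_bit_mask(2, 42));
}
```

Only bit 0 of flag matters in both versions; higher bits are ignored, matching test al, 1.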
If the bit position is in a register, bt reg, reg is only 1 uop on Intel and AMD. (bts reg, reg is 2 uops on AMD K8 through Ryzen, but plain bt, which just sets CF according to the value of the selected bit, is cheap on both AMD and Intel.)
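bt rax, rdx # CF = (RAX >> (RDX & 63)) & 1, i.e. the selected bit of RAX
cmovnc rax, r10 # zero RAX (r10 = 0) if that bit was clear, otherwise unmodified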
With both of these, the register you test can be different from the CMOV destination.
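As a rough C model of the bt / cmovnc variant, assuming the same zeroed scratch register: the helper name and parameter names below are hypothetical, and the code only describes what the instruction pair computes, not how a compiler will lower it. It also shows the tested register and the conditionally zeroed value being separate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models "bt flags, pos / cmovnc value, zero".
   With a register operand, bt uses the bit index modulo the operand
   size, so pos is masked to 0..63 here. */
static inline uint64_t keep_if_bit(uint64_t flags, uint64_t pos, uint64_t value)
{
    uint64_t bit = (flags >> (pos & 63)) & 1;  /* what bt puts in CF */
    return bit ? value : 0;                    /* cmovnc zeroes when CF = 0 */
}

/* Small check of the set, clear, and wrapped-index cases. */
static inline void check_keep_if_bit(void)
{
    assert(keep_if_bit(0x10, 4, 7) == 7);   /* bit 4 set: value kept */
    assert(keep_if_bit(0x10, 3, 7) == 0);   /* bit 3 clear: zeroed */
    assert(keep_if_bit(0x01, 64, 9) == 9);  /* index 64 wraps to bit 0 */
}
```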
See https://agner.org/optimize/ for more performance info, and also https://stackoverflow.com/tags/x86/info