I'm designing a prefixed variable-length integer encoding.
Rust has methods for counting leading and trailing ones and zeros: https://doc.rust-lang.org/std/primitive.u64.html#method.leading_zeros
Is there any difference in the efficiency of these methods on x86_64, arm32 and arm64?
For example, if counting trailing ones is faster than counting trailing zeros, I will use `xxxx0111` instead of `xxxx1000` for the length-encoding byte (three following bytes, in this example).
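For concreteness, a minimal sketch of how the extra-byte count would be decoded from the prefix byte under each candidate encoding (function names here are invented for illustration):

```rust
/// Length encoded as trailing ones, e.g. 0b_xxxx_0111 -> 3 following bytes.
fn extra_bytes_trailing_ones(prefix: u8) -> u32 {
    prefix.trailing_ones()
}

/// Length encoded as trailing zeros below a terminating 1 bit,
/// e.g. 0b_xxxx_1000 -> 3 following bytes.
fn extra_bytes_trailing_zeros(prefix: u8) -> u32 {
    prefix.trailing_zeros()
}

fn main() {
    assert_eq!(extra_bytes_trailing_ones(0b0101_0111), 3);
    assert_eq!(extra_bytes_trailing_zeros(0b0101_1000), 3);
}
```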
Counting trailing zeros is faster than counting trailing ones on all three ISAs: x86, ARM, AArch64. They all provide zero-counting instructions, like x86 `bsf` (find the lowest set bit) or x86 BMI1 `tzcnt` (Trailing Zero Count). Counting leading/trailing ones in a runtime-variable value would require negating the input first.
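In Rust terms, the negate-first step is just this equivalence (a sketch to show where the extra NOT comes from, not necessarily how the standard library implements it):

```rust
// Counting trailing ones is counting trailing zeros of the bitwise complement;
// the extra NOT is where the (small) cost difference comes from.
fn trailing_ones_via_zeros(x: u64) -> u32 {
    (!x).trailing_zeros()
}

fn main() {
    let x: u64 = 0b1010_0111;
    assert_eq!(x.trailing_ones(), trailing_ones_via_zeros(x)); // both are 3
}
```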
ARM / AArch64 provide a leading-zero count (`clz`) but no trailing-zero count; the best option for trailing zeros is `rbit` / `clz`, i.e. bit-reverse and then count leading zeros (available since ARMv6T2 or ARMv7). https://godbolt.org/z/Yr7eac. Before that, compilers have to isolate the lowest set bit with `x & -x`, count the leading zeros of that, and take `31 - clz(x & -x)`.
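A sketch of that older fallback, written in Rust for readability (the point is what a compiler has to emit on pre-ARMv6T2 targets, not this source itself):

```rust
// Trailing-zero count built from a leading-zero count:
// isolate the lowest set bit with x & -x, then convert its clz to a bit index.
// The identity only holds for nonzero x, so zero is handled separately.
fn trailing_zeros_via_clz(x: u32) -> u32 {
    if x == 0 {
        return 32;
    }
    let lowest_set = x & x.wrapping_neg(); // x & -x isolates the lowest set bit
    31 - lowest_set.leading_zeros()
}

fn main() {
    assert_eq!(trailing_zeros_via_clz(0b1011_0000), 4);
    assert_eq!(trailing_zeros_via_clz(0), 32);
}
```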
(On x86, counting leading zeros is most efficient with BMI1 `lzcnt`. Without it, `bsr` can give you the position of the highest set bit, so the compiler needs `31 - bsr(x)` to implement clz. On AMD CPUs, `bsf` / `bsr` are significantly slower than their `tzcnt` / `lzcnt` counterparts, so if possible it's good to compile with `-march=native`, or the Rust equivalent `-C target-cpu=native`.)
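The `bsr` relationship, expressed in safe Rust for nonzero inputs (a sketch of the identity, not a use of the instruction itself):

```rust
// bsr returns the index of the highest set bit, so clz(x) == 31 - bsr(x)
// for nonzero 32-bit x; equivalently bsr(x) == 31 - clz(x).
fn highest_set_bit_index(x: u32) -> u32 {
    assert!(x != 0, "bsr is undefined for zero input");
    31 - x.leading_zeros() // what x86 `bsr` returns directly
}

fn main() {
    assert_eq!(highest_set_bit_index(0b0100_0000), 6);
}
```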
On AMD Zen, `lzcnt` is 1 uop, but `tzcnt` is 2 uops (https://uops.info/). Presumably the extra uop is either a bit-reverse or isolating the lowest set bit with `x & -x`, but either way it's generating something to feed to the `lzcnt` hardware.
Related fun fact: looping over the set bits to find their positions is often best done with `x = blsr(x)`, i.e. `x &= x - 1`, as the loop-carried dependency, and independently bit-scanning each value for its lowest set bit. That way there isn't a bit-scan and shift in the critical-path latency of the loop.
Related: Set the leading zero bits in any size integer C++
Related: bsr/bsf and tzcnt performance on historical x86 CPUs