I'm designing a prefixed variable-length integer encoding.
Rust has methods for counting leading and trailing ones and zeros: https://doc.rust-lang.org/std/primitive.u64.html#method.leading_zeros
Is there any difference in the efficiency of these methods on x86_64, arm32 and arm64?
For example, if counting trailing ones is faster than counting trailing zeros, I will use `xxxx0111` instead of `xxxx1000` for the length-encoding byte (three following bytes, in this example).
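For concreteness, a minimal sketch of how the extra-byte count would be decoded from the prefix byte under each candidate encoding (function names here are invented for illustration):

```rust
/// Length encoded as trailing ones, e.g. 0b_xxxx_0111 -> 3 following bytes.
fn extra_bytes_trailing_ones(prefix: u8) -> u32 {
    prefix.trailing_ones()
}

/// Length encoded as trailing zeros below a terminating 1 bit,
/// e.g. 0b_xxxx_1000 -> 3 following bytes.
fn extra_bytes_trailing_zeros(prefix: u8) -> u32 {
    prefix.trailing_zeros()
}

fn main() {
    assert_eq!(extra_bytes_trailing_ones(0b0101_0111), 3);
    assert_eq!(extra_bytes_trailing_zeros(0b0101_1000), 3);
}
```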
Counting trailing zeros is faster than counting trailing ones on all three ISAs: x86, ARM, AArch64. They all provide zero-counting instructions, like x86 `bsf` (find the lowest set bit) or x86 BMI1 `tzcnt` (Trailing Zero Count). Counting leading/trailing ones in a runtime-variable value would require negating the input first.
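In Rust terms, the negate-first step is just this equivalence (a sketch to show where the extra NOT comes from, not necessarily how the standard library implements it):

```rust
// Counting trailing ones is counting trailing zeros of the bitwise complement;
// the extra NOT is where the (small) cost difference comes from.
fn trailing_ones_via_zeros(x: u64) -> u32 {
    (!x).trailing_zeros()
}

fn main() {
    let x: u64 = 0b1010_0111;
    assert_eq!(x.trailing_ones(), trailing_ones_via_zeros(x)); // both are 3
}
```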
ARM / AArch64 provide a leading-zero count (`clz`) but no trailing-zero count; the best option for trailing zeros is `rbit` / `clz`, i.e. bit-reverse and then count leading zeros (available since ARMv6T2 or ARMv7). https://godbolt.org/z/Yr7eac. Before that, compilers have to isolate the lowest set bit with `x & -x`, count the leading zeros of that, and take `31 - clz(x & -x)`.
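A sketch of that older fallback, written in Rust for readability (the point is what a compiler has to emit on pre-ARMv6T2 targets, not this source itself):

```rust
// Trailing-zero count built from a leading-zero count:
// isolate the lowest set bit with x & -x, then convert its clz to a bit index.
// The identity only holds for nonzero x, so zero is handled separately.
fn trailing_zeros_via_clz(x: u32) -> u32 {
    if x == 0 {
        return 32;
    }
    let lowest_set = x & x.wrapping_neg(); // x & -x isolates the lowest set bit
    31 - lowest_set.leading_zeros()
}

fn main() {
    assert_eq!(trailing_zeros_via_clz(0b1011_0000), 4);
    assert_eq!(trailing_zeros_via_clz(0), 32);
}
```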
(On x86, counting leading zeros is most efficient with BMI1 `lzcnt`. Without it, `bsr` can give you the position of the highest set bit, so the compiler needs `31 - bsr(x)` to implement clz. On AMD CPUs, `bsf` / `bsr` are significantly slower than their `tzcnt` / `lzcnt` counterparts, so if possible it's good to compile with `-march=native`, or the Rust equivalent `-C target-cpu=native`.)
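The `bsr` relationship, expressed in safe Rust for nonzero inputs (a sketch of the identity, not a use of the instruction itself):

```rust
// bsr returns the index of the highest set bit, so clz(x) == 31 - bsr(x)
// for nonzero 32-bit x; equivalently bsr(x) == 31 - clz(x).
fn highest_set_bit_index(x: u32) -> u32 {
    assert!(x != 0, "bsr is undefined for zero input");
    31 - x.leading_zeros() // what x86 `bsr` returns directly
}

fn main() {
    assert_eq!(highest_set_bit_index(0b0100_0000), 6);
}
```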
On AMD Zen, `lzcnt` is 1 uop, but `tzcnt` is 2 uops (https://uops.info/). Presumably the extra uop is either a bit-reverse or isolating the lowest set bit with `x & -x`, but either way it's generating something to feed to the `lzcnt` hardware.
Related fun fact: looping over the set bits to find their positions is often best done with `x = blsr(x)`, i.e. `x &= x - 1`, as the loop-carried dependency, and independently bit-scanning each value for its lowest set bit. That way there isn't a bit-scan and shift in the critical-path latency of the loop.
Related: Set the leading zero bits in any size integer C++
Related: bsr/bsf and tzcnt performance on historical x86 CPUs