rustarmbit-manipulationx86-64varint

Is there any difference in the efficiency of counting leading/trailing ones/zeros?


I'm designing a prefixed variable length integer.

Rust has methods for counting leading and trailing ones and zeros: https://doc.rust-lang.org/std/primitive.u64.html#method.leading_zeros

Is there any difference in the efficiency of these methods on x86_64, arm32 and arm64?

e.g. If counting trailing ones is faster than trailing zeros, I will use xxxx0111 instead of xxxx1000 for the length encoding byte (for three following bytes, in this example).


Solution

  • Counting trailing zeros is faster than trailing ones on all 3 ISAs: x86*, ARM, AArch64. They all provide zero-counting instructions like x86 bsf (find the lowest set bit) or x86 BMI1 tzcnt (Trailing Zero Count). Counting leading/trailing ones in a runtime-variable would require negating the input.

    ARM / AArch64 provide a leading-zero count, but the best option for trailing-zero is rbit / clz to bit-reverse (since ARMv6t2 or ARMv7). https://godbolt.org/z/Yr7eac. Before that, compilers have to isolate the lowest set bit with x & -x, count leading zeros of that, and take 31-clz(x&-x).


    (On x86, counting leading zeros is most efficient with BMI1. Without it, bsr can give you position of the highest set bit, so the compiler needs 31-bsr(x) to implement clz. On AMD CPUs, bsf / bsr are significantly slower than their tzcnt / lzcnt counterparts, so if possible it's good to compile with -march=native or whatever the Rust equivalent is.)

    On AMD Zen, lzcnt is 1 uop, tzcnt is 2 uops. https://uops.info/ Presumably the extra uop is either bit-reverse or isolate lowest set bit with x & -x, but either way generating something to feed to the lzcnt hardware.

    Related fun fact: looping over the set bits to find their positions is often best done with x = blsr(x) as the loop-carried dependency, x &= x-1, and independently bit-scan each result for the set bit. So there isn't a bit-scan and shift in the critical path latency of the loop.

    Related: Set the leading zero bits in any size integer C++ bsr/bsf and tzcnt performance on historical x86 CPUs.