bit-manipulation · intrinsics · integer-arithmetic

How to unset N right-most set bits


There is a relatively well-known trick for unsetting a single right-most bit:

y = x & (x - 1) // 0b001011100 & 0b001011011 = 0b001011000 :)

I'm finding myself with a tight loop to clear the n right-most set bits, but is there a simpler algebraic trick?

Assume relatively large n (n has to be < 64 for 64-bit integers, but it's often on the order of 20-30).

// x = 0b001011100 n=2
for (auto i=0; i<n; i++) x &= x - 1;
// x = 0b001010000

I've thumbed through my TAOCP Vol 4A a few times, but can't find any inspiration.

Maybe there is some hardware support for it?


Solution

  • For Intel x86 CPUs with BMI2, pext and pdep are fast. AMD before Zen 3 has very slow microcoded PEXT/PDEP (https://uops.info/), so be careful with this; other options might be faster on AMD, maybe even blsr (x &= x - 1) in a loop, or better a binary search on popcount (see below).
    Only Intel has a dedicated hardware execution unit for the mask-controlled pack/unpack that pext/pdep do, making them constant-time: 1 uop, 3-cycle latency, and they can only run on port 1.

    I'm not aware of other ISAs having a similar bit-packing hardware operation.

    (For the much simpler problem of simply clearing the low n bits, x & (-1ULL<<n) works if you don't need to handle n=64. Otherwise maybe bzhi against -1 to build the mask, then andn to clear with it.)
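    A minimal sketch of those two options (my function names; build with -mbmi -mbmi2 or -march=haswell for the intrinsics):

    #include <immintrin.h>   // _bzhi_u64, _andn_u64
    #include <cstdint>

    uint64_t clear_low_n_shift(uint64_t x, unsigned n){
        return x & (-1ULL << n);                 // n = 0..63 only: shifting a 64-bit value by 64 is UB
    }

    uint64_t clear_low_n_bzhi(uint64_t x, unsigned n){
        uint64_t mask = _bzhi_u64(-1ULL, n);     // low n bits set; n >= 64 leaves all 64 bits set
        return _andn_u64(mask, x);               // x & ~mask, so n = 0..64 all work
    }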


    pdep basics: pdep(-1ULL, a) == a. Taking the low popcnt(a) bits from the first operand, and depositing them at the places where a has set bits, will give you a back again.

    But if, instead of all-ones, your source of bits has the low N bits cleared, the first N set bits in a will grab a 0 instead of 1. This is exactly what you want.

    uint64_t unset_first_n_bits_bmi2(uint64_t a, int n){
        return _pdep_u64(-1ULL << n, a);
    }
    

    -1ULL << n works for n=0..63 in C. x86 asm scalar shift instructions mask their count (effectively &63), so that's probably what will happen for the C undefined-behaviour of a larger n. If you care, use n&63 in the source so the behaviour is well-defined in C, and it can still compile to a shift instruction that uses the count directly.
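    For example (my naming), a version that's well-defined for any n and should still compile to a single shlx feeding pdep:

    uint64_t unset_first_n_bits_bmi2_safe(uint64_t a, int n){
        return _pdep_u64(-1ULL << (n & 63), a);  // & 63 matches what the shift hardware does anyway
    }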

    There's a demo on Godbolt with a simple looping reference implementation, showing that they produce the same result for a sample input a and n.
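    In the same spirit, a quick self-contained check (not the linked demo verbatim; the sample values are the question's; build with -mbmi2, e.g. g++ -O3 -march=haswell):

    #include <immintrin.h>
    #include <cstdint>
    #include <cassert>

    uint64_t unset_first_n_bits_bmi2(uint64_t a, int n){
        return _pdep_u64(-1ULL << n, a);
    }

    uint64_t unset_first_n_bits_ref(uint64_t x, int n){  // the x &= x-1 loop from the question
        for (int i = 0; i < n; i++) x &= x - 1;
        return x;
    }

    int main(){
        uint64_t a = 0b001011100;                // sample input from the question
        assert(unset_first_n_bits_bmi2(a, 2) == 0b001010000);
        assert(unset_first_n_bits_bmi2(a, 2) == unset_first_n_bits_ref(a, 2));
    }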

    GCC and clang both compile it the obvious way, as written:

    # GCC10.2 -O3 -march=skylake
    unset_first_n_bits_bmi2(unsigned long, int):
            mov     rax, -1
            shlx    rax, rax, rsi
            pdep    rax, rax, rdi
            ret
    

    (SHLX is single-uop, 1 cycle latency, unlike legacy variable-count shifts that update FLAGS... except if CL=0. uops.info reports Alder Lake has 3 cycle latency for shlx, but that doesn't make much sense and is inconsistent with InstLatx64 measurements so I'm pretty sure SHLX/SHRX latency is still 1 cycle.)

    So this has 3 cycle latency from a->output (just pdep)
    and 4 cycle latency from n->output (shlx, pdep).

    And it's only 3 uops for the front-end.


    A semi-related BMI2 trick:

    pext(a,a) will pack the bits at the bottom, like (1ULL<<popcnt(a)) - 1 but without overflow if all bits are set.

    Clearing the low N bits of that with an AND mask, and expanding with pdep would work. But that's an overcomplicated expensive way to create a source of bits with enough ones above N zeros, which is all that actually matters for pdep. Thanks to @harold for spotting this in the first version of this answer.
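    For illustration only, that roundtrip would look something like this (my sketch; as noted, the plain pdep version above is simpler and does the same thing):

    uint64_t unset_first_n_bits_pext_pdep(uint64_t a, int n){  // n = 0..63
        uint64_t packed  = _pext_u64(a, a);       // popcnt(a) ones packed at the bottom
        uint64_t cleared = packed & (-1ULL << n); // drop the low n of those ones
        return _pdep_u64(cleared, a);             // scatter back into a's set-bit positions
    }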


    Without fast PDEP: perhaps binary search for the right popcount

    @Nate's suggestion of a binary search for how many low bits to clear is probably a good alternative to pdep.

    Stop when popcount(x>>c) == popcount(x) - N to find out how many low bits to clear, preferably with branchless updating of c. (e.g. c = foo ? a : b often compiles to cmov).

    Once you're done searching, x & (-1ULL<<c) uses that count, or just tmp << c to shift back the x>>c result you already have. Using right-shift directly is cheaper than generating a new mask and using it every iteration.

    High-performance popcount is relatively widely available on modern CPUs. (Although not baseline for x86-64; you still need to compile with -mpopcnt or -march=native).

    Tuning this could involve choosing a likely starting-point, and perhaps using a max initial step size instead of pure binary search. Getting some instruction-level parallelism out of trying some initial guesses could perhaps help shorten the latency bottleneck.
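    A sketch of that binary search, under my own assumptions (C++20 std::popcount, a plain compare/select you'd hope becomes cmov, no tuning of the starting point):

    #include <cstdint>
    #include <bit>        // std::popcount; compile with -mpopcnt or -march=native for the popcnt instruction

    uint64_t unset_first_n_bits_search(uint64_t x, int n){
        int target = std::popcount(x) - n;       // how many set bits must survive
        if (target <= 0) return 0;               // n >= popcount(x): everything gets cleared
        int lo = 0, hi = 64;                     // smallest c with popcount(x >> c) <= target
        while (lo < hi) {                        // 6 iterations for a 64-bit range
            int c = (lo + hi) / 2;
            if (std::popcount(x >> c) <= target)
                hi = c;                          // at least n set bits lie below c: c works, try smaller
            else
                lo = c + 1;                      // fewer than n set bits below c: move up
        }
        return (x >> lo) << lo;                  // shift the low bits out and back, as suggested above
    }

    As written it always does 6 popcount iterations; the tuning ideas above (a likely starting point, a capped initial step, some ILP from multiple guesses) are where you'd shave that down.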