The documentation for the parallel deposit instruction (PDEP
) in Intel's Bit Manipulation Instruction Set 2 (BMI2) describes the following serial implementation for the instruction (C-like pseudocode):
U64 _pdep_u64(U64 val, U64 mask) {
U64 res = 0;
for (U64 bb = 1; mask; bb += bb) {
if (val & bb)
res |= mask & -mask;
mask &= mask - 1;
}
return res;
}
See also Intel's pdep
insn ref manual entry.
This algorithm is O(n), where n is the number of set bits in mask
, which obviously has a worst case of O(k) where k is the total number of bits in mask
.
Is a more efficient worst case algorithm possible?
Is it possible to make a faster version that assumes that val
has at most one bit set, ie either equals 0 or equals 1<<r
for some value of r
from 0 to 63?
Just as an additional note, since this came up for me again, I found Sebastiano Vigna's method for finding the nth set bit to be faster in practice. It contains no branches or conditional moves.
Note that leq_bytes
and gt_zero_bytes
might be implementable using SSE instructions, but this version the advantage of being completely portable.
const uint64_t sMSBs8 = 0x8080808080808080ull;
const uint64_t sLSBs8 = 0x0101010101010101ull;
inline uint64_t
leq_bytes(uint64_t pX, uint64_t pY)
{
return ((((pY | sMSBs8) - (pX & ~sMSBs8)) ^ pX ^ pY) & sMSBs8) >> 7;
}
inline uint64_t
gt_zero_bytes(uint64_t pX)
{
return ((pX | ((pX | sMSBs8) - sLSBs8)) & sMSBs8) >> 7;
}
inline uint64_t find_nth_set_bit(uint64_t pWord, uint64_t pR)
{
const uint64_t sOnesStep4 = 0x1111111111111111ull;
const uint64_t sIncrStep8 = 0x8040201008040201ull;
uint64_t byte_sums = pWord - ((pWord & 0xA*sOnesStep4) >> 1);
byte_sums = (byte_sums & 3*sOnesStep4) + ((byte_sums >> 2) & 3*sOnesStep4);
byte_sums = (byte_sums + (byte_sums >> 4)) & 0xF*sLSBs8;
byte_sums *= sLSBs8;
const uint64_t k_step_8 = pR * sLSBs8;
const uint64_t place
= (leq_bytes( byte_sums, k_step_8 ) * sLSBs8 >> 53) & ~0x7;
const int byte_rank = pR - (((byte_sums << 8) >> place) & 0xFF);
const uint64_t spread_bits = (pWord >> place & 0xFF) * sLSBs8 & sIncrStep8;
const uint64_t bit_sums = gt_zero_bytes(spread_bits) * sLSBs8;
const uint64_t byte_rank_step_8 = byte_rank * sLSBs8;
return place + (leq_bytes( bit_sums, byte_rank_step_8 ) * sLSBs8 >> 56);
}