algorithmassemblyx86bit-manipulationbmi

Portable efficient alternative to PDEP without using BMI2?


The documentation for the parallel deposit instruction (PDEP) in Intel's Bit Manipulation Instruction Set 2 (BMI2) describes the following serial implementation for the instruction (C-like pseudocode):

U64 _pdep_u64(U64 val, U64 mask) {
  U64 res = 0;
  for (U64 bb = 1; mask; bb += bb) {
    if (val & bb)
      res |= mask & -mask;
    mask &= mask - 1;
  }
  return res;
}

See also Intel's pdep insn ref manual entry.

This algorithm is O(n), where n is the number of set bits in mask, which obviously has a worst case of O(k) where k is the total number of bits in mask.

Is a more efficient worst case algorithm possible?

Is it possible to make a faster version that assumes that val has at most one bit set, ie either equals 0 or equals 1<<r for some value of r from 0 to 63?


Solution

  • Just as an additional note, since this came up for me again, I found Sebastiano Vigna's method for finding the nth set bit to be faster in practice. It contains no branches or conditional moves.

    Note that leq_bytes and gt_zero_bytes might be implementable using SSE instructions, but this version the advantage of being completely portable.

    const uint64_t sMSBs8 = 0x8080808080808080ull;
    const uint64_t sLSBs8 = 0x0101010101010101ull;
        
    inline uint64_t
    leq_bytes(uint64_t pX, uint64_t pY)
    {
        return ((((pY | sMSBs8) - (pX & ~sMSBs8)) ^ pX ^ pY) & sMSBs8) >> 7;
    }
        
        
    inline uint64_t
    gt_zero_bytes(uint64_t pX)
    {
        return ((pX | ((pX | sMSBs8) - sLSBs8)) & sMSBs8) >> 7;
    }
        
        
    inline uint64_t find_nth_set_bit(uint64_t pWord, uint64_t pR)
    {
        const uint64_t sOnesStep4  = 0x1111111111111111ull;
        const uint64_t sIncrStep8  = 0x8040201008040201ull;
    
        uint64_t byte_sums = pWord - ((pWord & 0xA*sOnesStep4) >> 1);
        byte_sums = (byte_sums & 3*sOnesStep4) + ((byte_sums >> 2) & 3*sOnesStep4);
        byte_sums = (byte_sums + (byte_sums >> 4)) & 0xF*sLSBs8;
        byte_sums *= sLSBs8;
        
        const uint64_t k_step_8 = pR * sLSBs8;
        const uint64_t place
            = (leq_bytes( byte_sums, k_step_8 ) * sLSBs8 >> 53) & ~0x7;
        const int byte_rank = pR - (((byte_sums << 8) >> place) & 0xFF);
        const uint64_t spread_bits = (pWord >> place & 0xFF) * sLSBs8 & sIncrStep8;
        const uint64_t bit_sums = gt_zero_bytes(spread_bits) * sLSBs8;
        const uint64_t byte_rank_step_8 = byte_rank * sLSBs8;
        return place + (leq_bytes( bit_sums, byte_rank_step_8 ) * sLSBs8 >> 56);
    }