I have an unsigned 32-bit integer value that contains multiple groups of bits. All bits within one group are always the same. Each group has the same size n (from 1 to 32), which is unknown at compile time. There are no gaps between the groups.
I need to convert each group of these bits into one bit, and pack those bits contiguously into the least-significant bits of the result (group k's bit becomes bit k of the result).
I have come up with the following C function that iterates over these bits and packs them for an arbitrary n between 1 and 32.
uint32_t PackBits(uint32_t value, uint8_t n)
{
    if (n < 2) return value;
    uint32_t res = 0;
    uint32_t mask = 1; // Also works for: mask= 1 << n-1; mask= 1 << n-2; mask= 1 << n-3; ... ; mask= 1 << n-n;
    uint8_t s = 0;
    do {
        res |= (value & mask) >> s;
        mask <<= n;
        s += n-1;
    } while (mask);
    return res;
}
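To make the intended mapping concrete, a few example inputs and outputs as a small test (an added illustration, using PackBits above):
#include <assert.h>
#include <stdint.h>

// Assumes PackBits() from above is in scope.
int main(void)
{
    // n == 8: the four groups are the four bytes of 0xFF00FF00, i.e. 0,1,0,1
    // from low to high, which pack to 0b1010.
    assert(PackBits(0xFF00FF00u, 8) == 0xAu);
    // n == 16: two 16-bit groups, 0 and 1 -> 0b10.
    assert(PackBits(0xFFFF0000u, 16) == 0x2u);
    // n == 1: every group is a single bit, so the value is returned unchanged.
    assert(PackBits(0x0000000Fu, 1) == 0x0000000Fu);
    return 0;
}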
Q: Is there a way to accomplish this goal in C without looping over these bits, with some clever sequence of multiplications, De Bruijn sequences (see footnote 1), carryless additions, xor, negations, etc.?
For example, by taking advantage of the following bit-gathering property of multiplication, where a, b, c, d denote the least significant bits of each input byte:
byte 3 byte 2 byte 1 byte 0
├──────┤├──────┤├──────┤├──────┤
0000000d0000000c0000000b0000000a (variable least-significant bits)
× 00001000000001000000001000000001 (constant multiplier 0x08040201 for n==8)
────────────────────────────────
= 0000abcd00000abc000000ab0000000a (the product modulo 2^32)
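As a concrete sketch of that property in code (not part of the original question; note the gathered nibble comes out in the reverse order of what PackBits produces):
#include <stdint.h>

// n == 8: one flag bit per byte, gathered with the 0x08040201 multiplier from
// the diagram.  The partial products never overlap, so no carries disturb the
// result; bits 27..24 of the product are a,b,c,d.
static uint32_t gather8(uint32_t value)
{
    uint32_t lsb = value & 0x01010101u;   // keep bit 0 of each byte
    return (lsb * 0x08040201u) >> 24;     // bits 3..0 = a,b,c,d (a is the MSB,
                                          // i.e. reversed relative to PackBits)
}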
Similar bit-gathering operations can be made with different constant multipliers for n between 6 and 31. Maybe similar bit-gathering operations can be done with division, too.
Footnote 1: De Bruijn sequences as perfect hash functions:
For n==2, the operation (value*0x2AAB & 0x7FFF8000) >> 15 generates a UNIQUE 16-bit number which could index a 64K-element LUT, or have something done to map it to the desired bits. 64K of uint16_ts would fit in L2 cache on modern CPUs, but would only be reasonable if we were doing this for a lot of values in a row, so it would pay for the demand misses to get data into L2 / L1d cache in the first place, and pay for the later cache misses from stuff that wouldn't have been evicted by other strategies that run more instructions but avoid a LUT.
For n==3, (value*0x40024B7 & 0x0FFC0000) >> 18 generates a UNIQUE 10-bit value (which can index a 1024-element LUT).
For n==4, (value*0x22AAF & 0x1FE00000) >> 21 generates a UNIQUE 8-bit value (which can index a 256-element LUT). A 256-element LUT is actually reasonable, especially since its entries are only uint8_t in size.
For n==5, (value*0x84ADF & 0x03F00000) >> 20 generates a UNIQUE 6-bit value (which can index a 64-element LUT). A 64-element LUT is actually reasonable, especially since its entries are only uint8_t in size.
(These numbers haven't been checked carefully for uniqueness.)
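For illustration, here is how the n==4 variant could be wired up (a sketch using the unverified constants above; the table would be populated once, e.g. by enumerating all 256 valid inputs):
#include <stdint.h>

// Hypothetical perfect-hash LUT for n == 4, using the (unverified) constants
// from the footnote.  lut4[hash4(v)] must equal the packed result for every
// valid input v (each 4-bit group all-0s or all-1s).
static uint8_t lut4[256];

static uint32_t hash4(uint32_t value)
{
    return ((value * 0x22AAFu) & 0x1FE00000u) >> 21;
}

static void init_lut4(void)
{
    for (uint32_t packed = 0; packed < 256; packed++) {
        uint32_t value = 0;
        for (unsigned g = 0; g < 8; g++)      // expand each packed bit
            if (packed & (1u << g))           // back into a full 4-bit group
                value |= 0xFu << (4 * g);
        lut4[hash4(value)] = (uint8_t)packed;
    }
}

static uint32_t pack4_lut(uint32_t value)
{
    return lut4[hash4(value)];   // assumes hash4 really is collision-free
}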
Summary:
- Use x86 BMI2 pext or equivalent if available (except on AMD Zen 2 and earlier, where it's very slow), with a 32-entry LUT of uint32_t masks for n=1..32.
- Otherwise, special-case n<4 with custom functions; n>=4 can use a LUT of a mask, multiplier, and shift count for each n.
- n>=11 can use one shift and AND to keep 1 bit, or 2 bits that span the boundary between 2 groups.
- Tested version on Godbolt that handles all n.
...ddddccccbbbbaaaa masked to keep adjacent pairs like ...000dc000000ba000 gives us half as many groups to deal with vs. keeping the low or high bit of each group. A magic-multiplier LUT works down to n=4, vs. n=6 in @MSalters's answer. Starting with bit-pairs also helps significantly for pack2 and pack3.
All of this is on Godbolt (and a bit more, including SSE2 / SSE4.1 versions that might be worth using for n=2, 4, and 8, a unit test with random inputs for each n, and an AVX-512 + GFNI idea for n=2). If you care about a specific compiler and/or CPU, there are probably some tweaks to convince a compiler to emit slightly better asm for some cases.
Edit changelog: much faster test code for making all bits in a group equal, and faster looping versions. Various tuning: dispatching to the LUT seems good for n=4..16; don't use the n=11..16 function, so the switch has fewer different cases (a couple of cmp/jcc instead of a jump table).
#include <stdint.h>
#define NDEBUG   // asserts are mostly for documentation about which n ranges different functions handle
#include <assert.h>
#include <stdalign.h>

// n=17..32 is just x&1
uint32_t pack_n_11_to_16(uint32_t x, unsigned n)
{
    assert(n>=11 && n<=16);
    x >>= n-1;
    return x & 3;
}

// unsigned long n avoids the need to zero-extend to pointer width, if it didn't inline.
static inline
uint32_t pack_LUT(uint32_t x, unsigned long n)
{
    assert(n>=4 && n<=16);   // table currently only goes up to n=16
    struct LUT {
        // group both arrays into one static variable so compilers don't waste instructions getting the base address separately for each.
        uint8_t shift[32-4];   // counts first so the offset to next array is small, fitting in a disp8
        alignas(8)
        struct {uint32_t mask, mult;} maskmult[];   // flexible array member that's as long as we want it to be.
    };
    static const struct LUT lut = {
        .maskmult = {
            {0b00011000000110000001100000011000, ((1uL<<18) | (1u<<12) | (1u<<6) | 1)<<3 },   // n=4 ; 8 groups = 4 pairs
            {0b00000011000000001100000000110000, ((1uL<<16) | (1u<<8) | 1)<<6 },              // n=5 ; 6 groups = 3 pairs
            {0b00100000000001100000000001100000, ((1uL<<20) | (1u<<10) | 1)<<2 },             // n=6 ; 5 groups = 3ish pairs
            {0b00000000001100000000000011000000, ((1uL<<12) | 1)<<0 },   // n=7 ; 4 groups, 2 pairs // no high garbage even without shifting to the top, shift=18
            {0b00000001100000000000000110000000, ((1uL<<14) | 1)<<0 },   // n=8 ; 4 groups, 2 pairs
            {0b00000100000000000000001100000000, ((1uL<<16) | 1)<<0 },   // n=9 ; 3 groups, 2ish pairs
            {0b00100000000000000000011000000000, ((1uL<<18) | 1)<<0 },   // n=10 ; 3 groups, 2ish pairs
            {0b00000000000000000000110000000000, 1},   // n=11 one pair, don't need to multiply anymore if we want to dispatch to a third version.
            {0b00000000000000000001100000000000, 1},   // n=12
            {3<<12, 1},   // n=13
            {3<<13, 1},   // n=14
            {3<<14, 1},   // n=15
            {3<<15, 1},   // n=16
            // Use a simpler strategy for larger n to save table size, even though dispatcher has more to choose from.
        },
        .shift = {
            32-8, 32-6, 32-5,   // n = 4..6 shift from the top to clear high garbage
            18, 21, 24, 27,     // n=7..10 have a second bit-pair (or single bit) at the top
            // n=11 and up just have 2 or fewer groups, trivial shift and keep only 2 or 1 bit
            10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
            // could choose multipliers so these are all 30
        }
    };
    x &= lut.maskmult[n-4].mask;
    x *= lut.maskmult[n-4].mult;
    x >>= lut.shift[n-4];
    return x;
}
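As a quick sanity check of the table entries (an added example, not part of the answer's test harness; NDEBUG above disables assert, so this just returns nonzero on success):
// n=4: the groups of 0xF0F0F0F0 (low to high) are 0,1,0,1,0,1,0,1 -> 0b10101010.
// n=8: the bytes of 0xFF00FF00 are 0,1,0,1 -> 0b1010.
static int pack_LUT_selftest(void)
{
    return pack_LUT(0xF0F0F0F0u, 4) == 0xAAu
        && pack_LUT(0xFF00FF00u, 8) == 0xAu;
}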
On Intel P-cores (i.e. Sandybridge-family), variable-count shifts are 3 uops due to having to preserve FLAGS for the count=0 case (https://uops.info/). BMI2 shrx avoids that, but if we had BMI2 on an Intel CPU we'd just use pext. Still, for a loop using the same n repeatedly, the shift-based pack_n_11_to_16 (or just x & 1 for n>=17) should be very good for large n, avoiding a multiply compared to the LUT version.
And the special cases for n=2 and n=3:
uint32_t pack2(uint32_t x)
{
    // 0po00nm00lk00ji00hg00fe00dc00ba0   AND mask to keep bit-pairs
    x &= 0b01100110011001100110011001100110;
    // 0ponm0000lkji0000hgfe0000dcba000   x += x<<2 /*LEA*/; x &= mask;
    x += x<<2;
    x &= 0b01111000011110000111100001111000;
    // 0ponmlkji00000000hgfedcba0000000   x += x<<4; x &= mask2;
    x += x<<4;
    x &= 0b01111111100000000111111110000000;
    // ponmlkjihgfedcba0000000000000000   x *= (1<<1) | (1<<9)
    x *= (1u<<1)|(1u<<9);
    x >>= 16;
    return x;
}
uint32_t pack3(uint32_t x) {
    // 0000j00i00h00g00f00e00d00c00b00a   // incoming x
    x &= 0b00001100001100001100001100001100;
    // 0000ji0000hg0000fe0000dc0000ba00   // bit-pairs: ji, hg, fe, dc, ba
    x += x<<4;   // x *= 17
    // ji00jihg00hgfe00fedc00dcba00ba00
    x &= 0b00001111000000001111000000001111;
    // 0000jihg00000000fedc00000000ba00   // 1<<0 partial product
    // 00000000fedc00000000ba0000000000   // partial product from 1<<8
    // fedc00000000ba000000000000000000   // partial product from 1<<16
    // ((1<<0) | (1<<8) | (1<<16))<<4
    x *= (1<<4) | (1<<12) | (1<<20);
    // with the multiplier left-shifted by 4, result goes to the top, no high garbage after shifting
    x >>= 22;
    // x &= 0x3FF; // 10-bit.
    return x;
}
To dispatch to these functions according to n, I made a wrapper that you'd actually call. Make the others static inline. Build with profile-guided optimization if you want to hint the compiler which n values are common in your use-cases, so it favours them in the chain of compare/branch. If you're using this in a loop, make sure it can inline. Ideally you'd want loop unswitching, where the compiler turns this into multiple loops with branching to select one loop, instead of branching every loop iteration. But at least the LUT loads and constant setup and so on can be hoisted.
uint32_t pack_n(uint32_t x, unsigned n)
{
#ifdef HAVE_FAST_PEXT
    return pack_pext(x, n);
#endif
    switch (n){
    default: return x & 1;   // one group for n >= 17
    case 11 ... 16:   //return pack_n_11_to_16(x, n);
    case 8:           // return pack8_sse2(x);
    case 10:
    case 9:
    case 7:
    case 6:
    case 5:
    case 4:
        return pack_LUT(x, n);
    case 3: return pack3(x);
    case 2:
#ifdef __SSE4_1__
        return pack2_sse4(x);
#else
        return pack2(x);
#endif
    case 1: return x;   // trivial
    }
}
And to see how the masking works, before I made the LUT version I made n=4, n=5, and two n=8 versions. (One n=8 version uses the bit-pair trick, the other starts with 4 separate 1-bit groups.) AArch64 does very well with the bit-pairs pack8, avoiding having to construct a 32-bit multiplier constant. (I'm assuming most AArch64 CPUs have efficient support for orr reg, reg, reg, lsl #imm with a shifted source operand.)
// see pack5 and pack8 on Godbolt
uint32_t pack4(uint32_t x) {
    // hhhhggggffffeeeeddddccccbbbbaaaa
    x &= 0b00011000000110000001100000011000;
    // 000hg000000fe000000dc000000ba000   // result, and 1<<0 partial product
    // 00000fe000000dc000000ba000000000   // 1<<6 partial product
    // 0000000dc000000ba000000000000000   // 1<<12
    // 0dc000000ba000000000000000000000   // 1<<18
    // with an extra shift by 3 to put the result at the top, avoiding high garbage
    x *= (1uL<<21) | (1uL<<15) | (1u<<9) | (1u<<3);
    x >>= (32-8);
    return x;
}
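The bit-pairs n=8 version isn't shown here (it's on Godbolt); a sketch of what it presumably looks like, derived from the n=8 row of the LUT above (the actual pack8 may differ in details):
// Sketch derived from the n=8 LUT entry; not necessarily identical to the
// pack8 on Godbolt.
uint32_t pack8_pairs(uint32_t x) {
    // ddddddddccccccccbbbbbbbbaaaaaaaa   only 4 groups for n=8
    x &= 0b00000001100000000000000110000000;   // keep the dc and ba pairs spanning group boundaries
    // bits 24:23 = d:c, bits 8:7 = b:a
    x *= (1u<<14) | 1u;                         // the 1<<14 partial product moves b:a up to bits 22:21
    // bits 24:21 = d:c:b:a, nothing set above bit 24
    return x >> 21;                             // result: bit0=a, bit1=b, bit2=c, bit3=d
}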
pext - parallel bits extract
A few ISAs, like x86 with BMI2 pext/pdep, have hardware support for bit pack / deposit. If supported, this is ideal. (Except on AMD Zen 2 and earlier, where it's supported only by very slow microcode, as bad as one per 50 to 150 cycles throughput.) See also my answer on a related question, which mentions AArch64 SVE2 BITPERM on SIMD vector elements, and the RISC-V V extension's vcompress.vm, which packs with byte granularity based on a mask.
Currently there are no portable idioms which compilers recognize and turn into pext when it's available on the target, so you can only get it via an intrinsic:
static const uint32_t extract_masks[32] = {-1uL /* n=1 */, 0x55555555uL /*n=2*/, ...};
return _pext_u32(x, extract_masks[n-1]);
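If you don't want to hand-write all 32 masks, they're cheap to generate (a sketch of my own, assuming the convention used elsewhere in this answer that a partial group at the top is ignored):
#include <stdint.h>

// Build the pext source mask for a given n: bit 0 of each complete n-bit group.
// A partial group at the top (when n doesn't divide 32) contributes nothing.
static uint32_t extract_mask(unsigned n)
{
    uint32_t mask = 0;
    for (unsigned pos = 0; pos + n <= 32; pos += n)
        mask |= 1u << pos;
    return mask;   // n=1 -> 0xFFFFFFFF, n=2 -> 0x55555555, ...
}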
If hardware support for pext/pdep isn't available, you don't want a generic software emulation of it with a table of masks for each n; that would be slow because it's a hard operation to emulate. (You'd be encoding the bit-positions to extract into a mask that needs decoding, instead of into instructions that actually do the operations.)
n=8 can be done on x86 with movd xmm0, edi / pmovmskb eax, xmm0 to extract the high bit of each byte into a bitmask (_mm_cvtsi32_si128 / _mm_movemask_epi8). If that case is common and important for performance, it could be worth considering, especially if you have an array of multiple inputs of the same n to do, so you can actually use SIMD to do 4 or 8 at once (and separate the 4-bit nibbles of the result if needed).
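With intrinsics, that might look like this (a minimal scalar sketch; not necessarily the same as the pack8_sse2 in the Godbolt link):
#include <stdint.h>
#include <emmintrin.h>   // SSE2 intrinsics

// n == 8: the high bit of each byte equals that byte's group value.
static uint32_t pack8_movmsk(uint32_t x)
{
    __m128i v = _mm_cvtsi32_si128((int)x);        // movd: the 4 bytes sit in the low dword
    return (uint32_t)_mm_movemask_epi8(v) & 0xF;  // pmovmskb: high bit of each byte -> bits 0..3
}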
See pack2_sse4, pack4_sse4 and pack8_sse2 in the Godbolt link above for possibilities. Some might be worth using for compile-time-constant n for a single x at a time, but not with extra branching to special-case the dispatching. pmaddwd or SSSE3 pmaddubsw could also be useful for moving bits within 16 or 32-bit elements.
You mentioned graphics and huge bitmaps; decoding 2 or more inputs in parallel is very possible, especially with AVX2 for 32-byte vectors. If that's your real problem, ask a separate question; decoding one at a time sucks; do 2 at a time with uint64_t at least for larger n. I've mostly been assuming one-off use cases in the rest of this answer, since you only asked about decoding one at once.
With AVX-512VBMI, vpmultishiftqb can get the desired bit to a byte within a 64-bit element of a vector, from where you can get them into a mask. Broadcast-load and/or shuffle to feed it if necessary. Or with AVX512-BITALG, vpshufbitqmb goes directly to a mask instead of grabbing 8-bit groups into vectors.
https://www.felixcloutier.com/x86/vpshufbitqmb
http://0x80.pl/articles/avx512-galois-field-for-bit-shuffling.html discusses GFNI for shuffling bits within a byte, great for vectorizing small n, especially n=2, to produce a vector result. See pack2_GFNI_AVX2 on Godbolt for a sketch of an idea.
For large-enough n (fewer, wider groups), AND masking and then moving bits around with magic multiplier constants, as in "How to create a byte out of 8 bool values (and vice versa)?", works very well, and can do the job in one AND/multiply/shift step for n>=4.
A multiply adds multiple shifted copies of one input, according to the set bit positions in the multiplier. If the bits you care about are few enough and far enough apart, you can avoid overlap of partial products that could flip a bit you want to deposit, directly or via carry-out from lower bits. A multiply can also be used to horizontally sum the bytes in a 32 or 64-bit integer, for example, by having the partial products each align a different byte into the top byte.
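As a side note on that last point, the classic horizontal byte-sum looks like this (a standard trick, not specific to this problem; it's only exact when the total fits in one byte, so no carries cross between the partial products' byte lanes):
#include <stdint.h>

// Each set byte of the 0x01010101 multiplier adds a shifted copy of x, so the
// top byte of the product accumulates b0+b1+b2+b3 (valid when that sum <= 255).
static uint32_t sum_bytes(uint32_t x)
{
    return (x * 0x01010101u) >> 24;
}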
For any given multiplier that puts bits adjacent to each other, we can left-shift the multiplier to bake that left-shift into the multiply. Placing the bits at the top lets a right shift bring them to the bottom zero-extended, with no high garbage.
The shift count is 32 - (32/n), but we use another LUT for it because division is expensive. So 12 B instead of 8 per struct, or a separate uint8_t array, which is probably better. (Smaller total cache footprint for the LUT if many different n values are frequently used, and the array of mask and multiplier can then have 8-byte elements: cheaper address math, and it can be naturally aligned so both are in one cache line, and AArch64 ldp load-pair won't split across cache lines.)
MSalters' answer suggests a way to use one lookup table and the same code for all n>=6, but apparently that bit-reverses the result? That's fine on ARM: use rbit + and instead of a right shift; otherwise it's a problem.
@Konstantin Makarov's answer shows that we can separate into odd/even and combine at the end to extend the magic-multiplier idea to smaller n with special-case functions, but starting with bit-pairs can do better. I used his test harness and comment style as a starting point for my versions.
I came up with a more efficient pack3 (above) which starts out by masking to get 5 bit-pairs spanning the boundaries between 3-bit groups. (The high bit of one group is next to the low bit of the next group.) It compiles nicely:
# clang 19 -O3
pack3:
        and     edi, 204522252
        mov     eax, edi
        shl     eax, 4                  # shifting EDI would have taken MOV off the critical path; Ice Lake has mov-elim disabled
        or      eax, edi                # note that clang turned + into OR, seeing that no set bits can overlap
        and     eax, 251719692
        imul    eax, eax, 1052688
        shr     eax, 22
        ret
Instruction-level parallelism is poor; this is one chain of dependent work. But it's quite short, so it can overlap with surrounding code nicely.
For x86-64 as a non-inline function, this is about as efficient as possible. x86 can multiply by 3, 5, and 9 with LEA (e.g. lea eax, [rdi + rdi*8]), but it's only a 2-bit shift count, so 17 needs either an imul eax, edi, 17 (1 uop but 3 cycle latency) or 3 instructions (including a mov) with 2 cycle latency, assuming mov-elimination happens. Compilers prefer lower latency.
x86 allows 32-bit immediates for all these instructions. AArch64 needs to materialize the 3 constants with 2 instructions each (mov + movk), because its bitwise instructions can only encode repeating patterns of contiguous set bits in 2, 4, 8, 16, or 32-bit groups, not odd lengths. Once constants are set up (e.g. hoisted out of a loop), orr w8, w8, w8, lsl #4 to shift-and-OR in a single instruction is better than x86 can do.
n==2 with repeated shift/OR to make more bits contiguous
n=2 would be doable in log2(32/n = 16) = 4 mask/shift/OR steps, just straightforwardly closing up gaps between pairs of groups to get half as many groups that are twice as wide, as sketched below.
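A sketch of that straightforward version (for comparison; the pack2 above is what I'd actually use):
// n == 2 the straightforward way: keep one bit per group, then repeatedly
// close up the gaps, halving the number of groups each step.
uint32_t pack2_4step(uint32_t x)
{
    x &= 0x55555555;                  // one bit per 2-bit group: 16 groups of 1 bit
    x = (x | (x >> 1)) & 0x33333333;  // 8 groups of 2 bits
    x = (x | (x >> 2)) & 0x0F0F0F0F;  // 4 groups of 4
    x = (x | (x >> 4)) & 0x00FF00FF;  // 2 groups of 8
    x = (x | (x >> 8)) & 0x0000FFFF;  // 1 group of 16
    return x;
}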
Starting with bit-pairs basically skips the first step, leaving only 3. It compiles pretty efficiently for x86-64 and AArch64:
# x86-64 gcc -O3
pack2:
        and     edi, 1717986918
        lea     edx, [rdi+rdi*4]
        and     edx, 2021161080
        mov     eax, edx
        sal     eax, 4
        add     eax, edx
        and     eax, 2139127680
        imul    eax, eax, 514
        shr     eax, 16
        ret
This is 8 ALU operations, 9 uops total for x86-64, including the mov (and not counting ret, since it inlines). Clang makes it 9 instructions for AArch64, but avoids a multiply.
# armv8 clang -O3
pack2:
        and     w8, w0, #0x66666666    # repeating bit-patterns can be immediates
        orr     w8, w8, w8, lsl #2
        and     w8, w8, #0x78787878
        orr     w8, w8, w8, lsl #4     # x |= x<<4;
        and     w8, w8, #0x7f807f80
        lsr     w9, w8, #7             # instead of multiply, 2 shifts
        lsl     w8, w8, #1
        bfi     w8, w9, #16, #8        # to set up for bitfield-insert
        lsr     w0, w8, #16
        ret
Left shifting by 1 to start with could simplify the end to just another orr with a shifted source (x += x<<16;). In a loop with a constant in a register, the first instruction could be and w8, w15, w0, lsl #1 on AArch64 and ARM32, but on other ISAs that shift would be part of the critical-path latency.
If pack2 had to start from 0p0o0n0m0l0k0j0i0h0g0f0e0d0c0b0a evenly-spaced bits instead of bit-pairs, x += x & 0x11111111; could left shift the odd bits, like shifting a up next to b etc., to get to the same point as with x & 0b0110.... Adding a bit to itself shifts it left by 1, but adding 0 leaves it unchanged. But x86 can do the obvious x += x<<1; x &= mask; with LEA + AND vs. MOV / AND / ADD.
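For concreteness, sketches of those two starting points (both produce the ...0dc00ba0 layout that pack2 above starts from after its first AND):
// Option 1: the add-to-self trick.  x & 0x11111111 selects bits 0,4,8,...;
// adding each of those to itself moves it up by 1, next to the neighbouring group's bit.
uint32_t pair_up_add(uint32_t x)    // input: 0p0o...0d0c0b0a, one bit per group
{
    x += x & 0x11111111u;
    return x;                       // bits 2:1 = b:a, 6:5 = d:c, ... (nothing else set)
}

// Option 2: the obvious way, which x86 can do with LEA + AND.
uint32_t pair_up_shift(uint32_t x)
{
    x += x << 1;                    // duplicate every bit into the position above it
    return x & 0x66666666u;         // keep bits 2:1, 6:5, ... (the 0b0110... mask)
}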