[SOLVED] How can I optimize this bit-packing function in C?

How can I optimize this bit-packing function in C?

This code sample takes an arbitrarily large number of bytes as inputs, one at the time, and maps them to a table with 32 values [0,31]. To simplify the example I used mod 32. The >> 3 operation is equivalent to division by 8. In case it is not clear, the "q" loop counts to 5 because numbers from 0 to 31 only require 5-bits of resolution.

Can this be improved in terms of speed?

      uint8_t sto[10000]; 
      uint8_t srt[32] = {<32 values in total>};
      uint8_t t,q
      
      uint64_t loop,cr = 0;
      
      for(loop = 0; loop < 10000; loop++) { 
          t = srt[loop % 32];     
          for(q = 0; q < 5; q++) {
              if(t & (0x1 << (4 - q))) 
                  sto[cr >> 3] |= (0x1 << (7 - (cr % 8)));
              cr++;
            }
      }

Solution

The code can be reduced to simply:

for (size_t i = 0; i < 10000; ++i)
    sto[i] |= bits[i % NBytes];

after doing some work to prepare bits. That loop can be unrolled, can be converted to units wider than uint8_t, can be converted to SIMD code, and can be parallelized. After a few of those, such as converting to wider units, it might stream at memory bandwidth, so further optimization might not be useful.

The work to prepare bits is to consolidate the bits from srt into a full bitmask:

#define NSrt    32          //  Number of elements in srt.
#define NBits   (NSrt*5)    //  Number of bits used from srt.
#define NBytes  (NBits/8)   //  Number of bytes used by NBits.

/*  Consolidate the bits of srt into bytes, processing eight elements of
    srt (i) and five elements of bits (j) per iteration.
*/
uint8_t bits[NBytes];
for (size_t i = 0, j = 0; i < NSrt; i += 8, j += 5)
{
    /*  Get five bits from each srt elements.  (b0 does not need to be
        masked because its high bits will be shifted out of the uint8_t
        span.)
    */
    uint8_t
        b0 = srt[i + 0],
        b1 = srt[i + 1] & 0x1f,
        b2 = srt[i + 2] & 0x1f,
        b3 = srt[i + 3] & 0x1f,
        b4 = srt[i + 4] & 0x1f,
        b5 = srt[i + 5] & 0x1f,
        b6 = srt[i + 6] & 0x1f,
        b7 = srt[i + 7] & 0x1f;

    //  Merge the five-bit segments into eight-bit units.
    bits[j + 0] = b0 << 3 | b1 >> 2;
    bits[j + 1] = b1 << 6 | b2 << 1 | b3 >> 4;
    bits[j + 2] = b3 << 4 | b4 >> 1;
    bits[j + 3] = b4 << 7 | b5 << 2 | b6 >> 3;
    bits[j + 4] = b6 << 5 | b7;
}