This code sample takes an arbitrarily large number of bytes as inputs, one at the time, and maps them to a table with 32 values [0,31]. To simplify the example I used mod 32. The >> 3 operation is equivalent to division by 8. In case it is not clear, the "q" loop counts to 5 because numbers from 0 to 31 only require 5-bits of resolution.
Can this be improved in terms of speed?
uint8_t sto[10000];
uint8_t srt[32] = {<32 values in total>};
uint8_t t,q
uint64_t loop,cr = 0;
for(loop = 0; loop < 10000; loop++) {
t = srt[loop % 32];
for(q = 0; q < 5; q++) {
if(t & (0x1 << (4 - q)))
sto[cr >> 3] |= (0x1 << (7 - (cr % 8)));
cr++;
}
}
The code can be reduced to simply:
for (size_t i = 0; i < 10000; ++i)
sto[i] |= bits[i % NBytes];
after doing some work to prepare bits
. That loop can be unrolled, can be converted to units wider than uint8_t
, can be converted to SIMD code, and can be parallelized. After a few of those, such as converting to wider units, it might stream at memory bandwidth, so further optimization might not be useful.
The work to prepare bits
is to consolidate the bits from srt
into a full bitmask:
#define NSrt 32 // Number of elements in srt.
#define NBits (NSrt*5) // Number of bits used from srt.
#define NBytes (NBits/8) // Number of bytes used by NBits.
/* Consolidate the bits of srt into bytes, processing eight elements of
srt (i) and five elements of bits (j) per iteration.
*/
uint8_t bits[NBytes];
for (size_t i = 0, j = 0; i < NSrt; i += 8, j += 5)
{
/* Get five bits from each srt elements. (b0 does not need to be
masked because its high bits will be shifted out of the uint8_t
span.)
*/
uint8_t
b0 = srt[i + 0],
b1 = srt[i + 1] & 0x1f,
b2 = srt[i + 2] & 0x1f,
b3 = srt[i + 3] & 0x1f,
b4 = srt[i + 4] & 0x1f,
b5 = srt[i + 5] & 0x1f,
b6 = srt[i + 6] & 0x1f,
b7 = srt[i + 7] & 0x1f;
// Merge the five-bit segments into eight-bit units.
bits[j + 0] = b0 << 3 | b1 >> 2;
bits[j + 1] = b1 << 6 | b2 << 1 | b3 >> 4;
bits[j + 2] = b3 << 4 | b4 >> 1;
bits[j + 3] = b4 << 7 | b5 << 2 | b6 >> 3;
bits[j + 4] = b6 << 5 | b7;
}