I would like to know if the following is possible in any of the SIMD families of instructions.
I have a qword input with 63 significant bits (never negative). Each sequential 7 bits starting from the LSB is shuffle-aligned to a byte, with a left-padding of 1 (except for the most significant non-zero byte). To illustrate, I'll use letters for clarity's sake.
The result is only the significant bytes, thus 0 - 9 in size, which is converted to a byte array.
In: 0|kjihgfe|dcbaZYX|WVUTSRQ|PONMLKJ|IHGFEDC|BAzyxwv|utsrqpo|nmlkjih|gfedcba
Out: 0kjihgfe|1dcbaZYX|1WVUTSRQ|1PONMLKJ|1IHGFEDC|1BAzyxwv|1utsrqpo|1nmlkjih|1gfedcba
Size = 9
In: 00|nmlkjih|gfedcba
Out: |0nmlkjih|1gfedcba
Size = 2
I do understand the padding is separate. The shuffle-aligning is my question. Is this possible?
EDIT 2
Here is my updated code. Gets a sustained 46 M / sec for random-length input on single thread Core 2 Duo 2 GHz, 64 bit.
private static int DecodeIS8(long j, ref byte[] result)
{
if (j <= 0)
{
return 0;
}
int size;
// neater code: gives something to break out of
while (true)
{
result[0] = (byte)((j & 0x7F) | 0x80);
size = 0;
j >>= 7;
if (j == 0) break;
result[1] = (byte)((j & 0x7F) | 0x80);
size++;
j >>= 7;
if (j == 0) break;
result[2] = (byte)((j & 0x7F) | 0x80);
size++;
j >>= 7;
if (j == 0) break;
result[3] = (byte)((j & 0x7F) | 0x80);
size++;
j >>= 7;
if (j == 0) break;
result[4] = (byte)((j & 0x7F) | 0x80);
size++;
j >>= 7;
if (j == 0) break;
result[5] = (byte)((j & 0x7F) | 0x80);
size++;
j >>= 7;
if (j == 0) break;
result[6] = (byte)((j & 0x7F) | 0x80);
size++;
j >>= 7;
if (j == 0) break;
result[7] = (byte)((j & 0x7F) | 0x80);
size++;
j >>= 7;
if (j == 0) break;
result[8] = (byte)j;
return 9;
}
result[size] ^= 0x80;
return size + 1;
}
Yes, it's possible to use MMX/SSE's pmullw
instruction (intrinsic function: _mm_mullo_pi16
) to do per-element shifts.
The basic idea is to extract alternating 7-bit elements with an AND instruction and perform the pmullw
to shift the elements into place. This will accomplish the task for half of the elements, so the process will need to be repeated with a couple of extra shifts.
#include <stdio.h>
#include <stdint.h>
#include <mmintrin.h>
__m64 f(__m64 input) {
static const __m64 mask = (__m64) 0xfe03f80fe03f80UL;
static const __m64 multiplier = (__m64) 0x0080002000080002UL;
__m64 t0 = _mm_and_si64 (input, mask);
__m64 t1 = _mm_and_si64 (_mm_srli_si64 (input, 7), mask);
t0 = _mm_mullo_pi16 (t0, multiplier);
t1 = _mm_mullo_pi16 (t1, multiplier);
__m64 res = _mm_or_si64 (t0, _mm_slli_si64 (t1, 8));
/* set most significant bits, except for in most significant byte */
return _mm_or_si64 (res, (__m64) 0x0080808080808080UL);
}
int main(int argc, char *argv[])
{
int i;
typedef union {
__m64 m64;
unsigned char _8x8[8];
} type_t;
/* 0x7f7e7c7870608080 = {127, 63, 31, 15, 7, 3, 2, 1, 0} */
type_t res0 = { .m64 = f((__m64) 0x7f7e7c7870608080UL) };
for (i = 0; i < 8; i++) {
printf("%3u ", res0._8x8[i]);
}
puts("");
return 0;
}
The mask
extracts alternating 7-bit elements. The multiplier
is a constant which allows us to specify per-element shifts. It's derived from looking at the masked input:
00000000|dcbaZYX0|000000PO|NMLKJ000|0000BAzy|xwv00000|00nmlkji|h0000000
and realizing that
00000000|dcbaZYX0 needs to be shifted by 7 (or multiplied by 2^7, 128, 0x0080)
000000PO|NMLKJ000 needs to be shifted by 5 (or multiplied by 2^5, 32, 0x0020)
0000BAzy|xwv00000 needs to be shifted by 3 (or multiplied by 2^3, 8, 0x0008)
00nmlkji|h0000000 needs to be shifted by 1 (or multiplied by 2^1, 2, 0x0002)
This function writes 8-bytes at a time (instead of the 9-bytes your 9 7-bit elements would unpack to), so you'll have to advance the source pointer by only 7-bytes after each iteration. Because of this, a conversion to SSE2 is a bit more complicated.
I don't think it's possible to use a different mask and multiplier for t1
in order to avoid the shifts, since t1
's elements will cross 16-bit boundaries, which will prevent pmullw
from working. But, it still might be possible to optimize somehow.
I haven't benchmarked this, but I suspect it's significantly faster than your scalar version. If you benchmark it, please post the results. I'd be very interested to see them.
All in all, the algorithm comes out to be 2 shifts, 2 ors, 2 ands, and two multiplies (and a few moves) to generate 8-bytes.