I've got a large, tightly packed array of 12-bit integers stored in the following repeating bit-packing pattern (where n in An/Bn is the bit number, and A and B are the first two 12-bit integers in the array):
| byte0 | byte1 | byte2 | etc..
| A11 A10 A9 A8 A7 A6 A5 A4 | B11 B10 B9 B8 B7 B6 B5 B4 | B3 B2 B1 B0 A3 A2 A1 A0 | etc..
which I'm bit reordering into the following pattern:
| byte0 | byte1 | byte2 | etc..
| A11 A10 A9 A8 A7 A6 A5 A4 | A3 A2 A1 A0 B11 B10 B9 B8 | B7 B6 B5 B4 B3 B2 B1 B0 | etc..
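For example, if A = 0xABC and B = 0xDEF, the source bytes are 0xAB 0xDE 0xFC and the reordered bytes are 0xAB 0xCD 0xEF.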
I've got it working with a loop that processes one 3-byte pixel pair per iteration:
void CSI2toBE12(uint8_t* pCSI2, uint8_t* pBE, uint8_t* pCSI2LineEnd)
{
    while (pCSI2 < pCSI2LineEnd) {
        pBE[0] = pCSI2[0];                                   // A11..A4
        pBE[1] = ((pCSI2[2] & 0xf) << 4) | (pCSI2[1] >> 4);  // A3..A0 | B11..B8
        pBE[2] = ((pCSI2[1] & 0xf) << 4) | (pCSI2[2] >> 4);  // B7..B4 | B3..B0
        // Go to next 12-bit pixel pair (3 bytes)
        pCSI2 += 3;
        pBE += 3;
    }
}
but working with byte granularity isn't great for performance. The target CPU is a 64-bit ARM Cortex-A72 (Raspberry Pi Compute Module 4). For context, this code converts MIPI CSI-2 bit-packed raw image data to Adobe DNG's bit-packing.
I'm hoping I can get a considerable performance improvement using SIMD intrinsics, but I'm not really sure where to start. I've got the SIMDe header to translate intrinsics, so AVX/AVX2 solutions are welcome too.
The NEON ld3 instruction is ideal for this; it loads 48 bytes and unzips them into three NEON registers. Then you just need a couple of shifts and ORs.
I came up with the following:
void vectorized(const uint8_t* pCSI2, uint8_t* pBE, const uint8_t* pCSI2LineEnd)
{
    while (pCSI2 < pCSI2LineEnd) {
        // Load 48 bytes, de-interleaved: val[0] = byte0s, val[1] = byte1s, val[2] = byte2s
        uint8x16x3_t in = vld3q_u8(pCSI2);
        uint8x16x3_t out;
        out.val[0] = in.val[0];
        // Same nibble shuffles as the scalar version, 16 pixel pairs at a time
        out.val[1] = vorrq_u8(vshlq_n_u8(in.val[2], 4), vshrq_n_u8(in.val[1], 4));
        out.val[2] = vorrq_u8(vshlq_n_u8(in.val[1], 4), vshrq_n_u8(in.val[2], 4));
        // Re-interleave and store 48 output bytes
        vst3q_u8(pBE, out);
        pCSI2 += 48;
        pBE += 48;
    }
}
With gcc, the generated assembly looks like what you would expect. (There is one mov that could be eliminated with better register allocation, but that's pretty minor.)
Unfortunately clang has what looks like a bizarre missed optimization, where it breaks the 4-bit right shift into a 3-bit and a 1-bit shift. I filed a bug.
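As an aside, the NEON loop consumes 48 input bytes per iteration, so if a row's byte count isn't a multiple of 48 the tail needs separate handling; one simple option is to fall back to the scalar version. A minimal sketch, assuming the two functions above (convertLine and lineBytes are illustrative names, not part of the original code):
void convertLine(uint8_t* pCSI2, uint8_t* pBE, size_t lineBytes)
{
    // Whole 48-byte groups go through the NEON loop...
    size_t vecBytes = lineBytes - (lineBytes % 48);
    vectorized(pCSI2, pBE, pCSI2 + vecBytes);
    // ...and any leftover 3-byte pixel pairs use the scalar loop.
    CSI2toBE12(pCSI2 + vecBytes, pBE + vecBytes, pCSI2 + lineBytes);
}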
In principle we can do a little better using sli (Shift Left and Insert) to effectively merge the OR with one of the shifts:
out.val[1] = vsliq_n_u8(vshrq_n_u8(in.val[1], 4), in.val[2], 4);
out.val[2] = vsliq_n_u8(vshrq_n_u8(in.val[2], 4), in.val[1], 4);
But since it overwrites its source operand, we pay for it with a couple of extra movs: https://godbolt.org/z/TbzEEd1Pn. Clang allocates registers more cleverly and only needs one mov.
Another option, which could be slightly faster, is to use sra (Shift Right and Accumulate), which does an add instead of an insert. Since the relevant bits are already zero here, this has the same effect. Oddly, there is no sla.
out.val[1] = vsraq_n_u8(vshlq_n_u8(in.val[2], 4), in.val[1], 4);
out.val[2] = vsraq_n_u8(vshlq_n_u8(in.val[1], 4), in.val[2], 4);
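For reference, here's how those two lines would slot into the same loop as vectorized above (just a sketch; vectorized_sra is an illustrative name, and like the original it needs <arm_neon.h>):
void vectorized_sra(const uint8_t* pCSI2, uint8_t* pBE, const uint8_t* pCSI2LineEnd)
{
    while (pCSI2 < pCSI2LineEnd) {
        uint8x16x3_t in = vld3q_u8(pCSI2);
        uint8x16x3_t out;
        out.val[0] = in.val[0];
        // usra adds (src >> 4) into the accumulator; the accumulator's low
        // nibbles are zero after the left shift, so the add acts like an OR.
        out.val[1] = vsraq_n_u8(vshlq_n_u8(in.val[2], 4), in.val[1], 4);
        out.val[2] = vsraq_n_u8(vshlq_n_u8(in.val[1], 4), in.val[2], 4);
        vst3q_u8(pBE, out);
        pCSI2 += 48;
        pBE += 48;
    }
}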