I have populated a zmm register with an array of byte integers from 0-63. The numbers serve as indices into a matrix. Non-zero elements represent rows in the matrix that contain data. Not all rows will contain data.
I am looking for an instruction (or instructions) to do the same thing with a byte array in a zmm register as VPCOMPRESSQ does with a qword array in a zmm register.
Consider a situation where rows 3, 6, 7, 9 and 12 contain data and all the other rows (out of a total of 64 rows or fewer) are empty. Mask register k1 then has bits 3, 6, 7, 9 and 12 set; reading from bit 0, the pattern is 0001001101001 followed by zeros.
With the VPCOMPRESSQ instruction, I could use a mask like k1 to compress a zmm register of qwords so that the selected (non-zero) elements are packed contiguously from the start of the register, but there is no equivalent instruction for bytes.
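For qwords, the operation I want looks like this with intrinsics (just a sketch to illustrate; the wrapper name is mine):

#include <immintrin.h>

// VPCOMPRESSQ: the qwords selected by k are packed to the front of the
// result, the remaining qwords are zeroed.
__m512i CompressQwords(__m512i v, __mmask8 k) {
    return _mm512_maskz_compress_epi64(k, v);
}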
PSHUFB appears to be a candidate, but its shuffle control operand is a vector of byte indices, not a bit mask. I can build an index array like 0 0 3 0 0 6, etc., but that still leaves zero elements scattered in between, and I need the non-zero elements arranged contiguously from the start of a zmm register.
So my question is, given a zmm register with a byte array where k1 is set as shown above, how can I get the non-zero elements arranged contiguously from the beginning of the zmm register, like VPCOMPRESSQ does for qwords?
AVX512VBMI2 has VPCOMPRESSB, but that's only available in Ice Lake and later. What can be done on Skylake-AVX512 / Cascade Lake?
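For reference, on Ice Lake and later this is a one-liner with the AVX512-VBMI2 intrinsic (sketch; the wrapper name is mine):

// VPCOMPRESSB: byte-granularity compress, requires AVX512-VBMI2.
__m512i CompressBytes(__m512i v, __mmask64 k) {
    return _mm512_maskz_compress_epi8(k, v);
}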
Here's what I use as AVX2 fallbacks for both vpcompressb and vpexpandb. It's based on the solutions posted in this previous question.
Note that MaskDecompress_AVX2() may load past the end of the compressed input and fault if that load crosses into an unmapped page. If this is an issue, the input can be copied into a 64-byte temporary buffer first, like the temp buffer used in MaskCompress_AVX2().
#include <immintrin.h>
#include <cstdint>

void MaskCompress_AVX2(const uint8_t src[64], uint8_t* dest, uint64_t mask) {
    alignas(64) uint8_t temp[64];
    int j = 0;
    for (int i = 0; i < 64; i += 16) {
        // ABC... -> AAAABBBBCCCC...: replicate each of the 16 mask bits to fill a nibble
        uint64_t extended_mask = _pdep_u64(mask, 0x1111'1111'1111'1111) * 0xF;
        // pack the nibble indices of the selected elements towards the bottom
        uint64_t dest_idx_nib = _pext_u64(0xFEDC'BA98'7654'3210, extended_mask);
        // expand the packed nibbles into one byte index per element for pshufb
        __m128i dest_idx = _mm_cvtsi64_si128((int64_t)dest_idx_nib);
        dest_idx = _mm_unpacklo_epi8(dest_idx, _mm_srli_epi16(dest_idx, 4));
        dest_idx = _mm_and_si128(dest_idx, _mm_set1_epi8(15));
        // the 16-byte store to temp[j] never goes out of bounds because j <= i
        auto values = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)&src[i]), dest_idx);
        _mm_storeu_si128((__m128i*)&temp[j], values);
        j += _mm_popcnt_u32(mask & 0xFFFF);
        mask >>= 16;
    }
    __builtin_memcpy(dest, temp, (uint32_t)j);
}
void MaskDecompress_AVX2(const uint8_t* src, uint8_t dest[64], uint64_t mask) {
    for (int i = 0, j = 0; i < 64; i += 16) {
        // ABC... -> AAAABBBBCCCC...: replicate each of the 16 mask bits to fill a nibble
        uint64_t extended_mask = _pdep_u64(mask, 0x1111'1111'1111'1111) * 0xF;
        // scatter consecutive nibble indices to the positions of the selected elements
        uint64_t dest_idx_nib = _pdep_u64(0xFEDC'BA98'7654'3210, extended_mask);
        __m128i dest_idx = _mm_cvtsi64_si128((int64_t)dest_idx_nib);
        dest_idx = _mm_unpacklo_epi8(dest_idx, _mm_srli_epi16(dest_idx, 4));
        dest_idx = _mm_and_si128(dest_idx, _mm_set1_epi8(15));
        __m128i zero_mask = _mm_cvtsi64_si128((int64_t)~extended_mask);
        zero_mask = _mm_unpacklo_epi8(zero_mask, _mm_srli_epi16(zero_mask, 4));
        // shuffle_epi8 outputs zero for a lane when the high bit of its index byte is set
        dest_idx = _mm_or_si128(dest_idx, _mm_slli_epi16(zero_mask, 4));
        // this load may read past the end of the compressed data (but never past
        // src + 63, since j <= i); see the note above about page-crossing faults
        auto values = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)&src[j]), dest_idx);
        _mm_storeu_si128((__m128i*)&dest[i], values);
        j += _mm_popcnt_u32(mask & 0xFFFF);
        mask >>= 16;
    }
}
void MaskCompress_Scalar(const uint8_t src[64], uint8_t* dest, uint64_t mask) {
    for (uint32_t i = 0, j = 0; mask != 0; i++) {
        // copy unconditionally, but only advance the output index for set bits;
        // bytes from non-selected positions get overwritten by the next selected one
        dest[j] = src[i];
        j += (mask & 1);
        mask >>= 1;
    }
}
void MaskDecompress_Scalar(const uint8_t* src, uint8_t dest[64], uint64_t mask) {
    for (uint32_t i = 0, j = 0; i < 64; i++) {
        // src[j] is read even for clear bits, so this may read one byte past
        // the end of the compressed data
        uint8_t v = src[j];
        dest[i] = (mask & 1) ? v : 0;
        j += (mask & 1);
        mask >>= 1;
    }
}
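A minimal usage sketch of the fallbacks above (the 64-byte packed buffer and the way the mask is built here are only for illustration, and it assumes the functions above are in the same translation unit):

#include <cstdio>

int main() {
    uint8_t rows[64];
    for (int i = 0; i < 64; i++) rows[i] = (uint8_t)i;  // indices 0..63

    // rows 3, 6, 7, 9 and 12 contain data
    uint64_t mask = (1ull << 3) | (1ull << 6) | (1ull << 7) | (1ull << 9) | (1ull << 12);

    // packed is a full 64-byte buffer, so the over-read in MaskDecompress_AVX2
    // below stays inside the buffer and cannot fault
    uint8_t packed[64] = {};
    MaskCompress_AVX2(rows, packed, mask);
    int n = (int)_mm_popcnt_u64(mask);
    for (int i = 0; i < n; i++) printf("%d ", packed[i]);  // prints: 3 6 7 9 12
    printf("\n");

    uint8_t expanded[64];
    MaskDecompress_AVX2(packed, expanded, mask);  // inverse: zeros where mask bits are clear
    return 0;
}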
Some benchmarks from my Tiger Lake CPU (masks are generated at random):
ns/op | op/s | err% | total | benchmark |
---|---|---|---|---|
2.69 | 371,063,085.72 | 1.9% | 0.04 | Compress AVX512 |
10.22 | 97,820,644.12 | 1.2% | 0.17 | Compress AVX2 |
24.36 | 41,053,932.84 | 0.7% | 0.39 | Compress Scalar |
1.09 | 918,672,592.02 | 1.2% | 0.03 | Decompress AVX512 |
6.44 | 155,358,119.20 | 0.5% | 0.11 | Decompress AVX2 |
24.83 | 40,274,581.73 | 0.7% | 0.40 | Decompress Scalar |
compressstore (vpcompressb with a memory destination) is apparently twice as slow as compress + store (compress into a register, then store it separately), so it might be a good idea to avoid it when possible.
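In intrinsic terms, for the Ice Lake+ path that means preferring the first helper below over the second (a sketch; the helper names are mine):

// Compress into a register, then do a plain store. Note this writes a full
// 64 bytes, so dest needs room for 64 bytes regardless of popcount(k).
void CompressStoreBytes_Fast(uint8_t* dest, __m512i v, __mmask64 k) {
    __m512i packed = _mm512_maskz_compress_epi8(k, v);  // vpcompressb zmm{z}, k, zmm
    _mm512_storeu_si512(dest, packed);
}

// Single-instruction alternative: vpcompressb with a memory destination,
// which only writes popcount(k) bytes but is reportedly slower.
void CompressStoreBytes_Slow(uint8_t* dest, __m512i v, __mmask64 k) {
    _mm512_mask_compressstoreu_epi8(dest, k, v);
}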