Tags: simd, intrinsics, avx2, avx512

Extract non-zero elements from a __m512i/__m256i vector


Given a __m512i vector that contains 64 char elements:

index: 0,  1, 2, 3, 4,  5, 6, 7, 8, 9, 10,...
value: 1, -1, 1, 0, 0, -1, 1, 1, 0, 0, 1,...

(Note: each element's value is in [-1, 1], i.e. -1, 0, or 1.)

Is there any elegant way to extract all non-zero elements and pack them into another __m512i vector like this:

expected output: 1, -1, 1, -1, 1, 1, 1,...

My naive approach is: non_zero_mask = _mm512_test_epi8_mask(X, X);, then traversing the mask with a while loop, adding each element to the new vector one by one (yes, it's relatively slow).
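
For reference, a minimal sketch of that naive mask-walk (buffer handling and names are illustrative, not from the original post):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    // Naive approach: store the vector, then pull out one byte per set mask bit.
    size_t pack_nonzero_naive(__m512i x, int8_t dst[64]) {
        int8_t src[64];
        _mm512_storeu_si512(src, x);
        uint64_t mask = _mm512_test_epi8_mask(x, x);    // bit i set iff byte i != 0
        size_t n = 0;
        while (mask) {
            unsigned idx = (unsigned)_tzcnt_u64(mask);  // index of lowest set bit
            dst[n++] = src[idx];
            mask = _blsr_u64(mask);                     // clear lowest set bit
        }
        return n;  // count of non-zero elements written to dst
    }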


Solution

  • AVX-512VBMI2 (Ice Lake and later) has vpcompressb to left-pack according to a mask (such as yours from _mm512_test_epi8_mask(X, X)). It costs 2 uops (both for port 5 on Intel), but is still far better than anything you could do without it.

    Before that, only dword and qword element sizes were supported (in AVX-512F). My AVX512 answer on AVX2 what is the most efficient way to pack left based on a mask? shows how to use the ps version; a byte version works the same way: vpcompressb into a ZMM register, then do partially-overlapping 64-byte stores, incrementing the pointer by _mm_popcnt_u64(mask), as sketched below. The memory-destination version of vpcompressb/w/d/q is slow, especially on Zen 4, so just leave room in the destination for a full 64-byte store.
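
    A minimal sketch of that store-and-advance pattern, assuming AVX-512BW + AVX-512VBMI2 and a destination buffer with at least 64 bytes of slack past the packed data:

        #include <immintrin.h>
        #include <stdint.h>

        // Left-pack the non-zero bytes of x to the front and store; returns
        // the advanced output pointer. The full 64-byte store may overlap
        // data written by the next call, which is fine as long as dst always
        // has room for a whole vector.
        int8_t *pack_nonzero_vbmi2(__m512i x, int8_t *dst) {
            __mmask64 nonzero = _mm512_test_epi8_mask(x, x);          // vptestmb
            __m512i packed = _mm512_maskz_compress_epi8(nonzero, x);  // vpcompressb
            _mm512_storeu_si512(dst, packed);
            return dst + _mm_popcnt_u64(nonzero);  // keep only the packed prefix
        }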

    (My AVX2 answer there uses BMI2 pdep on 8-byte integers to create shuffle masks, but that won't work for elements narrower than 32-bit. Denis Yaroshevskiy's answer has experiments on Coffee Lake with various element sizes, including 8-bit char, using a SIMD compare and iterating over the set bits in a mask, with benchmarks of performance vs. the fraction of elements removed.)


    Without a compress instruction, left-packing is indeed hard; that's why it's a valuable primitive operation to have as a building block.

    Without AVX-512VBMI2, depending on your data density, you might consider unpacking bytes to dwords for vpcompressd and using vpmovdb to narrow back to 8-bit before storing, as in the sketch below.
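
    A sketch of that fallback, assuming only AVX-512F (plus POPCNT), processing one 16-byte chunk per call. vpmovzxbd then vpmovdb round-trips the low byte unchanged, so -1 (0xFF) survives:

        #include <immintrin.h>
        #include <stdint.h>

        // Widen 16 bytes to dwords, compress with vpcompressd, narrow back
        // with vpmovdb, store, and advance by the popcount. dst needs 16
        // bytes of slack for the full-width store.
        int8_t *pack16_avx512f(const int8_t *src, int8_t *dst) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)src);
            __m512i widened = _mm512_cvtepu8_epi32(chunk);           // vpmovzxbd
            __mmask16 nonzero = _mm512_test_epi32_mask(widened, widened);
            __m512i packed = _mm512_maskz_compress_epi32(nonzero, widened);
            _mm_storeu_si128((__m128i *)dst, _mm512_cvtepi32_epi8(packed));  // vpmovdb
            return dst + _mm_popcnt_u32(nonzero);
        }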

    Maybe test 64 bytes at once with vptestmb and use 3x kshiftrq to make mask inputs for the four compresses? vptestmb and kshift are both port-5-only on Intel, competing with vpcompress's 2 p5 uops, but Zen 4 can run kshiftrq on other ports (https://uops.info/). However, if you're widening the data as you load it, like with vpmovzxbd, you'd never have all 64 bytes in a single vector register, so you'd want _mm512_test_epi32_mask on each widened vector separately rather than spending even more shuffles to extract e.g. the second 128-bit lane of a __m512i. Avoiding those extra shuffles would take vpermb, which requires AVX-512VBMI, introduced on the same CPUs as VBMI2 (https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512). If the data is coming from memory, though, you can test once and reload each chunk with a widening load, as in the sketch below.
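
    A sketch of that test-once idea, assuming AVX-512BW and data in memory, so each 16-byte chunk is reloaded with a widening load instead of being shuffled out of a ZMM register (the mask-slice extraction is where a compiler would use kshiftrq):

        #include <immintrin.h>
        #include <stdint.h>

        // One vptestmb for all 64 bytes, then four widen+compress+narrow
        // steps using 16-bit slices of the mask. dst needs 16 bytes of slack.
        int8_t *pack64_test_once(const int8_t *src, int8_t *dst) {
            __m512i bytes = _mm512_loadu_si512(src);
            uint64_t mask = _mm512_test_epi8_mask(bytes, bytes);  // vptestmb
            for (int i = 0; i < 4; i++) {
                __mmask16 m = (__mmask16)(mask >> (16 * i));      // kshiftrq + kmov in asm
                __m512i widened = _mm512_cvtepu8_epi32(
                    _mm_loadu_si128((const __m128i *)(src + 16 * i)));  // vpmovzxbd
                __m512i packed = _mm512_maskz_compress_epi32(m, widened);
                _mm_storeu_si128((__m128i *)dst, _mm512_cvtepi32_epi8(packed));
                dst += _mm_popcnt_u32(m);
            }
            return dst;
        }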


    I don't think the simplicity of your condition (non-zero) opens up any good possibilities for doing this a different way, e.g. with shuffles and blends or something.