Given a `__m512i` vector containing 64 `char` elements:

index: 0,  1, 2, 3, 4,  5, 6, 7, 8, 9, 10, ...
value: 1, -1, 1, 0, 0, -1, 1, 1, 0, 0,  1, ...

(Note: each element's value is in [-1, 1].)
Is there any elegant way to extract all non-zero elements and pack them into another `__m512i` vector, like this?

expected output: 1, -1, 1, -1, 1, 1, 1, ...
My naive approach is `non_zero_mask = _mm512_test_epi8_mask(X, X);`, then traversing the mask with a `while` loop to add each element to the new vector one by one (yes, it's relatively slow).
AVX-512VBMI2 (Ice Lake and later) has `vpcompressb` to left-pack according to a mask (such as yours from `_mm512_test_epi8_mask(X,X)`). It costs 2 uops (both for port 5 on Intel) but is still far better than anything you could do without it.
Before that, only dword and qword element sizes were supported, in AVX-512F. My AVX-512 answer on AVX2 what is the most efficient way to pack left based on a mask? shows how to use the `ps` version; a byte version works the same way: `vpcompressb` into a ZMM register, then do partially-overlapping 64-byte stores, incrementing the pointer by `_popcnt_u64(mask)`. The memory-destination version of `vpcompressb/w/d/q` is slow, especially on Zen 4, so just leave room in the destination for a full 64-byte store.
(My AVX2 answer there uses BMI2 `pdep` on 8-byte integers to create shuffle masks, but that won't work for elements narrower than 32-bit. Denis Yaroshevskiy's answer has some experiments on Coffee Lake with various element sizes including 8-bit `char`: SIMD compare and iterating over the set bits in a mask, with benchmarks of performance vs. fraction of elements removed.)
Without a `compress` instruction, left-packing is indeed hard; that's why it's such a valuable primitive to have as a building block.
Depending on your data density, without AVX-512VBMI2 you might consider unpacking bytes to dwords for `vpcompressd`, then using `vpmovdb` to narrow back to 8-bit elements before storing.
Maybe `test`ing 64 bytes at once and using 3x `kshiftrq` to make inputs for the next 4 compresses? `vptestmd` and `kshift` are both port-5-only on Intel, competing with `vpcompress`'s 2 p5 uops, but Zen 4 can run `kshiftrq` on different ports (https://uops.info/). But if you're widening the data as you load it, e.g. with `vpmovzxbd`, you'd never have the 64 bytes in a single vector register, so yeah, you'd want to `_mm512_test_epi32_mask` each vector separately instead of spending even more shuffles to widen e.g. the second 128-bit lane of a `__m512i`; that would take `vpermb`, which requires AVX-512VBMI, introduced on the same CPUs as VBMI2. (https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512)
I don't think the simplicity of your condition (non-zero) opens up any good possibilities for doing this a different way, e.g. with shuffles and blends or something.