There is no intrinsic for an __m512 packed bit test (like _mm512_testz_si512).
What's the best way to do it?
_mm512_test_epi32_mask(v,v) == 0 is the drop-in replacement.
Test-into-mask and then test the mask to get a scalar bool you can branch on or whatever. The element size of the test doesn't matter if you only care about whether the whole vector has a non-zero bit anywhere, but element sizes of 8/16/32/64 are available (asm manual / intrinsics guide).
You can also just use the mask as a 0 or non-zero integer if you don't want to branch on it right away and don't need to convert it to a bool, or if you want to know where the set bits are (bit-scan or popcount). Or use it to zero-mask or merge-mask other AVX-512 operations.
__mmask16 mask = _mm512_test_epi32_mask(v,v);  // 0 or non-zero integer
if (mask != 0) {  // __mmask16 is in practice an alias for uint16_t
    // You might have further use for the mask, e.g.
    int first_match_index = std::countr_zero(mask);  // C++20 <bit>
}
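If all you want is the testz-style bool, a minimal sketch of a drop-in helper (the function name is mine, not a real intrinsic) could be:

#include <immintrin.h>

// Hypothetical helper: like _mm256_testz_si256(v,v) but for 512-bit vectors.
// Returns 1 if no bit is set anywhere in v.
static inline int mm512_testz(__m512i v) {
    return _mm512_test_epi32_mask(v, v) == 0;
}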
In asm, the test/branch or getting a GPR integer could look like this:
vptestmd k0, zmm1, zmm1 ; mask of elements where zmm1&zmm1 was non-zero.
; branch on it. Or a compiler might use cmovz or setz (create an actual bool)
kortestw k0, k0 ; set integer FLAGS according to k0|k0
jz vec_was_all_zero ; branch if ZF==1
; or get a 0 / non-0 int you can return, or bit-scan to find the first non-zero element
kmovw eax, k0
Or depending on what you want to do with the mask, _mm512_testn_epi32_mask(v,v) to get NAND instead of AND: testn(v,v) == ~test(v,v). But if you just want to test the mask, you could do _mm512_test_epi32_mask(v,v) == 0xFFFF to check that all 16 elements had a non-zero bit, instead of checking that the testn result was 0. Actually, compilers are bad at this; you need to use _kortestc_mask16_u8(msk,msk) (intrinsics guide) instead of msk == 0xFFFF to get compilers to make efficient asm (Godbolt).

kortest sets the carry flag if the OR result is all-ones, so you can test for all-set as cheaply as for all-clear, for any mask width. That makes this efficiently possible even without an immediate operand, unlike AVX2 _mm256_movemask_epi8(v) == -1 where a compiler would cmp eax, -1, which is slightly larger code-size than test eax,eax.
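A sketch of that all-set check (the wrapper name is mine; _kortestc_mask16_u8 returns the carry flag, i.e. 1 when the OR of its two mask args is all-ones):

// Hypothetical wrapper: true if every dword element has at least one set bit.
static inline int all_elements_nonzero(__m512i v) {
    __mmask16 msk = _mm512_test_epi32_mask(v, v);
    return _kortestc_mask16_u8(msk, msk);  // CF=1 when msk|msk == 0xFFFF
}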
So it mostly matters to avoid inverting the mask before countr_zero or whatever; branching can still be done without needing a kmov to a GPR first, unless you leave it up to current compilers.
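For instance, to find the first all-zero element, testn gives you the right mask directly, with no inversion needed (a sketch, assuming C++20 <bit> for std::countr_zero):

__mmask16 zeros = _mm512_testn_epi32_mask(v, v);  // set where element == 0
if (zeros != 0) {
    int first_zero_index = std::countr_zero(zeros);
    // ...
}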
AVX-512 compares and tests are only available with a mask register as a destination (k0-k7), kind of like a compare + vpmovmskb rolled into one single-uop instruction. (That's _mm256_movemask_epi8, or the ps/pd versions for floats. The AVX-512 versions of those, extracting the high bit of each element, are vpmovd2m / _mm512_movepi32_mask, available for every element size including 16-bit, e.g. to grab the sign bits of ints or floats.)
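For example, grabbing the sign bits of 16 floats as a bitmask could look like this (a sketch; vpmovd2m needs AVX512DQ, and the helper name is mine):

// Hypothetical helper: mask of elements whose sign bit is set
// (negative, -0.0, or a NaN with the sign bit set).
static inline __mmask16 float_sign_bits(__m512 v) {
    return _mm512_movepi32_mask(_mm512_castps_si512(v));
}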
After you get a mask, there are two instructions for setting integer FLAGS conditions based on a k register: kortest (set FLAGS according to a bitwise OR of 2 masks, or a mask with itself), and AVX512DQ/BW ktest (the same, but with AND of 2 masks).
So you can actually test two vectors at once for having any non-zero elements, like this:

__mmask16 mask1 = _mm512_test_epi32_mask(v1,v1);
__mmask16 mask2 = _mm512_test_epi32_mask(v2,v2);
// or any other condition you want to check, like _mm512_cmple_epu32_mask(x,y)
if (mask1 | mask2) {
    // At least one was non-zero; sort out which if it matters.
    // Or maybe concatenate them (e.g. kunpckwd) and bit-scan the 32-bit mask
    // to find an element index, maybe into memory they were loaded from
}
This would compile to 2x vptestmd and 1x kortestw. That's the same number of uops as vector OR + one vptestmd + kortest in this case; being able to check for any set bits in either of two masks is maybe useful with more complicated compares, like for exact equality.
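The concatenate-and-bit-scan idea from the comment above might look like this (a sketch, assuming AVX512BW for kunpckwd; _mm512_kunpackw puts its second argument in the low 16 bits of the result):

__mmask32 both = _mm512_kunpackw(mask2, mask1);  // mask1 in bits 0..15, mask2 in bits 16..31
if (both != 0) {
    int idx = std::countr_zero(both);  // element index 0..31 across v1 then v2
}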
SSE4 / AVX ptest into integer FLAGS was always 2 uops on mainstream Intel CPUs anyway (https://uops.info/). Intrinsics like _mm256_testz_si256 expose various FLAGS conditions you can check, in this case ZF==1, getting the compiler to emit an instruction like jz, jnz, cmovz ecx, edx, or setz al, depending on how you use the resulting bool.
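For comparison, the AVX2 counterpart of what the question asked for is just (sketch; the function name is mine):

// AVX2: compiles to vptest + setz / jz, depending on how the bool is used.
static inline int vec256_is_all_zero(__m256i v) {
    return _mm256_testz_si256(v, v);  // 1 if (v & v) == 0, i.e. ZF==1
}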
One of the benefits of legacy-SSE ptest (not overwriting a source register) doesn't exist with AVX 3-operand instructions, but it was still occasionally useful for testing an AND or ANDN result when the input vectors weren't compare results or other all-0 / all-1 masks. (compare + ptest + jcc is worse than compare / pmovmskb / macro-fused test+jcc, which is 3 total uops.)
AVX-512 is heavily designed around per-element masking. So for example, instead of just widening _mm256_xor_si256 to 512, we have _mm512_xor_epi32 or _mm512_xor_epi64 as the no-mask version of _mm512_maskz_xor_epi32. Similarly, the AVX-512 version of ptest is now a per-element thing into a mask register. Other than scalar FP compares into EFLAGS like vucomisd, AVX-512 regularized things so compares/tests always go into mask registers, not EFLAGS like ptest or general-purpose registers like pmovmskb.
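As a sketch of that masking style, a test mask can feed a merge-masked operation directly (the same pattern as the increment-if-non-zero link below):

// Sketch: add 1 only to the elements that are non-zero, leaving zeros unchanged.
__mmask16 nz = _mm512_test_epi32_mask(v, v);
__m512i incremented = _mm512_mask_add_epi32(v, nz, v, _mm512_set1_epi32(1));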
Related:
How to do AVX-512 integer increment only if element is non zero shows an example of vptestmw (_mm512_test_epi16_mask(v,v)) to get a mask of the elements that are non-zero.
Missing AVX-512 intrinsics for masks? - since I mentioned mask vector operations like kunpckwd. Normally it's up to the compiler to decide whether to kmov masks to integer regs (and back if you use them as AVX-512 masks again).