There is no intrinsic for an __m512 packed bit test (like _mm512_testz_si512).
What's the best way to do it?
_mm512_test_epi32_mask(v,v) == 0 is the drop-in replacement.
Test-into-mask and then test the mask to get a scalar bool you can branch on or whatever. The element size of the test doesn't matter if you only care about whether the whole vector has a non-zero bit anywhere, but element sizes of 8/16/32/64 are available (asm manual / intrinsics guide).
You can also just use the mask as a 0 or non-zero integer if you don't want to branch on it right away and don't need to convert it to a bool, or if you want to know where the set bits are (bit-scan or popcount). Or use it to zero-mask or merge-mask other AVX-512 operations.
__mmask16 mask = _mm512_test_epi32_mask(v,v);  // 0 or non-zero integer
if (mask != 0) {  // __mmask16 is in practice an alias for uint16_t
    // You might have further use for the mask, e.g.
    int first_match_index = std::countr_zero(mask);  // C++20 <bit>
}
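If all you want is the testz-style bool, a minimal sketch of a drop-in helper (the function name is mine, not a real intrinsic) could be:

#include <immintrin.h>

// Hypothetical helper: like _mm256_testz_si256(v,v) but for 512-bit vectors.
// Returns 1 if no bit is set anywhere in v.
static inline int mm512_testz(__m512i v) {
    return _mm512_test_epi32_mask(v, v) == 0;
}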
In asm, the test/branch or getting a GPR integer could look like this:
vptestmd k0, zmm1, zmm1 ; mask of elements where zmm1&zmm1 was non-zero.
; branch on it. Or a compiler might use cmovz or setz (create an actual bool)
kortestw k0, k0 ; set integer FLAGS according to k0|k0
jz vec_was_all_zero ; branch if ZF==1
; or get a 0 / non-0 int you can return, or bit-scan to find the first non-zero element
kmovw eax, k0
Or depending on what you want to do with the mask, _mm512_testn_epi32_mask(v,v) to get NAND instead of AND: testn(v,v) == ~test(v,v). But if you just want to test the mask, you could do _mm512_test_epi32_mask(v,v) == 0xFFFF to check that all 16 elements had a non-zero bit, instead of checking that the testn result was 0. Actually, compilers are bad at this; you need to use _kortestc_mask16_u8(msk,msk) (intrinsics guide) instead of msk == 0xFFFF to get compilers to make efficient asm (Godbolt).

kortest sets the carry flag if the OR result is all-ones, so you can test for all-set as cheaply as for all-clear, for any mask width. That makes this efficiently possible even without an immediate operand, unlike AVX2 _mm256_movemask_epi8(v) == -1 where a compiler would cmp eax, -1, which is slightly larger code-size than test eax,eax.
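A sketch of that all-set check (the wrapper name is mine; _kortestc_mask16_u8 returns the carry flag, i.e. 1 when the OR of its two mask args is all-ones):

// Hypothetical wrapper: true if every dword element has at least one set bit.
static inline int all_elements_nonzero(__m512i v) {
    __mmask16 msk = _mm512_test_epi32_mask(v, v);
    return _kortestc_mask16_u8(msk, msk);  // CF=1 when msk|msk == 0xFFFF
}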
So it mostly matters to avoid inverting the mask before countr_zero or whatever; branching can still be done without needing a kmov to a GPR first, unless you leave it up to current compilers.
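For instance, to find the first all-zero element, testn gives you the right mask directly, with no inversion needed (a sketch, assuming C++20 <bit> for std::countr_zero):

__mmask16 zeros = _mm512_testn_epi32_mask(v, v);  // set where element == 0
if (zeros != 0) {
    int first_zero_index = std::countr_zero(zeros);
    // ...
}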
AVX-512 compares and tests are only available with a mask register as a destination (k0-k7), kind of like a compare + vpmovmskb rolled into one single-uop instruction. (That's _mm256_movemask_epi8, or the ps/pd versions for floats. The AVX-512 versions of those, extracting the high bit of each element, are vpmovd2m / _mm512_movepi32_mask, available for every element size including 16-bit, e.g. to grab the sign bits of ints or floats.)
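For example, grabbing the sign bits of 16 floats as a bitmask could look like this (a sketch; vpmovd2m needs AVX512DQ, and the helper name is mine):

// Hypothetical helper: mask of elements whose sign bit is set
// (negative, -0.0, or a NaN with the sign bit set).
static inline __mmask16 float_sign_bits(__m512 v) {
    return _mm512_movepi32_mask(_mm512_castps_si512(v));
}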
After you get a mask, there are two instructions for setting integer FLAGS conditions based on a k register: kortest (set FLAGS according to a bitwise OR of 2 masks, or a mask with itself), and AVX512DQ/BW ktest (the same, but with AND of 2 masks).
So you can actually test two vectors at once for having any non-zero elements, like this:

__mmask16 mask1 = _mm512_test_epi32_mask(v1,v1);
__mmask16 mask2 = _mm512_test_epi32_mask(v2,v2);
// or any other condition you want to check, like _mm512_cmple_epu32_mask(x,y)
if (mask1 | mask2) {
    // At least one was non-zero; sort out which if it matters.
    // Or maybe concatenate them (e.g. kunpckwd) and bit-scan the 32-bit mask
    // to find an element index, maybe into memory they were loaded from
}
This would compile to 2x vptestmd and 1x kortestw. That's the same number of uops as vector OR + one vptestmd + kortest in this case; being able to check for any set bits in either of two masks is maybe useful with more complicated compares, like for exact equality.
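The concatenate-and-bit-scan idea from the comment above might look like this (a sketch, assuming AVX512BW for kunpckwd; _mm512_kunpackw puts its second argument in the low 16 bits of the result):

__mmask32 both = _mm512_kunpackw(mask2, mask1);  // mask1 in bits 0..15, mask2 in bits 16..31
if (both != 0) {
    int idx = std::countr_zero(both);  // element index 0..31 across v1 then v2
}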
SSE4 / AVX ptest into integer FLAGS was always 2 uops on mainstream Intel CPUs anyway (https://uops.info/). Intrinsics like _mm256_testz_si256 expose various FLAGS conditions you can check, in this case ZF==1, getting the compiler to emit an instruction like jz, jnz, cmovz ecx, edx, or setz al, depending on how you use the resulting bool.
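For comparison, the AVX2 counterpart of what the question asked for is just (sketch; the function name is mine):

// AVX2: compiles to vptest + setz / jz, depending on how the bool is used.
static inline int vec256_is_all_zero(__m256i v) {
    return _mm256_testz_si256(v, v);  // 1 if (v & v) == 0, i.e. ZF==1
}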
One of the benefits of legacy-SSE ptest (not overwriting a source register) doesn't exist with AVX 3-operand instructions, but it was still occasionally useful for testing an AND or ANDN result when the input vectors weren't compare results or other all-0 / all-1 masks. (compare + ptest + jcc is worse than compare / pmovmskb / macro-fused test+jcc, which is 3 total uops.)
AVX-512 is heavily designed around per-element masking. So for example, instead of just widening _mm256_xor_si256 to 512, we have _mm512_xor_epi32 or _mm512_xor_epi64 as the no-mask version of _mm512_maskz_xor_epi32. Similarly, the AVX-512 version of ptest is now a per-element thing into a mask register. Other than scalar FP compares into EFLAGS like vucomisd, AVX-512 regularized things so compares/tests always go into mask registers, not EFLAGS like ptest or general-purpose registers like pmovmskb.
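As a sketch of that masking style, a test mask can feed a merge-masked operation directly (the same pattern as the increment-if-non-zero link below):

// Sketch: add 1 only to the elements that are non-zero, leaving zeros unchanged.
__mmask16 nz = _mm512_test_epi32_mask(v, v);
__m512i incremented = _mm512_mask_add_epi32(v, nz, v, _mm512_set1_epi32(1));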
Related:
How to do AVX-512 integer increment only if element is non zero shows an example of vptestmw (_mm512_test_epi16_mask(v,v)) to get a mask of the elements that are non-zero.
Missing AVX-512 intrinsics for masks? - since I mentioned mask vector operations like kunpckwd. Normally it's up to the compiler to decide whether to kmov masks to integer regs (and back if you use them as AVX-512 masks again).