assembly x86 sse intrinsics sse4

Can PTEST be used to test if two registers are both zero or some other condition?


What can you do with SSE4.1 ptest other than testing if a single register is all-zero?

Can you use a combination of ZF and CF to test anything useful about two unknown input registers?

What is PTEST good for? You might expect it to be good for checking the result of a packed compare (like PCMPEQD or CMPPS), but at least on Intel CPUs, compare-and-branch costs more uops with PTEST + JCC than with PMOVMSKB / MOVMSKPS / MOVMSKPD + macro-fused CMP+JCC.

See also Checking if TWO SSE registers are not both zero without destroying them


Solution

  • No, unless I'm missing something clever, ptest with two unknown registers is generally not useful for checking some property about both of them. (Other than obvious stuff you'd already want a bitwise-AND for, like intersection between two bitmaps).

    To test two registers for both being all-zero, OR them together and PTEST that against itself.
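A minimal sketch of that OR-then-PTEST idea with intrinsics. The function name both_all_zero is hypothetical, and the target attribute is a GCC/Clang extension used here only so the snippet builds without -msse4.1; this is one way to write it, not the only one.

```c
#include <immintrin.h>
#include <stdbool.h>

/* Assumption: GCC or Clang on x86. The attribute enables SSE4.1 for this
   function only; building with -msse4.1 would work as well. */
__attribute__((target("sse4.1")))
static bool both_all_zero(__m128i a, __m128i b)
{
    /* POR: any bit set in either input survives into the result. */
    __m128i either = _mm_or_si128(a, b);
    /* PTEST either, either: ZF is set only if (either & either) == 0. */
    return _mm_testz_si128(either, either);
}
```

Without AVX, the compiler will likely need an extra MOVDQA to keep both inputs intact, since POR overwrites its destination.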


    ptest xmm0, xmm1 produces two results:

      • ZF = is xmm0 AND xmm1 all-zero?
      • CF = is (NOT xmm0) AND xmm1 all-zero?

    If the second vector is all-zero, the flags don't depend at all on the bits in the first vector.

    It may be useful to think of the "is-all-zero" checks as a NOT(bitwise horizontal-OR()) of the AND and ANDNOT results. But probably not, because that's too many steps for my brain to think through easily. That sequence of vertical-AND and then horizontal-OR does maybe make it easier to understand why PTEST doesn't tell you much about a combination of two unknown registers, just like the integer TEST instruction.

    Here's a truth table for a 2-bit ptest a,mask. Hopefully this helps in thinking about mixes of zeros and ones with 128b inputs.

    Note that CF(a,mask) == ZF(~a,mask).

    a    mask     ZF    CF
    00   00       1     1
    01   00       1     1
    10   00       1     1
    11   00       1     1
    
    00   01       1     0
    01   01       0     1
    10   01       1     0
    11   01       0     1
    
    00   10       1     0
    01   10       1     0
    10   10       0     1
    11   10       0     1
    
    00   11       1     0
    01   11       0     0
    10   11       0     0
    11   11       0     1
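
The table can be reproduced with a scalar model of the two flag computations. This is only an illustrative sketch: ptest_flags and ptest_model are hypothetical names, and the model works on small integers rather than 128-bit vectors.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type holding the two flag results. */
typedef struct { bool zf, cf; } ptest_flags;

/* Scalar model of PTEST a, mask for narrow bit patterns. */
static ptest_flags ptest_model(uint8_t a, uint8_t mask)
{
    ptest_flags f;
    f.zf = (a & mask) == 0;    /* no bit is set in both a and mask */
    f.cf = (~a & mask) == 0;   /* every bit set in mask is also set in a */
    return f;
}
```

For example, ptest_model(1, 3) corresponds to the row a=01, mask=11, where both flags are 0: the masked bits of a are a mix of ones and zeros.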
    

    Intel's intrinsics guide lists 2 interesting intrinsics for it. Note the naming of the args: a and mask are a clue that they tell you about the parts of a selected by a known AND-mask.

      • _mm_test_mix_ones_zeros: returns (ZF == 0 && CF == 0)
      • _mm_test_all_zeros: returns ZF

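    There are also the more simply-named versions:

      • int _mm_testc_si128 (__m128i a, __m128i b): returns CF
      • int _mm_testnzc_si128 (__m128i a, __m128i b): returns (ZF == 0 && CF == 0)
      • int _mm_testz_si128 (__m128i a, __m128i b): returns ZF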

    There are AVX __m256i versions of those intrinsics (e.g. _mm256_testz_si256), but the guide only lists the all_zeros and mix_ones_zeros alternate-name versions for __m128i operands.

    If you want to test some other condition from C or C++, you should use testc and testz with the same operands, and hope that your compiler realizes that it only needs to do one PTEST, and hopefully even use a single JCC, SETCC, or CMOVCC to implement your logic. (I'd recommend checking the asm, at least for the compiler you care about most.)
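As a sketch of that approach, the function below calls testz and testc on the same operands and combines the results. The enum and function names are hypothetical, and whether a given compiler emits a single PTEST for both calls is something to verify in the generated asm rather than assume.

```c
#include <immintrin.h>

/* Hypothetical classification of the bits of a that mask selects. */
enum mask_class { MASKED_ALL_ZERO, MASKED_ALL_ONE, MASKED_MIXED };

/* Assumption: GCC or Clang; the attribute stands in for -msse4.1. */
__attribute__((target("sse4.1")))
static enum mask_class classify_masked_bits(__m128i a, __m128i mask)
{
    int zf = _mm_testz_si128(a, mask);   /* 1 if (a & mask) == 0 */
    int cf = _mm_testc_si128(a, mask);   /* 1 if (~a & mask) == 0 */
    if (zf) return MASKED_ALL_ZERO;      /* also taken when mask itself is zero */
    if (cf) return MASKED_ALL_ONE;
    return MASKED_MIXED;                 /* same condition as _mm_testnzc_si128 */
}
```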


    Note that _mm_testz_si128(v, set1(0xff)) is always the same as _mm_testz_si128(v,v), because that's how AND works. But that's not true for the CF result.
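The difference between the two flags can be seen with four small wrappers (hypothetical names, written only for illustration). The ZF pair always agrees, while testc(v, v) is 1 for every v because (~v & v) is always zero.

```c
#include <immintrin.h>

/* Assumption: GCC or Clang; the attribute stands in for -msse4.1. */
#define SSE41 __attribute__((target("sse4.1")))

/* ZF results: identical for any v, since (v & v) == (v & all-ones) == v. */
SSE41 static int testz_self(__m128i v) { return _mm_testz_si128(v, v); }
SSE41 static int testz_ones(__m128i v) { return _mm_testz_si128(v, _mm_set1_epi32(-1)); }

/* CF results: testc_self is always 1; testc_ones is 1 only for all-ones v. */
SSE41 static int testc_self(__m128i v) { return _mm_testc_si128(v, v); }
SSE41 static int testc_ones(__m128i v) { return _mm_testc_si128(v, _mm_set1_epi32(-1)); }
```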

    You can check for a vector being all-ones using

    bool is_all_ones = _mm_testc_si128(v, _mm_set1_epi8(0xff));
    

    This is probably no faster (though smaller in code size) than a _mm_cmpeq_epi8 against a vector of all-ones followed by the usual _mm_movemask_epi8() == 0xffff, at least if you're branching on the result: the scalar cmp then macro-fuses with the jcc, so both ways are 3 total uops including the (cmp)/jcc. The exception is Zen 1 and Zen 2, where ptest is only 1 uop instead of the 2 it costs on other CPUs, so it has an advantage there (see https://uops.info/). Either way, it doesn't avoid the need for a vector constant in this case.
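For comparison, the compare-and-movemask alternative might look like the sketch below. It needs only SSE2; the function name is_all_ones_sse2 is hypothetical.

```c
#include <emmintrin.h>
#include <stdbool.h>

static bool is_all_ones_sse2(__m128i v)
{
    /* PCMPEQB: each byte of eq becomes 0xFF where v's byte equals 0xFF. */
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8(-1));
    /* PMOVMSKB gathers the 16 byte sign bits; all 16 set means all-ones. */
    return _mm_movemask_epi8(eq) == 0xffff;
}
```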

    PTEST does have the advantage that it doesn't destroy either input operand, even without AVX.