What can you do with SSE4.1 ptest other than testing if a single register is all-zero?
Can you use a combination of ZF and CF to test anything useful about two unknown input registers?
What is PTEST good for? You'd think it would be good for checking the result of a packed-compare (like PCMPEQD or CMPPS), but at least on Intel CPUs, it costs more uops to compare-and-branch using PTEST + JCC than with PMOVMSK(B/PS/PD) + macro-fused CMP+JCC.
See also Checking if TWO SSE registers are not both zero without destroying them
No, unless I'm missing something clever, ptest with two unknown registers is generally not useful for checking some property about both of them. (Other than obvious stuff you'd already want a bitwise-AND for, like intersection between two bitmaps.)
To test two registers for both being all-zero, OR them together and PTEST that against itself.
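A minimal sketch of that OR-then-PTEST idiom with intrinsics. The function name both_all_zero and the SSE41_FN macro are made up for illustration; the macro only exists so GCC/Clang can build this without -msse4.1, and an x86-64 target is assumed.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_testz_si128 */

/* GCC/Clang extension: enable SSE4.1 for just this function.
   Other compilers: the macro expands to nothing. */
#ifdef __GNUC__
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

/* Returns 1 if every bit of both a and b is zero, else 0.
   Neither input is modified. */
SSE41_FN static int both_all_zero(__m128i a, __m128i b)
{
    __m128i any = _mm_or_si128(a, b);  /* POR: bits set in either input */
    return _mm_testz_si128(any, any);  /* PTEST any,any: ZF = (any == 0) */
}
```

This costs a POR plus the PTEST, and the OR result needs a temporary register unless one of the inputs can be overwritten.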
ptest xmm0, xmm1 produces two results:
ZF = is xmm0 & xmm1 all-zero?
CF = is (~xmm0) & xmm1 all-zero? (1 if all the set bits in XMM1 are set in XMM0)

If the second vector is all-zero, the flags don't depend at all on the bits in the first vector.
It may be useful to think of the "is-all-zero" checks as a NOT(bitwise horizontal-OR()) of the AND and ANDNOT results. But probably not, because that's too many steps for my brain to think through easily. That sequence of vertical-AND and then horizontal-OR does maybe make it easier to understand why PTEST doesn't tell you much about a combination of two unknown registers, just like the integer TEST instruction.
Here's a truth table for a 2-bit ptest a,mask. Hopefully this helps in thinking about mixes of zeros and ones with 128b inputs.
Note that CF(a,mask) == ZF(~a,mask).
a mask ZF CF
00 00 1 1
01 00 1 1
10 00 1 1
11 00 1 1
00 01 1 0
01 01 0 1
10 01 1 0
11 01 0 1
00 10 1 0
01 10 1 0
10 10 0 1
11 10 0 1
00 11 1 0
01 11 0 0
10 11 0 0
11 11 0 1
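The table can be regenerated from a scalar model of the two flag formulas. This is only a sketch in portable C: ptest_model, struct ptest_flags and print_truth_table are hypothetical names, and the model handles fewer than 32 bits rather than a real 128-bit register.

```c
#include <stdio.h>

/* Scalar model of PTEST's flag results, restricted to the low `bits` bits
   (bits must be < 32 in this sketch). */
struct ptest_flags { int zf, cf; };

static struct ptest_flags ptest_model(unsigned a, unsigned mask, unsigned bits)
{
    unsigned keep = (1u << bits) - 1;       /* only look at the modeled bits */
    struct ptest_flags f;
    f.zf = ((a & mask) & keep) == 0;        /* ZF: AND result is all-zero */
    f.cf = ((~a & mask) & keep) == 0;       /* CF: ANDNOT result is all-zero */
    return f;
}

/* Print the 2-bit truth table, mask in the outer loop like the table above. */
static void print_truth_table(void)
{
    puts("a  mask ZF CF");
    for (unsigned mask = 0; mask < 4; mask++) {
        for (unsigned a = 0; a < 4; a++) {
            struct ptest_flags f = ptest_model(a, mask, 2);
            printf("%u%u %u%u   %d  %d\n",
                   a >> 1, a & 1, mask >> 1, mask & 1, f.zf, f.cf);
        }
    }
}
```

Calling print_truth_table() from main walks the same sixteen input combinations; widening the loops is an easy way to experiment with larger bit counts.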
Intel's intrinsics guide lists 2 interesting intrinsics for it. Note the naming of the args: a and mask are a clue that they tell you about the parts of a selected by a known AND-mask.
_mm_test_mix_ones_zeros (__m128i a, __m128i mask): returns (ZF == 0 && CF == 0)
_mm_test_all_zeros (__m128i a, __m128i mask): returns ZF
There are also the more simply-named versions:
int _mm_testc_si128 (__m128i a, __m128i b): returns CF. (As Microsoft docs helpfully point out, this is 1 if all the bits set in b are set in a; otherwise 0.)
int _mm_testnzc_si128 (__m128i a, __m128i b): returns (ZF == 0 && CF == 0)
int _mm_testz_si128 (__m128i a, __m128i b): returns ZF (The intersection is zero.)

There are AVX __m256i versions of those intrinsics, but the guide only lists the all_zeros and mix_ones_zeros alternate-name versions for __m128i operands.
If you want to test some other condition from C or C++, you should use testc and testz with the same operands, and hope that your compiler realizes that it only needs to do one PTEST, and hopefully even use a single JCC, SETCC, or CMOVCC to implement your logic. (I'd recommend checking the asm, at least for the compiler you care about most.)
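As an illustration of calling testz and testc on the same operands, here is a sketch that classifies the bits of a selected by mask. The function name classify_masked_bits, its return codes, and the SSE41_FN macro are invented for this example; whether the compiler actually merges the two intrinsics into one PTEST is not guaranteed.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_testz_si128, _mm_testc_si128 */

/* GCC/Clang extension: enable SSE4.1 for just this function.
   Other compilers: the macro expands to nothing. */
#ifdef __GNUC__
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

/* Classify the bits of a that are selected by mask:
   0 = all selected bits are clear (also returned for an all-zero mask),
   1 = all selected bits are set,
   2 = a mix of ones and zeros. */
SSE41_FN static int classify_masked_bits(__m128i a, __m128i mask)
{
    int zf = _mm_testz_si128(a, mask);  /* ZF: (a & mask) == 0 */
    int cf = _mm_testc_si128(a, mask);  /* CF: (~a & mask) == 0 */
    if (zf) return 0;
    if (cf) return 1;
    return 2;
}
```

Both intrinsics read flags of the same ptest a, mask, so ideally the compiler emits one PTEST followed by branches or SETCCs on ZF and CF.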
Note that _mm_testz_si128(v, set1(0xff)) is always the same as _mm_testz_si128(v,v), because that's how AND works. But that's not true for the CF result.
You can check for a vector being all-ones using
bool is_all_ones = _mm_testc_si128(v, _mm_set1_epi8(0xff));
This is probably no faster (but smaller code-size) than a _mm_cmpeq_epi8 against a vector of all-ones, with the usual _mm_movemask_epi8() == 0xffff, at least if you're branching on it so the scalar cmp is free, fusing with the jcc, so both ways are 3 total uops including the (cmp)/jcc. The exception is Zen 1 & 2, where ptest is only 1 uop instead of the 2 it costs on other CPUs, so it has an advantage there: https://uops.info/. It doesn't avoid the need for a vector constant in this case.
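For comparison, the two all-ones checks side by side. This is a sketch: the function names and the SSE41_FN macro are made up here, and x86-64 (where SSE2 is baseline) is assumed for the second function.

```c
#include <smmintrin.h>  /* SSE4.1; also pulls in the SSE2 intrinsics */

/* GCC/Clang extension: enable SSE4.1 for just this function.
   Other compilers: the macro expands to nothing. */
#ifdef __GNUC__
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

/* PTEST version: CF = ((~v & ones) == 0), i.e. v has no zero bits. */
SSE41_FN static int is_all_ones_ptest(__m128i v)
{
    return _mm_testc_si128(v, _mm_set1_epi8(-1));
}

/* SSE2-only version: compare every byte with 0xff, then check that all
   16 bits of the byte mask are set. */
static int is_all_ones_movemask(__m128i v)
{
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8(-1));
    return _mm_movemask_epi8(eq) == 0xffff;
}
```

Both versions need the all-ones constant in a register. The movemask version also needs a scratch register for the compare result unless v can be overwritten or AVX is available.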
PTEST does have the advantage that it doesn't destroy either input operand, even without AVX.