Can FP compares like SSE2 _mm_cmpeq_pd / AVX _mm_cmp_pd be used to compare 64 bit integers?
The idea is to emulate the missing _mm_cmpeq_epi64, which would be similar to _mm_cmpeq_epi8, _mm_cmpeq_epi16, and _mm_cmpeq_epi32.
My concern is that I'm not sure whether the comparison is bitwise or follows floating-point semantics, e.g. NaN values always comparing unequal.
AVX implies that SSE4.1 pcmpeqq is available; in that case you should just use _mm_cmpeq_epi64.
FP compares treat NaN != NaN and -0.0 == +0.0, and if DAZ is set in MXCSR, they treat any small integer (one whose exponent bits [62:52] are all zero) as zero. (An exponent field of 0 means the value represents a denormal, and Denormals-Are-Zero mode treats denormals as exactly zero on input, to avoid possible speed penalties for any operation on any microarchitecture, including compares. IIRC, modern microarchitectures don't have a penalty for subnormal inputs to compares, but still do for some other operations. In any case, programs built with -ffast-math set FTZ and DAZ for the main thread on startup.)
So FP compares are not really usable for integers unless you know that some but not all of bits [62:52] (inclusive) will be set.
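To make those failure cases concrete, here is a minimal sketch assuming an x86 target with SSE2, a compiler that provides _mm_set1_epi64x (GCC and clang do), and a build without -ffast-math. The helper name fp_says_equal is made up for illustration; it reinterprets two 64-bit integers as doubles and reports what cmpeqpd says about them.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: compare two 64-bit integer bit patterns with the
 * FP compare cmpeqpd. Returns 1 if cmpeqpd reports "equal", else 0. */
static int fp_says_equal(uint64_t a, uint64_t b)
{
    /* Reinterpret the integer bits as doubles; no conversion happens. */
    __m128d va = _mm_castsi128_pd(_mm_set1_epi64x((long long)a));
    __m128d vb = _mm_castsi128_pd(_mm_set1_epi64x((long long)b));

    /* movmskpd extracts the sign bit of each 64-bit compare result;
     * bit 0 is enough because both elements hold the same value. */
    return _mm_movemask_pd(_mm_cmpeq_pd(va, vb)) & 1;
}
```

With this, fp_says_equal(0x7FF8000000000000, 0x7FF8000000000000) is 0 because that bit pattern is a NaN, and fp_says_equal(0, 0x8000000000000000) is 1 because the two patterns are +0.0 and -0.0. Both answers are wrong for an integer compare.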
It's much easier to use pcmpeqd (_mm_cmpeq_epi32) than to hack up some FP bit-manipulation. (Although @chtz suggested in comments that you could do 42.0 == (42.0 ^ (a^b)) with xorpd, as long as the compiler doesn't optimize away the constant and compare against 0.0. Optimizing it that way without -ffast-math is a GCC bug.)
If you want a condition like at-least-one-match, then you need to make sure both halves of a 64-bit element matched, like mask & (mask<<1) on a movmskps result, which can compile to lea / test. Only bits 1 and 3 of that result are meaningful (check them with & 0xA), because bit 2 pairs the high half of element 0 with the low half of element 1. (You could do mask & (mask<<4) on a pmovmskb result, but that's slightly less efficient because LEA copy-and-shift can only shift by 0..3.)
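The at-least-one-match check could be sketched like this, assuming SSE2 only. The function name any_equal_epi64 is hypothetical, and the final & 0xA is there to ignore the bit that mixes halves of two different elements.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns 1 if at least one 64-bit element of a
 * equals the corresponding element of b, else 0. */
static int any_equal_epi64(__m128i a, __m128i b)
{
    /* Compare as four 32-bit elements (pcmpeqd). */
    __m128i eq32 = _mm_cmpeq_epi32(a, b);

    /* movmskps: one bit per 32-bit element.
     * bits 0,1 = low/high half of 64-bit element 0,
     * bits 2,3 = low/high half of 64-bit element 1. */
    unsigned mask = (unsigned)_mm_movemask_ps(_mm_castsi128_ps(eq32));

    /* Bit 1 of both = (both halves of element 0 matched),
     * bit 3 of both = (both halves of element 1 matched).
     * Bit 2 would pair halves of different elements, so drop it. */
    unsigned both = mask & (mask << 1);
    return (both & 0xAu) != 0;
}
```

A case worth testing is one where only the high half of element 0 and the low half of element 1 match: without the & 0xA that would be reported as a match.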
Of course "all-matched" doesn't care about element sizes, so you can just use _mm_movemask_epi8 on any compare result and check it against 0xFFFF.
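As a small sketch of the all-matched case (SSE2 only; the name all_equal_128 is made up for illustration), a 32-bit compare is enough because every byte of the result has to be set anyway:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns 1 if all 128 bits of a and b are equal,
 * which is the same as "both 64-bit elements matched". */
static int all_equal_128(__m128i a, __m128i b)
{
    /* Element size doesn't matter here; pcmpeqd is used as an example. */
    __m128i eq = _mm_cmpeq_epi32(a, b);

    /* pmovmskb gives one bit per byte; all 16 set means everything matched. */
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```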
If you want to use it for a blend with and/andnot/or, you can use pshufd to swap halves within each 64-bit element and pand that with the original compare result, so an element is all-ones only where both halves matched. (If you were feeding pblendvb or blendvpd, that would mean SSE4.1 was available, so you should have used pcmpeqq.)
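A minimal sketch of that emulation, assuming SSE2 only. The names cmpeq_epi64_sse2 and select_epi64 are hypothetical, not existing intrinsics.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */

/* Hypothetical stand-in for _mm_cmpeq_epi64 using only SSE2:
 * each 64-bit element of the result is all-ones if equal, else zero. */
static __m128i cmpeq_epi64_sse2(__m128i a, __m128i b)
{
    /* pcmpeqd: per-32-bit equality. */
    __m128i eq32 = _mm_cmpeq_epi32(a, b);

    /* pshufd: swap the two 32-bit halves inside each 64-bit element. */
    __m128i swapped = _mm_shuffle_epi32(eq32, _MM_SHUFFLE(2, 3, 0, 1));

    /* pand: both halves must have matched. */
    return _mm_and_si128(eq32, swapped);
}

/* Hypothetical and/andnot/or blend: picks x where mask is set, else y. */
static __m128i select_epi64(__m128i mask, __m128i x, __m128i y)
{
    /* _mm_andnot_si128(m, y) computes (~m) & y. */
    return _mm_or_si128(_mm_and_si128(mask, x), _mm_andnot_si128(mask, y));
}
```

For example, select_epi64(cmpeq_epi64_sse2(a, b), x, y) takes each 64-bit element from x where a and b are equal, and from y elsewhere.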
The more expensive one to emulate is SSE4.2 pcmpgtq, although I think GCC and/or clang do know how to emulate it when auto-vectorizing.
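For comparison, here is one possible SSE2-only emulation of a signed 64-bit greater-than. It is a sketch, not necessarily the sequence any compiler emits, and the name cmpgt_epi64_sse2 is made up. The idea: the high halves decide the result unless they are equal, in which case the borrow out of the 64-bit subtraction b - a tells whether the low halves satisfy a > b as unsigned values.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Hypothetical stand-in for _mm_cmpgt_epi64 (signed a > b) using only SSE2:
 * each 64-bit element of the result is all-ones if a > b, else zero. */
static __m128i cmpgt_epi64_sse2(__m128i a, __m128i b)
{
    /* High 32 bits of (b - a): when the high halves of a and b are equal,
     * this is all-ones exactly if the low half of a is above that of b. */
    __m128i diff = _mm_sub_epi64(b, a);

    /* Keep that borrow information only where the high halves are equal. */
    __m128i r = _mm_and_si128(_mm_cmpeq_epi32(a, b), diff);

    /* Where the high halves differ, the signed 32-bit compare decides. */
    r = _mm_or_si128(r, _mm_cmpgt_epi32(a, b));

    /* Only the high 32 bits of each element are meaningful so far;
     * broadcast them to both halves of the 64-bit element. */
    return _mm_shuffle_epi32(r, _MM_SHUFFLE(3, 3, 1, 1));
}
```

That is five instructions against one pcmpgtq, which is why this one is the more expensive emulation.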