I'm looking for an SSE bitwise OR between components of the same vector. (Editor's note: this is potentially an X-Y problem, see below for the real comparison logic.)
I am porting some SIMD logic from SPU intrinsics. It has an instruction
spu_orx(a)
Which, according to the docs:
spu_orx: OR word across. d = spu_orx(a). The four word elements of vector a are logically ORed. The result is returned in word element 0 of vector d. All other elements (1,2,3) of d are assigned a value of zero.
How can I do that with SSE2 - SSE4 using a minimum of instructions? _mm_or_ps is what I've got so far.
UPDATE:
Here is the scenario from SPU based code:
qword res = spu_orx(spu_or(spu_fcgt(x, y), spu_fcgt(z, w)))
So it first ORs two 'greater than' comparisons vertically, then ORs that result horizontally across the vector. Later, pairs of those results are ANDed to get the final comparison value.
This is effectively doing (A||B||C||D||E||F||G||H) && (I||J||K||L||M||N||O||P) && ...
where A..D are the 4x 32-bit elements of the fcgt(x,y)
and so on.
Obviously a vertical _mm_or_ps of _mm_cmp_ps results is a good way to reduce down to 1 vector, but then what? Shuffle + OR, or something else?
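For concreteness, here's a sketch of where I am (assuming _mm_cmpgt_ps as the SSE counterpart of spu_fcgt; the function name is just my placeholder):

#include <immintrin.h>

__m128 reduce_to_one_vector(__m128 x, __m128 y, __m128 z, __m128 w) {
    // vertical OR of two greater-than compares, like spu_or(spu_fcgt, spu_fcgt)
    __m128 res = _mm_or_ps(_mm_cmpgt_ps(x, y), _mm_cmpgt_ps(z, w));
    return res; // but how do I OR res's 4 elements into element 0, like spu_orx?
}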
UPDATE 1
Regarding "but then what?" I perform
qword res = spu_orx(spu_or(spu_fcgt(x, y), spu_fcgt(z, w)))
On SPU it goes like this:
qword aRes  = si_and(res,  res1);   // AND together the orx'd compare results
qword aRes1 = si_and(aRes, res2);
qword aRes2 = si_and(aRes1, res3);
return si_to_uint(aRes2);           // word element 0 out as a scalar
several times on different inputs, then AND those all into a single result, which is finally cast to an integer 0 or 1 (a false/true test)
SSE4.1 bool any_nonzero = !_mm_testz_si128(v,v); would be a good way to horizontal-OR + booleanize a vector into a 0/1 integer. It will compile to multiple instructions, and ptest same,same is 2 uops on its own. But once you have the result as a scalar integer, scalar AND is even cheaper than any vector instruction, and you can branch on the result directly because it sets integer flags.
#include <immintrin.h>
bool any_nonzero_bit(__m128i v) {
return !_mm_testz_si128(v,v);
}
On Godbolt with gcc9.1 -O3 -march=nehalem:
any_nonzero(long long __vector(2)):
ptest xmm0, xmm0 # 2 uops
setne al # 1 uop with false dep on old value of RAX
ret
This is only 3 uops total on Intel for a horizontal OR into a single bit in an integer register. On AMD Ryzen, ptest is only 1 uop so it's even better.
The only risk here is if gcc or clang creates false dependencies by not xor-zeroing eax before doing a setcc into AL. Usually gcc is pretty fanatical about spending extra uops to break false dependencies, so I don't know why it doesn't here. (I did check with -march=skylake and -mtune=generic in case it was relying on Nehalem partial-register renaming for -march=nehalem. Even -march=znver1 didn't get it to xor-zero EAX before the ptest.)
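Putting that together for the question's pattern, a hedged sketch (the function name and grouping are mine, not from the original code):

#include <immintrin.h>

// true iff any of the 8 compares (x>y per element, z>w per element) is true
static inline bool any_gt8(__m128 x, __m128 y, __m128 z, __m128 w) {
    __m128 anyv = _mm_or_ps(_mm_cmpgt_ps(x, y), _mm_cmpgt_ps(z, w)); // vertical OR
    __m128i vi = _mm_castps_si128(anyv);
    return !_mm_testz_si128(vi, vi);   // ptest: ZF=1 iff all bits are zero
}

// then the final combine is cheap scalar logic:
// bool result = any_gt8(a,b,c,d) && any_gt8(e,f,g,h) && ...;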
It would be nice if we could avoid the _mm_or_ps and have PTEST do all the work. But even if we consider inverting the comparisons, the vertical-AND / horizontal-OR behaviour doesn't let us check something about all 8 elements of 2 vectors, or about any of those 8 elements.
e.g. Can PTEST be used to test if two registers are both zero or some other condition?
// NOT USEFUL
// 1 if all the vertical pairs AND to zero.
// but 0 if even one vertical AND result is non-zero
_mm_testz_si128( _mm_castps_si128(_mm_cmpngt_ps(x,y)),
_mm_castps_si128(_mm_cmpngt_ps(z,w)));
I mention this only to rule it out and save you the trouble of considering this optimization idea. (@chtz suggested it in comments. Inverting the comparison is a good idea that can be useful for other ways of doing things.)
We might be able to delay horizontal ORing / booleanizing until after combining some results from multiple vectors. This makes combining more expensive (imul or something), but saves 2 uops in the vector -> integer stage vs. PTEST.
x86 has cheap vector mask -> integer bitmap with _mm_movemask_ps. Especially if you ultimately want to branch on the result, this might be a good idea. (But x86 doesn't have a || instruction that booleanizes its inputs either, so you can't just & the movemask results.)
One thing you can do is integer-multiply the movemask results: x * y is non-zero iff both inputs are non-zero, unlike x & y which can be zero for 0b0101 & 0b1010, for example. (Our inputs are 4-bit movemask results and unsigned is 32-bit, so we have some room before we overflow.) AMD Bulldozer-family has an integer multiply that isn't fully pipelined, so this could be a bottleneck on old AMD CPUs. Using just 32-bit integers is also good for some low-power CPUs with slow 64-bit multiply.
This might be good if throughput is more of a bottleneck than latency, although movmskps can only run on one port.
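A hedged sketch of that multiply trick (names are mine; it assumes the same cmpgt inputs as before):

#include <immintrin.h>

// (any of group 0) && (any of group 1), combined via integer multiply.
// Each movemask is a 4-bit bitmap: non-zero iff some element compared true.
static inline bool both_groups_hit(__m128 x, __m128 y, __m128 z, __m128 w,
                                   __m128 a, __m128 b, __m128 c, __m128 d) {
    unsigned m0 = _mm_movemask_ps(_mm_or_ps(_mm_cmpgt_ps(x, y), _mm_cmpgt_ps(z, w)));
    unsigned m1 = _mm_movemask_ps(_mm_or_ps(_mm_cmpgt_ps(a, b), _mm_cmpgt_ps(c, d)));
    return m0 * m1;  // non-zero iff both bitmaps are non-zero; 4-bit values can't overflow
}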
I'm not sure if there are any cheaper integer operations that let us recover the logical-AND result later. Adding doesn't work: the result is non-zero even if only one of the inputs was non-zero. Concatenating the bits together (shift+or) is also of course like an OR if we eventually just test for any non-zero bit. We can't just bitwise AND because 2 & 1 == 0, unlike 2 && 1.
Horizontal OR of 4 elements takes multiple steps.
The obvious way is _mm_movehl_ps + OR, then another shuffle+OR. (See Fastest way to do horizontal float vector sum on x86 but replace _mm_add_ps with _mm_or_ps.)
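A minimal sketch of that reduction (my function name; it returns the OR of all four 32-bit elements as a scalar):

#include <immintrin.h>

static inline unsigned hor_or(__m128 v) {
    __m128 hi   = _mm_movehl_ps(v, v);             // elements 2,3 down into 0,1
    __m128 or2  = _mm_or_ps(v, hi);                // reduce 4 -> 2 candidates
    __m128 shuf = _mm_shuffle_ps(or2, or2, _MM_SHUFFLE(1,1,1,1)); // broadcast element 1
    __m128 or1  = _mm_or_ps(or2, shuf);            // full OR now in element 0
    return (unsigned)_mm_cvtsi128_si32(_mm_castps_si128(or1));
}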
But since we don't actually need an exact bitwise-OR when our inputs are compare results, we just care if any element is non-zero. We can and should think of the vectors as integer, and look at integer instructions like 64-bit element ==. One 64-bit element covers/aliases two 32-bit elements.
__m128i cmp = _mm_castps_si128(cmpps_result); // reinterpret: zero instructions
// SSE4.1 pcmpeqq 64-bit integer elements
__m128i cmp64 = _mm_cmpeq_epi64(cmp, _mm_setzero_si128()); // -1 if both elements were zero, otherwise 0
__m128i swap = _mm_shuffle_epi32(cmp64, _MM_SHUFFLE(1,0, 3,2)); // copy and swap, no movdqa instruction needed even without AVX
__m128i bothzero = _mm_and_si128(cmp64, swap); // both halves have the full result
After this logical inversion, ORing together multiple bothzero results will give you the AND of multiple conditions you're looking for.
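For example (a hedged fragment: bothzero0 and bothzero1 are assumed to be results of the block above for two different compare vectors):

// OR of the "whole vector was all-false" masks: non-zero iff some group missed.
__m128i anyfalse = _mm_or_si128(bothzero0, bothzero1);
bool every_group_had_a_hit = _mm_testz_si128(anyfalse, anyfalse); // SSE4.1 ptest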
Alternatively, SSE4.1 _mm_minpos_epu16(cmp64) (phminposuw) will tell us in 1 uop (but 5 cycle latency) if either qword is zero. In this case it will place either 0 or 0xFFFF in the lowest word (16 bits) of the result.
If we inverted the original compares, we could use phminposuw on that (without pcmpeqq) to check if any are zero. So basically a horizontal AND across the whole vector (assuming elements of 0 / -1). I think that's a useful result for inverted inputs. (And it saves us from using _mm_xor_si128 to flip the bits.)
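A hedged sketch of that inverted-compare version (my function name):

#include <immintrin.h>

// "any element of x>y is true", via phminposuw on the inverted compare.
// inv has 0 where x>y held and -1 elsewhere, so the minimum 16-bit word
// is 0 iff at least one element of the original compare was true.
static inline bool any_gt_minpos(__m128 x, __m128 y) {
    __m128i inv = _mm_castps_si128(_mm_cmpngt_ps(x, y));
    __m128i mp  = _mm_minpos_epu16(inv);     // word 0 = min of the 8 words
    return _mm_extract_epi16(mp, 0) == 0;
}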
An alternative to pcmpeqq (_mm_cmpeq_epi64) would be SSE2 psadbw against a zeroed vector to get 0 or non-zero results in the bottom of each 64-bit element. It won't be a mask, though: each qword sums to 0xFF * 8 = 2040 if both of its dword elements were all-ones, or 0xFF * 4 = 1020 if only one was. The possible non-zero values share set bits, so you can still AND results together. And it doesn't invert.
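A hedged fragment, reusing cmpps_result from the snippet above:

__m128i sums = _mm_sad_epu8(_mm_castps_si128(cmpps_result),  // psadbw against zero
                            _mm_setzero_si128());
// each 64-bit element of sums is now 0, 1020 (0xFF*4), or 2040 (0xFF*8)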