I've been using the excellent godbolt.org to see what gcc does and doesn't vectorize, but I can't find any way of getting it to vectorize a min(X,Y) function into PMINUQ etc.
Looking at the sse.md machine description language file in the gcc source, I can see a block around lines 12355 onwards that mentions p<maxmin_int><ssemodesuffix>, which looks to me as though it ought to output PMINUQ etc. So I can't see any reason why compiling for this pattern with -msse4 -msse4.1 shouldn't just work.
However, this part of the md also has a "&& " line inside it, which seems (?) to imply that this opcode only works on AVX-style wide targets.
So, I can't tell whether this is a hardware limitation, a compiler/md bug, a godbolt.org problem with -msse4.1, or something else entirely. Can anyone help me narrow this down a bit?
gcc -msse4 -msse4.1 -msse4.2 -O3 -fopt-info-vec-all
#include <stdint.h>
#define MAX_LOOPS 10000
uint64_t in_array[MAX_LOOPS];
uint64_t out_array[MAX_LOOPS];
void do_max(uint64_t maxval)
{
    for (int i=0; i<MAX_LOOPS; i++)
        out_array[i] = (in_array[i] < maxval) ? in_array[i] : maxval;
}
godbolt.org tells me I'm getting...
pcmpeqq xmm0, xmm1
pandn xmm0, xmm2
...when I'm hoping for...
pminuq xmm0, xmm1
vpminuq requires AVX512 (https://www.felixcloutier.com/x86/pminud:pminuq). SSE4.1 / AVX2 only has pminub/w/d. Try using arrays with 32-bit elements.
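As a minimal sketch of that suggestion, here is the same loop with 32-bit elements. The names in_array32, out_array32 and do_min32 are made up for illustration and are not from the question. With something like gcc -O3 -msse4.1 I would expect the loop body to become a pminud, but that is an expectation to check on godbolt.org rather than a guarantee.

```c
#include <stdint.h>

#define MAX_LOOPS 10000

/* Hypothetical 32-bit counterparts of the arrays in the question. */
uint32_t in_array32[MAX_LOOPS];
uint32_t out_array32[MAX_LOOPS];

/* Clamp every element to at most maxval (an unsigned 32-bit min). */
void do_min32(uint32_t maxval)
{
    for (int i = 0; i < MAX_LOOPS; i++)
        out_array32[i] = (in_array32[i] < maxval) ? in_array32[i] : maxval;
}
```

If the data really has to stay 64-bit, the original loop would need an AVX512 target (for example -mavx512f) before vpminuq becomes available to the compiler.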