c++ c gcc sse sse4

How can I get gcc to vectorize code using the SSE4.1 pminuq/pminud/etc opcodes?


I've been using the excellent godbolt.org to see what gcc does and doesn't vectorize, but I can't find any way of getting it to vectorize a min(X,Y) function into a PMINUQ etc.

Looking at the sse.md machine description language file in the gcc source, I can see a block around lines 12355 onwards that mentions p<maxmin_int><ssemodesuffix>, which looks to me as though it ought to output PMINUQ etc. So I can't see any reason why compiling for this pattern with -msse4 -msse4.1 shouldn't just work.

However, this part of the md also has a "&& " line inside it, which seems (?) to imply that this opcode only works on AVX-style wide targets.

So, I can't tell whether this is a hardware limitation, a compiler/md bug, a godbolt.org problem with -msse4.1, or something else entirely. Can anyone help me narrow this down a bit?

gcc -msse4 -msse4.1 -msse4.2 -O3 -fopt-info-vec-all

#include <stdint.h>

#define MAX_LOOPS 10000

uint64_t in_array[MAX_LOOPS];
uint64_t out_array[MAX_LOOPS];

void do_max(uint64_t maxval)
{
    for (int i=0; i<MAX_LOOPS; i++)
        out_array[i] = (in_array[i] < maxval) ? in_array[i] : maxval;
}

godbolt.org tells me I'm getting...

    pcmpeqq xmm0, xmm1
    pandn   xmm0, xmm2

...when I'm hoping for...

    pminuq  xmm0, xmm1

Solution

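  • vpminuq requires AVX512. (https://www.felixcloutier.com/x86/pminud:pminuq)

    SSE4.1 / AVX2 only have unsigned min for 8-, 16- and 32-bit elements (pminub/pminuw/pminud), so there is no 64-bit pminuq for gcc to emit with -msse4.1. This is a hardware limitation, not a compiler or godbolt.org problem. Try using arrays with 32-bit elements.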
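
    For illustration, here is a minimal sketch of the 32-bit variant, assuming your values fit in uint32_t. The names in_array32, out_array32 and do_min32 are made up for this example and are not from the question. Built with gcc -O3 -msse4.1, this loop is the kind of thing I'd expect to be eligible for pminud, but check the godbolt output rather than taking that on trust.

```c
#include <stdint.h>

#define MAX_LOOPS 10000

/* Hypothetical 32-bit versions of the arrays from the question. */
uint32_t in_array32[MAX_LOOPS];
uint32_t out_array32[MAX_LOOPS];

/* Clamp every element to at most maxval (an element-wise unsigned min). */
void do_min32(uint32_t maxval)
{
    for (int i = 0; i < MAX_LOOPS; i++)
        out_array32[i] = (in_array32[i] < maxval) ? in_array32[i] : maxval;
}
```

    If you really need 64-bit elements, the compare-and-mask sequence you are already seeing is what gcc has to fall back on below AVX512.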