gcc g++ simd auto-vectorization

How to auto-vectorize a loop with access stride 2 with g++, without OpenCL or intrinsics


I am trying to convert a function from an implementation using intrinsics into standard C++ (to simplify maintenance, portability, etc.). Everything worked fine, except for a loop with stride 2 where bytes at odd positions are gathered into one location and bytes at even positions are gathered into another location.

Related questions have been addressed using OpenCL or intrinsics, but I would like to stick to standard C++.

A minimal example of what I am trying to auto-vectorize would be something like this:

void f(const unsigned char *input, const unsigned size, unsigned char *output) {
  constexpr unsigned MAX_SIZE = 2000;
  unsigned char odd[MAX_SIZE / 2];
  unsigned char even[MAX_SIZE / 2];
  for (unsigned i = 0; size > i; ++i) {
    if (0 == i % 2) {even[i/2] = input[i];}
    else {odd[i/2] = input[i];}
  }
  //for (unsigned i = 0; size > i; i+=2) {
  //  even[i/2] = input[i];
  //  odd[i/2] = input[i+1];
  //}
  for (unsigned i = 0; size / 2 > i; ++i)
  {
    output[i] = (even[i] << 4) | odd[i];
  }

}

Compiling with g++-11.2, the output of -fopt-info-vec-missed is:

minimal.cpp:6:29: missed: couldn't vectorize loop
minimal.cpp:6:29: missed: not vectorized: control flow in loop.

If I change the implementation to the one that is commented out in the code, g++ fails to vectorize because:

minimal.cpp:11:29: missed: couldn't vectorize loop
minimal.cpp:13:24: missed: not vectorized: not suitable for gather load _13 = *_11;

Considering that it is straightforward to implement this with a packed-shuffle-bytes instruction (such as x86 pshufb), I am surprised that g++ can't do it.

Is there a way to re-write the loop so that g++ would be able to vectorize it?


Solution

  • I combined @Peter Cordes's comment with my initial answer:

    https://gcc.godbolt.org/z/bxzsfxPGx

    With this version, -fopt-info-vec-missed reports nothing for me:

    void f(const unsigned char *input, const unsigned size, unsigned char *output) {
        constexpr unsigned MAX_SIZE = 2000;
        unsigned char odd[MAX_SIZE / 2];
        unsigned char even[MAX_SIZE / 2];
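        // Separate index j for the stores, so they stay contiguous while the
        // loads step through input with stride 2. Assumes size is even and at
        // most MAX_SIZE; for odd size, input[i + 1] reads one byte past the end.
        for (unsigned i = 0, j = 0; size > i; i += 2, ++j) {
            even[j] = input[i];
            odd[j] = input[i + 1];
        }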
    
        for (unsigned i = 0; size / 2 > i; ++i) {
            output[i] = (even[i] << 4) | odd[i];
        }
    }
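
A possible further simplification, offered only as a sketch: since even[j] and odd[j] are each read exactly once, the two loops can be fused into a single loop over output positions, which also drops the MAX_SIZE temporaries. The names f_two_pass, f_single_pass and variants_agree are hypothetical and not part of the original code. I have not verified that g++ vectorizes the fused loop; that depends on the compiler version and flags, so check it with -fopt-info-vec-optimized and -fopt-info-vec-missed. The helper only confirms that both versions produce the same bytes.

```cpp
#include <cassert>
#include <vector>

// Two-pass version from the answer above, kept as the reference.
// Assumes size is even and size <= 2000, as the original code does.
void f_two_pass(const unsigned char *input, const unsigned size, unsigned char *output) {
    constexpr unsigned MAX_SIZE = 2000;
    unsigned char odd[MAX_SIZE / 2];
    unsigned char even[MAX_SIZE / 2];
    for (unsigned i = 0, j = 0; size > i; i += 2, ++j) {
        even[j] = input[i];
        odd[j] = input[i + 1];
    }
    for (unsigned i = 0; size / 2 > i; ++i) {
        output[i] = (even[i] << 4) | odd[i];
    }
}

// Hypothetical fused variant: one loop indexed by the output position j.
// Reads input[2*j] and input[2*j + 1]; a trailing odd byte is ignored.
void f_single_pass(const unsigned char *input, const unsigned size, unsigned char *output) {
    const unsigned pairs = size / 2;
    for (unsigned j = 0; pairs > j; ++j) {
        output[j] = static_cast<unsigned char>((input[2 * j] << 4) | input[2 * j + 1]);
    }
}

// Returns true when both versions agree on a deterministic test pattern.
bool variants_agree(const unsigned size) {
    assert(size % 2 == 0 && size <= 2000);  // preconditions of f_two_pass
    std::vector<unsigned char> input(size);
    for (unsigned i = 0; size > i; ++i) {
        input[i] = static_cast<unsigned char>(i * 37u + 11u);
    }
    std::vector<unsigned char> a(size / 2), b(size / 2);
    f_two_pass(input.data(), size, a.data());
    f_single_pass(input.data(), size, b.data());
    return a == b;
}
```

Calling variants_agree(2000) exercises the largest size the two-pass version supports. If the fused loop turns out not to vectorize with a given g++, the two-pass version above is the one the Godbolt link demonstrates.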