I am trying to convert a function from an implementation using intrinsics into standard C++ (to simplify maintenance, portability, etc.). Everything worked fine, except for a loop with stride 2 where bytes at even positions are gathered into one location and bytes at odd positions are gathered into another location.
Related questions have been addressed using OpenCL or intrinsics, but I would like to stick to standard C++.
A minimal example of what I am trying to auto-vectorize would be something like this:
void f(const unsigned char *input, const unsigned size, unsigned char *output) {
    constexpr unsigned MAX_SIZE = 2000;
    unsigned char odd[MAX_SIZE / 2];
    unsigned char even[MAX_SIZE / 2];
    for (unsigned i = 0; size > i; ++i) {
        if (0 == i % 2) {even[i/2] = input[i];}
        else {odd[i/2] = input[i];}
    }
    //for (unsigned i = 0; size > i; i+=2) {
    //    even[i/2] = input[i];
    //    odd[i/2] = input[i+1];
    //}
    for (unsigned i = 0; size / 2 > i; ++i)
    {
        output[i] = (even[i] << 4) | odd[i];
    }
}
Compiling with g++-11.2, the output of -fopt-info-vec-missed is:
minimal.cpp:6:29: missed: couldn't vectorize loop
minimal.cpp:6:29: missed: not vectorized: control flow in loop.
If I change the implementation to the one that is commented out in the code, g++ fails to vectorize because:
minimal.cpp:11:29: missed: couldn't vectorize loop
minimal.cpp:13:24: missed: not vectorized: not suitable for gather load _13 = *_11;
Considering that it is straightforward to implement this with packed shuffle bytes instructions, I am surprised that g++ can't do it.
Is there a way to rewrite the loop so that g++ is able to vectorize it?
I found @Peter Cordes's comment and combined it with my initial answer:
https://gcc.godbolt.org/z/bxzsfxPGx
With this version, -fopt-info-vec-missed prints nothing for me:
void f(const unsigned char *input, const unsigned size, unsigned char *output) {
    constexpr unsigned MAX_SIZE = 2000;
    unsigned char odd[MAX_SIZE / 2];
    unsigned char even[MAX_SIZE / 2];
    for (unsigned i = 0, j = 0; size > i; i += 2, ++j) {
        even[j] = input[i];
        odd[j] = input[i + 1];
    }
    for (unsigned i = 0; size / 2 > i; ++i) {
        output[i] = (even[i] << 4) | odd[i];
    }
}
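A possible further step, as a sketch only: since even[] and odd[] are just scratch space, the two loops can be fused so that each output byte is computed directly from one input pair. The names f_fused, f_two_loops and same_output below are hypothetical and are not taken from the Godbolt link. I am assuming size is even and at most 2000 for the comparison, and I have not confirmed which g++ versions or flags vectorize the fused loop, so check with -fopt-info-vec on your own setup.

```cpp
#include <cassert>
#include <vector>

// Hypothetical fused variant: no even/odd scratch arrays.
// A trailing unpaired byte (odd size) is ignored.
void f_fused(const unsigned char *input, const unsigned size, unsigned char *output) {
    const unsigned pairs = size / 2;
    for (unsigned j = 0; pairs > j; ++j) {
        // High nibble from the even-position byte, OR-ed with the odd-position byte.
        output[j] = static_cast<unsigned char>((input[2 * j] << 4) | input[2 * j + 1]);
    }
}

// The two-loop version from above, kept as a reference for comparison.
void f_two_loops(const unsigned char *input, const unsigned size, unsigned char *output) {
    constexpr unsigned MAX_SIZE = 2000;
    // Assumptions of this sketch: pairs are complete and fit the scratch arrays.
    assert(size % 2 == 0 && size <= MAX_SIZE);
    unsigned char odd[MAX_SIZE / 2];
    unsigned char even[MAX_SIZE / 2];
    for (unsigned i = 0, j = 0; size > i; i += 2, ++j) {
        even[j] = input[i];
        odd[j] = input[i + 1];
    }
    for (unsigned i = 0; size / 2 > i; ++i) {
        output[i] = static_cast<unsigned char>((even[i] << 4) | odd[i]);
    }
}

// Runs both versions on the same generated input and compares the results.
bool same_output(const unsigned size) {
    std::vector<unsigned char> input(size);
    for (unsigned i = 0; size > i; ++i) {
        input[i] = static_cast<unsigned char>(i * 37u + 11u);  // arbitrary test pattern
    }
    std::vector<unsigned char> a(size / 2), b(size / 2);
    f_two_loops(input.data(), size, a.data());
    f_fused(input.data(), size, b.data());
    return a == b;
}
```

Calling same_output with a few even sizes (for example 2, 64 and 2000) is a quick way to check that the fused loop still matches the two-loop version before looking at the generated assembly.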