
Using a variable to index a simd vector with _mm256_extract_epi32() intrinsic


I am using the AVX intrinsic _mm256_extract_epi32().

I am not entirely sure if I am using it correctly, though, because gcc doesn't like my code, whereas clang compiles it and runs it without issue.

I am extracting the lane based on the value of an integer variable, as opposed to using a constant.

Compiling the following snippet with Clang 3.8 (or Clang 4) for AVX2 succeeds, and the generated code uses the vpermd instruction.

#include <stdlib.h>
#include <immintrin.h>
#include <stdint.h>

uint32_t foo( int a, __m256i vec )
{
    uint32_t e = _mm256_extract_epi32( vec, a );
    return e*e;
}

If I use gcc instead, say gcc 7.2, compilation fails with these errors:

In file included from /opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/include/immintrin.h:41:0,
                 from <source>:2:
/opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/include/avxintrin.h: In function 'foo':
/opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/include/avxintrin.h:524:20: error: the last argument must be a 1-bit immediate
   return (__m128i) __builtin_ia32_vextractf128_si256 ((__v8si)__X, __N);
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/include/immintrin.h:37:0,
                 from <source>:2:
/opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/include/smmintrin.h:449:11: error: selector must be an integer constant in the range 0..3
    return __builtin_ia32_vec_ext_v4si ((__v4si)__X, __N);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I have two issues with this:

  1. Why is clang fine with a variable, while gcc wants a constant?
  2. Why can't gcc make up its mind? First it demands a 1-bit immediate value, and later it wants an integer constant in the range 0..3, and those are different things.

Intel's Intrinsics Guide doesn't specify constraints on the index value for _mm256_extract_epi32(), by the way, so who's right here, gcc or clang?


Solution

  • GCC and Clang apparently made different choices. IMHO GCC made the right one by not implementing this for variable indices. The intrinsic _mm256_extract_epi32 doesn't translate to a single instruction, so with a variable index it might lead to inefficient code if it is used in a performance-critical loop.

    For example, Clang 3.8 needs 4 instructions to implement _mm256_extract_epi32 with a variable index. GCC forces the programmer to think about more efficient code that avoids _mm256_extract_epi32 with variable indices.

    Nevertheless, it is sometimes useful to have a portable (gcc, clang, icc) function that emulates _mm256_extract_epi32 with a variable index:

    uint32_t mm256_extract_epi32_var_indx(const __m256i vec, const unsigned int i) {   
        __m128i indx = _mm_cvtsi32_si128(i);
        __m256i val  = _mm256_permutevar8x32_epi32(vec, _mm256_castsi128_si256(indx));
        return         _mm_cvtsi128_si32(_mm256_castsi256_si128(val));
    }    
    

    This should compile to three instructions after inlining: two vmovds and a vpermd (gcc 8.2 with -m64 -march=skylake -O3):

    mm256_extract_epi32_var_indx:
      vmovd xmm1, edi
      vpermd ymm0, ymm1, ymm0
      vmovd eax, xmm0
      vzeroupper
      ret
    

    Note that the Intrinsics Guide describes the result as 0 for indices >= 8 (which is an unusual case anyway). With Clang 3.8, and with mm256_extract_epi32_var_indx, the index is instead reduced modulo 8; in other words, only the 3 least significant bits of the index are used. Clang 5.0's round trip to memory isn't very efficient either, see this Godbolt link. Clang 7.0 fails to compile _mm256_extract_epi32 with variable indices.
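
    The two behaviors for out-of-range indices can be made explicit. The following is a minimal sketch, not part of the original answer: it repeats the emulation function and adds a hypothetical wrapper, extract_or_zero, that returns 0 for indices >= 8 as the guide describes. The AVX2_FN macro and the array-based helpers are assumptions, added only so that the file builds with GCC or Clang without -mavx2 and can be called from plain C; other compilers need their own AVX2 switch.

    ```c
    #include <immintrin.h>
    #include <stdint.h>

    /* Assumption: GCC/Clang target attribute, so no -mavx2 flag is needed. */
    #if defined(__GNUC__)
    #define AVX2_FN __attribute__((target("avx2")))
    #else
    #define AVX2_FN
    #endif

    /* The emulation function: vpermd only looks at the low 3 bits of i. */
    AVX2_FN static inline uint32_t mm256_extract_epi32_var_indx(const __m256i vec, const unsigned int i)
    {
        __m128i indx = _mm_cvtsi32_si128((int)i);
        __m256i val  = _mm256_permutevar8x32_epi32(vec, _mm256_castsi128_si256(indx));
        return (uint32_t)_mm_cvtsi128_si32(_mm256_castsi256_si128(val));
    }

    /* Hypothetical wrapper: yields 0 for i >= 8 instead of wrapping around. */
    AVX2_FN static inline uint32_t extract_or_zero(const __m256i vec, const unsigned int i)
    {
        return (i < 8) ? mm256_extract_epi32_var_indx(vec, i) : 0;
    }

    /* Hypothetical array-based helpers: load 8 values, then extract. */
    AVX2_FN uint32_t extract_mod8_from_array(const uint32_t data[8], unsigned int i)
    {
        return mm256_extract_epi32_var_indx(_mm256_loadu_si256((const __m256i *)data), i);
    }

    AVX2_FN uint32_t extract_or_zero_from_array(const uint32_t data[8], unsigned int i)
    {
        return extract_or_zero(_mm256_loadu_si256((const __m256i *)data), i);
    }
    ```

    With the values 10, 11, ..., 17 in data, index 9 gives 11 through the modulo-8 version (9 mod 8 = 1) and 0 through the wrapper. The extra compare costs a little, so the wrapper is only worth having where out-of-range indices can actually occur.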

    As @Peter Cordes commented: with a fixed index 0, 1, 2, or 3, only a single pextrd instruction is needed to extract the integer from the xmm register. With a fixed index 4, 5, 6, or 7, two instructions are required. Unfortunately, a vpextrd instruction working on 256-bit ymm registers doesn't exist.
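
    The fixed-index case can be written out by hand, which also shows where the two different GCC errors in the question come from: a 1-bit selector picks the 128-bit half, and a selector in the range 0..3 picks the element within that half. This is a sketch under assumptions: the function names are hypothetical, the data is loaded from an array only to keep the example callable from plain C, and the AVX2_FN macro is there so GCC or Clang can build it without -mavx2.

    ```c
    #include <immintrin.h>
    #include <stdint.h>

    /* Assumption: GCC/Clang target attribute, so no -mavx2 flag is needed. */
    #if defined(__GNUC__)
    #define AVX2_FN __attribute__((target("avx2")))
    #else
    #define AVX2_FN
    #endif

    /* Element 1 sits in the low half: the cast is free, one extract remains. */
    AVX2_FN uint32_t extract_element1(const uint32_t data[8])
    {
        __m256i vec = _mm256_loadu_si256((const __m256i *)data);
        return (uint32_t)_mm_extract_epi32(_mm256_castsi256_si128(vec), 1);
    }

    /* Element 5 sits in the high half: first take half 5 / 4 = 1,
       then element 5 % 4 = 1 within that half. */
    AVX2_FN uint32_t extract_element5(const uint32_t data[8])
    {
        __m256i vec = _mm256_loadu_si256((const __m256i *)data);
        __m128i hi  = _mm256_extracti128_si256(vec, 1);
        return (uint32_t)_mm_extract_epi32(hi, 1);
    }
    ```

    Both selectors must be compile-time constants here, because they are encoded as immediates in the instructions.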


    The next example illustrates my answer:

    A naive programmer starting with SIMD intrinsics might write the following code to sum the elements 0, 1, ..., j-1, with j<8, from vec.

    #include <stdlib.h>
    #include <immintrin.h>
    #include <stdint.h>
    
    uint32_t foo( __m256i vec , int j)
    {   
        uint32_t sum=0;
        for (int i = 0; i < j; i++){
            sum = sum + (uint32_t)_mm256_extract_epi32( vec, i );
        }
        return sum;
    }
    

    With Clang 3.8 this compiles to about 50 instructions with branches and loops. GCC fails to compile this code. An efficient way to sum these elements would more likely take two steps:

    1. mask out the elements j, j+1, ..., 7, and
    2. compute the horizontal sum.
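
    The two steps above can be sketched as follows. This is one possible implementation, not necessarily the fastest: the name sum_first_j, the compare-based mask, and the shuffle-based horizontal sum are my own choices, and the code assumes 0 <= j <= 8. The AVX2_FN macro and the array wrapper are assumptions too, added so that GCC or Clang can build the file without -mavx2 and the function can be called from plain C.

    ```c
    #include <immintrin.h>
    #include <stdint.h>

    /* Assumption: GCC/Clang target attribute, so no -mavx2 flag is needed. */
    #if defined(__GNUC__)
    #define AVX2_FN __attribute__((target("avx2")))
    #else
    #define AVX2_FN
    #endif

    /* Sum elements 0 .. j-1 of vec, assuming 0 <= j <= 8 (hypothetical name). */
    AVX2_FN static inline uint32_t sum_first_j(const __m256i vec, const int j)
    {
        /* Step 1: keep element k only if k < j, zero the rest. */
        const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        __m256i keep = _mm256_cmpgt_epi32(_mm256_set1_epi32(j), lane);
        __m256i v    = _mm256_and_si256(vec, keep);

        /* Step 2: horizontal sum, halving the width at each step: 8 -> 4 -> 2 -> 1. */
        __m128i s = _mm_add_epi32(_mm256_castsi256_si128(v),
                                  _mm256_extracti128_si256(v, 1));
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
        return (uint32_t)_mm_cvtsi128_si32(s);
    }

    /* Hypothetical array wrapper: load 8 values, then sum the first j. */
    AVX2_FN uint32_t sum_first_j_from_array(const uint32_t data[8], int j)
    {
        return sum_first_j(_mm256_loadu_si256((const __m256i *)data), j);
    }
    ```

    For the values 1, 2, ..., 8 and j = 3 this returns 1 + 2 + 3 = 6. There is no loop and no branch, and no extract with a variable index is needed at all.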