I'm new to the world of intrinsics, and I got here because I saw a way to achieve transparent code generation, i.e. what you see is what you get, as well as reproducibility: for a system supporting e.g. AVX2, I know I'll end up with the same instructions in the end, given that I use AVX2 intrinsics. This is an important step towards writing HPC libraries that make use of SIMD. Feel free to correct my way of thinking.
Now, I have implemented a 3D vector dot product function in three variants in a micro-benchmarking setting. The code has been compiled using the GNU compiler v11.1.0 and run on a machine with an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz and 32 GiB of DDR4 RAM. The single-threaded read-write memory bandwidth of this system has been measured at ~34 GiB/s by running a DAXPY benchmark.
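For reference, here is a minimal sketch of the kind of DAXPY kernel I mean by that (the timing harness is omitted; assume arrays much larger than the last-level cache, so the kernel streams from memory with 2 loads + 1 store, i.e. 24 bytes of traffic, per element):
void daxpy(double* __restrict y, const double* __restrict x, const double a, const std::size_t size)
{
    for (auto index = std::size_t{}; index < size; ++index)
    {
        y[index] += a * x[index]; // 2 loads + 1 store per element
    }
}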
First, let me present the elementary structures.
#include <cstddef>
#include <vector>
#include <immintrin.h>

template<class Treal_t>
struct vector3
{
    Treal_t data[3] = {};
    inline Treal_t& operator()(const std::size_t& index) { return data[index]; }
    inline const Treal_t& operator()(const std::size_t& index) const { return data[index]; }
    inline Treal_t l2_norm_sq() const { return data[0] * data[0] + data[1] * data[1] + data[2] * data[2]; }
};
// Strictly speaking, the following is a class of its own that implements a subset of
// std::vector's functionality. The motivation is to be able to allocate memory
// without "touching" the data, a requirement that is crucial for "cold"
// microbenchmarking (a sketch of the idea follows below the alias).
template<class Treal_t>
using vector3_array = std::vector<vector3<Treal_t>>;
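For illustration, the "no-touch" allocation can be approximated with an allocator that default-initializes elements instead of value-initializing them. This is a hypothetical sketch rather than my actual class, and note that it only leaves the memory untouched if the = {} member initializer is removed, so that vector3 stays trivially default-constructible:
#include <memory>
#include <utility>

template<class T, class A = std::allocator<T>>
struct default_init_allocator : A
{
    using A::A;
    template<class U>
    struct rebind
    {
        using other = default_init_allocator<U, typename std::allocator_traits<A>::template rebind_alloc<U>>;
    };
    // default-initialization: no zeroing happens for trivial types
    template<class U>
    void construct(U* ptr)
    {
        ::new (static_cast<void*>(ptr)) U;
    }
    // forward all other construction to the base allocator
    template<class U, class... Args>
    void construct(U* ptr, Args&&... args)
    {
        std::allocator_traits<A>::construct(static_cast<A&>(*this), ptr, std::forward<Args>(args)...);
    }
};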
First up is my scalar code, which I compile with the flag "-O0".
void dot_product_novec(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
static constexpr auto inc = 6;
static constexpr auto dot_products_per_inc = inc / 3;
const auto stream_size_div = varray.size() * 3 / inc * inc;
const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
auto dot_product_index = std::size_t{};
for (auto index = std::size_t{}; index < stream_size_div; index += inc, dot_product_index += dot_products_per_inc)
{
dot_products[dot_product_index] = float_stream[index] * float_stream[index] + float_stream[index + 1] * float_stream[index + 1]
+ float_stream[index + 2] * float_stream[index + 2];
dot_products[dot_product_index + 1] = float_stream[index + 3] * float_stream[index + 3]
+ float_stream[index + 4] * float_stream[index + 4] + float_stream[index + 5] * float_stream[index + 5];
}
for (auto index = dot_product_index; index < varray.size(); ++index)
{
dot_products[index] = varray[index].l2_norm_sq();
}
}
Next up is my auto-vectorized loop, where I strongly suggest vectorization to the compiler via the corresponding OpenMP 4.0 directive. Compiled with the flags "-O3;-ffast-math;-march=native;-fopenmp".
void dot_product_auto(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
#pragma omp simd safelen(16)
for (auto index = std::size_t{}; index < varray.size(); ++index)
{
dot_products[index] = varray[index].l2_norm_sq();
}
}
Finally, here's my version, vectorized by hand using intrinsics. Compiled using "-O3;-ffast-math;-march=native;-mfma;-mavx2". (On this CPU, -march=native already implies -mavx2 and -mfma; I pass them explicitly for clarity.)
void dot_product(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
static constexpr auto inc = 6;
static constexpr auto dot_products_per_inc = inc / 3;
const auto stream_size_div = varray.size() * 3 / inc * inc;
const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
auto dot_product_index = std::size_t{};
static const auto load_mask = _mm256_setr_epi32(-1, -1, -1, -1, -1, -1, 0, 0);
static const auto permute_mask0 = _mm256_setr_epi32(0, 1, 2, 7, 3, 4, 5, 6);
static const auto permute_mask1 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 4, 0);
static const auto store_mask = _mm256_set_epi32(0, 0, 0, 0, 0, 0, -1, -1);
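// The masks at a glance:
//  load_mask     : load only the first 6 floats (two 3D vectors); lanes 6-7 become 0
//  permute_mask0 : rearrange to [x0 y0 z0 0 | x1 y1 z1 0], one zero-padded vector per 128-bit half
//  permute_mask1 : pull the two dot products (lanes 0 and 4 after the hadds) into lanes 0 and 1
//  store_mask    : store only the two lowest floats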
for (auto index = std::size_t{}; index < stream_size_div; index += inc, dot_product_index += dot_products_per_inc)
{
// 1. load and permute the vectors
const auto point_packed = _mm256_maskload_ps(float_stream + index, load_mask);
const auto point_permuted_packed = _mm256_permutevar8x32_ps(point_packed, permute_mask0);
// 2. do a multiply
const auto point_permuted_elementwise_sq_packed = _mm256_mul_ps(point_permuted_packed, point_permuted_packed);
// 3. do 2 horizontal additions
const auto hadd1 = _mm256_hadd_ps(point_permuted_elementwise_sq_packed, point_permuted_elementwise_sq_packed);
const auto hadd2 = _mm256_hadd_ps(hadd1, hadd1);
// 4. permute to target position
const auto result_packed = _mm256_permutevar8x32_ps(hadd2, permute_mask1);
// 5. store
_mm256_maskstore_ps(&dot_products[dot_product_index], store_mask, result_packed);
}
for (auto index = dot_product_index; index < varray.size(); ++index) // no opt for remainder loop
{
dot_products[index] = varray[index].l2_norm_sq();
}
}
I've tested the code, so I know it works.
Now, brief details about the microbenchmarking:
I'm modelling the dot product as follows: 5 * size FLOPs (3 multiplications + 2 additions per 3D vector) against 5 * size * sizeof(float) bytes transferred (3 loads and 1 store per vector, plus 1 extra load per result due to write-allocate), i.e. a code balance of 4 B/FLOP, or a computational intensity of 0.25 FLOP/B.
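The effective bandwidth figure then follows directly from this model; a minimal sketch of the conversion I use (the function name is illustrative):
double effective_bandwidth_GiBps(const std::size_t size, const double seconds)
{
    const auto bytes_transferred = 5.0 * static_cast<double>(size) * sizeof(float);
    return bytes_transferred / seconds / (1024.0 * 1024.0 * 1024.0);
}
Using this information, here are the performance results in terms of effective bandwidth: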
Questions:
Please excuse the lack of a minimal reproducer; the amount of code would be too much. Thanks a lot for your thoughts and input.
Your manually-vectorized code is not particularly efficient.
Try benchmarking the following two versions instead.
This one is simpler, and only requires the SSE 4.1 instruction set.
#include <immintrin.h>

inline __m128 loadFloat3( const float* rsi )
{
// Load x and y with a single 8-byte load
__m128 xy = _mm_castpd_ps( _mm_load_sd( (const double*)rsi ) );
// Compilers should merge following 2 lines into single INSERTPS with a memory operand
__m128 z = _mm_load_ss( rsi + 2 );
return _mm_insert_ps( xy, z, 0x20 );
}
// Simple version which uses DPPS instruction from SSE 4.1 set
void dotProductsSimple( float* rdi, size_t length, const float* rsi )
{
const float* const rsiEndMinusOne = rsi + ( (ptrdiff_t)length - 1 ) * 3;
const float* const rsiEnd = rsi + length * 3;
for( ; rsi < rsiEndMinusOne; rsi += 3, rdi++ )
{
// Load complete 16 byte vector, discard the W
__m128 v = _mm_loadu_ps( rsi );
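// imm8 = 0b01110001: the high nibble (0x7) multiplies lanes 0-2 and ignores W,
// the low nibble (0x1) writes the sum into lane 0 only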
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
if( rsi < rsiEnd )
{
// For the last vector, load exactly 12 bytes.
// Avoids potential crash when loading from out of bounds
__m128 v = loadFloat3( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
}
This one is more complicated, and requires AVX1 support. It will probably be slightly faster on most processors.
void dotProductTransposed( float* rdi, size_t length, const float* rsi )
{
constexpr size_t maskAlign8 = ~(size_t)7;
const float* const rsiEndAligned = rsi + ( length & maskAlign8 ) * 3;
const float* const rsiEndMinusOne = rsi + ( (ptrdiff_t)length - 1 ) * 3;
const float* const rsiEnd = rsi + length * 3;
while( rsi < rsiEndAligned )
{
// Load lower halves
__m256 m03, m14, m25;
m03 = _mm256_castps128_ps256( _mm_loadu_ps( rsi ) );
m14 = _mm256_castps128_ps256( _mm_loadu_ps( rsi + 4 ) );
m25 = _mm256_castps128_ps256( _mm_loadu_ps( rsi + 8 ) );
// Load upper halves; VINSERTF128 supports memory operand for the second argument.
m03 = _mm256_insertf128_ps( m03, _mm_loadu_ps( rsi + 12 ), 1 );
m14 = _mm256_insertf128_ps( m14, _mm_loadu_ps( rsi + 16 ), 1 );
m25 = _mm256_insertf128_ps( m25, _mm_loadu_ps( rsi + 20 ), 1 );
rsi += 24;
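// Lane contents at this point, for input vectors v0..v7:
//  m03 = { x0 y0 z0 x1 | x4 y4 z4 x5 }
//  m14 = { y1 z1 x2 y2 | y5 z5 x6 y6 }
//  m25 = { z2 x3 y3 z3 | z6 x7 y7 z7 }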
// Transpose these SIMD vectors
__m256 xy = _mm256_shuffle_ps( m14, m25, _MM_SHUFFLE( 2, 1, 3, 2 ) );
__m256 yz = _mm256_shuffle_ps( m03, m14, _MM_SHUFFLE( 1, 0, 2, 1 ) );
__m256 x = _mm256_shuffle_ps( m03, xy, _MM_SHUFFLE( 2, 0, 3, 0 ) );
__m256 y = _mm256_shuffle_ps( yz, xy, _MM_SHUFFLE( 3, 1, 2, 0 ) );
__m256 z = _mm256_shuffle_ps( yz, m25, _MM_SHUFFLE( 3, 0, 3, 1 ) );
// Now we have 3 SIMD vectors with gathered x/y/z fields of 8 source 3D vectors
// Compute squares
x = _mm256_mul_ps( x, x );
y = _mm256_mul_ps( y, y );
z = _mm256_mul_ps( z, z );
// Add squares
x = _mm256_add_ps( x, y );
x = _mm256_add_ps( x, z );
// Store 8 values
_mm256_storeu_ps( rdi, x );
rdi += 8;
}
// Handle the remainder
for( ; rsi < rsiEndMinusOne; rsi += 3, rdi++ )
{
__m128 v = _mm_loadu_ps( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
if( rsi < rsiEnd )
{
__m128 v = loadFloat3( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
}
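For completeness, here's a hypothetical way to call it on your data layout. This assumes your vector3<float> is tightly packed, i.e. exactly 12 bytes with no padding; the wrapper name is illustrative:
void computeDotProducts( const vector3_array<float>& varray, std::vector<float>& dot_products )
{
    dot_products.resize( varray.size() );
    const float* const src = reinterpret_cast<const float*>( varray.data() );
    dotProductTransposed( dot_products.data(), varray.size(), src );
}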