I'm new to the world of intrinsics, and I got here because I saw a way to achieve transparent code generation, i.e. what you see is what you get, as well as reproducibility: for a system supporting e.g. AVX2, I know I'll end up with the same instructions in the end, given that I use AVX2 intrinsics. This is an important step towards writing HPC libraries that make use of SIMD. Feel free to correct my way of thinking.
Now, I have implemented a 3D vector dot product function in three variants in a micro-benchmarking setting. The code has been compiled using the GNU compiler v11.1.0 and run on a machine with an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz and 32 GiB of DDR4 RAM. The single-threaded read-write memory bandwidth of this system has been measured at ~34 GiB/s by running a DAXPY benchmark.
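For reference, here is a minimal sketch of the kind of DAXPY kernel I mean by that (the timing harness is omitted; assume arrays much larger than the last-level cache, so the kernel streams from memory with 2 loads + 1 store, i.e. 24 bytes of traffic, per element):
void daxpy(double* __restrict y, const double* __restrict x, const double a, const std::size_t size)
{
    for (auto index = std::size_t{}; index < size; ++index)
    {
        y[index] += a * x[index]; // 2 loads + 1 store per element
    }
}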
First, let me present the elementary structures.
#include <cstddef>
#include <vector>
#include <immintrin.h>

template<class Treal_t>
struct vector3
{
    Treal_t data[3] = {};
    inline Treal_t& operator()(const std::size_t& index) { return data[index]; }
    inline const Treal_t& operator()(const std::size_t& index) const { return data[index]; }
    inline Treal_t l2_norm_sq() const { return data[0] * data[0] + data[1] * data[1] + data[2] * data[2]; }
};
// Strictly speaking, the following is a class of its own that implements a subset of
// std::vector's functionality. The motivation is to be able to allocate memory
// without "touching" the data, a requirement that is crucial for "cold"
// microbenchmarking (a sketch of the idea follows below the alias).
template<class Treal_t>
using vector3_array = std::vector<vector3<Treal_t>>;
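For illustration, the "no-touch" allocation can be approximated with an allocator that default-initializes elements instead of value-initializing them. This is a hypothetical sketch rather than my actual class, and note that it only leaves the memory untouched if the = {} member initializer is removed, so that vector3 stays trivially default-constructible:
#include <memory>
#include <utility>

template<class T, class A = std::allocator<T>>
struct default_init_allocator : A
{
    using A::A;
    template<class U>
    struct rebind
    {
        using other = default_init_allocator<U, typename std::allocator_traits<A>::template rebind_alloc<U>>;
    };
    // default-initialization: no zeroing happens for trivial types
    template<class U>
    void construct(U* ptr)
    {
        ::new (static_cast<void*>(ptr)) U;
    }
    // forward all other construction to the base allocator
    template<class U, class... Args>
    void construct(U* ptr, Args&&... args)
    {
        std::allocator_traits<A>::construct(static_cast<A&>(*this), ptr, std::forward<Args>(args)...);
    }
};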
First up is my scalar code, which I compile with the flag "-O0".
void dot_product_novec(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
static constexpr auto inc = 6;
static constexpr auto dot_products_per_inc = inc / 3;
const auto stream_size_div = varray.size() * 3 / inc * inc;
const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
auto dot_product_index = std::size_t{};
for (auto index = std::size_t{}; index < stream_size_div; index += inc, dot_product_index += dot_products_per_inc)
{
dot_products[dot_product_index] = float_stream[index] * float_stream[index] + float_stream[index + 1] * float_stream[index + 1]
+ float_stream[index + 2] * float_stream[index + 2];
dot_products[dot_product_index + 1] = float_stream[index + 3] * float_stream[index + 3]
+ float_stream[index + 4] * float_stream[index + 4] + float_stream[index + 5] * float_stream[index + 5];
}
for (auto index = dot_product_index; index < varray.size(); ++index)
{
dot_products[index] = varray[index].l2_norm_sq();
}
}
Next up is my auto-vectorized loop, where I strongly suggest vectorization to the compiler via the corresponding OpenMP 4.0 directive. Compiled with the flags "-O3;-ffast-math;-march=native;-fopenmp".
void dot_product_auto(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
#pragma omp simd safelen(16)
for (auto index = std::size_t{}; index < varray.size(); ++index)
{
dot_products[index] = varray[index].l2_norm_sq();
}
}
Finally, here's my version, vectorized by hand using intrinsics. Compiled using "-O3;-ffast-math;-march=native;-mfma;-mavx2". (On this CPU, -march=native already implies -mavx2 and -mfma; I pass them explicitly for clarity.)
void dot_product(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
static constexpr auto inc = 6;
static constexpr auto dot_products_per_inc = inc / 3;
const auto stream_size_div = varray.size() * 3 / inc * inc;
const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
auto dot_product_index = std::size_t{};
static const auto load_mask = _mm256_setr_epi32(-1, -1, -1, -1, -1, -1, 0, 0);
static const auto permute_mask0 = _mm256_setr_epi32(0, 1, 2, 7, 3, 4, 5, 6);
static const auto permute_mask1 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 4, 0);
static const auto store_mask = _mm256_set_epi32(0, 0, 0, 0, 0, 0, -1, -1);
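// The masks at a glance:
//  load_mask     : load only the first 6 floats (two 3D vectors); lanes 6-7 become 0
//  permute_mask0 : rearrange to [x0 y0 z0 0 | x1 y1 z1 0], one zero-padded vector per 128-bit half
//  permute_mask1 : pull the two dot products (lanes 0 and 4 after the hadds) into lanes 0 and 1
//  store_mask    : store only the two lowest floats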
for (auto index = std::size_t{}; index < stream_size_div; index += inc, dot_product_index += dot_products_per_inc)
{
// 1. load and permute the vectors
const auto point_packed = _mm256_maskload_ps(float_stream + index, load_mask);
const auto point_permuted_packed = _mm256_permutevar8x32_ps(point_packed, permute_mask0);
// 2. do a multiply
const auto point_permuted_elementwise_sq_packed = _mm256_mul_ps(point_permuted_packed, point_permuted_packed);
// 3. do 2 horizontal additions
const auto hadd1 = _mm256_hadd_ps(point_permuted_elementwise_sq_packed, point_permuted_elementwise_sq_packed);
const auto hadd2 = _mm256_hadd_ps(hadd1, hadd1);
// 4. permute to target position
const auto result_packed = _mm256_permutevar8x32_ps(hadd2, permute_mask1);
// 5. store
_mm256_maskstore_ps(&dot_products[dot_product_index], store_mask, result_packed);
}
for (auto index = dot_product_index; index < varray.size(); ++index) // no opt for remainder loop
{
dot_products[index] = varray[index].l2_norm_sq();
}
}
I've tested the code, so I know it works.
Now, brief details about the microbenchmarking:
I'm modelling the dot product as follows: 5 * size FLOPs (3 multiplications + 2 additions per 3D vector) against 5 * size * sizeof(float) bytes transferred (3 loads and 1 store per vector, plus 1 extra load per result due to write-allocate), i.e. a code balance of 4 B/FLOP, or a computational intensity of 0.25 FLOP/B.
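The effective bandwidth figure then follows directly from this model; a minimal sketch of the conversion I use (the function name is illustrative):
double effective_bandwidth_GiBps(const std::size_t size, const double seconds)
{
    const auto bytes_transferred = 5.0 * static_cast<double>(size) * sizeof(float);
    return bytes_transferred / seconds / (1024.0 * 1024.0 * 1024.0);
}
Using this information, here are the performance results in terms of effective bandwidth: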
Questions:
Please excuse the lack of a minimal reproducer; the amount of code would be too much. Thanks a lot for your thoughts and input.
Your manually-vectorized code is not particularly efficient.
Try benchmarking the following two versions instead.
This one is simpler, and only requires the SSE 4.1 instruction set.
#include <immintrin.h>

inline __m128 loadFloat3( const float* rsi )
{
// Load x and y with a single 8-byte load
__m128 xy = _mm_castpd_ps( _mm_load_sd( (const double*)rsi ) );
// Compilers should merge following 2 lines into single INSERTPS with a memory operand
__m128 z = _mm_load_ss( rsi + 2 );
return _mm_insert_ps( xy, z, 0x20 );
}
// Simple version which uses DPPS instruction from SSE 4.1 set
void dotProductsSimple( float* rdi, size_t length, const float* rsi )
{
const float* const rsiEndMinusOne = rsi + ( (ptrdiff_t)length - 1 ) * 3;
const float* const rsiEnd = rsi + length * 3;
for( ; rsi < rsiEndMinusOne; rsi += 3, rdi++ )
{
// Load complete 16 byte vector, discard the W
__m128 v = _mm_loadu_ps( rsi );
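// imm8 = 0b01110001: the high nibble (0x7) multiplies lanes 0-2 and ignores W,
// the low nibble (0x1) writes the sum into lane 0 only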
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
if( rsi < rsiEnd )
{
// For the last vector, load exactly 12 bytes.
// Avoids potential crash when loading from out of bounds
__m128 v = loadFloat3( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
}
This one is more complicated, and requires AVX1 support. It will probably be slightly faster on most processors.
void dotProductTransposed( float* rdi, size_t length, const float* rsi )
{
constexpr size_t maskAlign8 = ~(size_t)7;
const float* const rsiEndAligned = rsi + ( length & maskAlign8 ) * 3;
const float* const rsiEndMinusOne = rsi + ( (ptrdiff_t)length - 1 ) * 3;
const float* const rsiEnd = rsi + length * 3;
while( rsi < rsiEndAligned )
{
// Load lower halves
__m256 m03, m14, m25;
m03 = _mm256_castps128_ps256( _mm_loadu_ps( rsi ) );
m14 = _mm256_castps128_ps256( _mm_loadu_ps( rsi + 4 ) );
m25 = _mm256_castps128_ps256( _mm_loadu_ps( rsi + 8 ) );
// Load upper halves; VINSERTF128 supports memory operand for the second argument.
m03 = _mm256_insertf128_ps( m03, _mm_loadu_ps( rsi + 12 ), 1 );
m14 = _mm256_insertf128_ps( m14, _mm_loadu_ps( rsi + 16 ), 1 );
m25 = _mm256_insertf128_ps( m25, _mm_loadu_ps( rsi + 20 ), 1 );
rsi += 24;
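// Lane contents at this point, for input vectors v0..v7:
//  m03 = { x0 y0 z0 x1 | x4 y4 z4 x5 }
//  m14 = { y1 z1 x2 y2 | y5 z5 x6 y6 }
//  m25 = { z2 x3 y3 z3 | z6 x7 y7 z7 }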
// Transpose these SIMD vectors
__m256 xy = _mm256_shuffle_ps( m14, m25, _MM_SHUFFLE( 2, 1, 3, 2 ) );
__m256 yz = _mm256_shuffle_ps( m03, m14, _MM_SHUFFLE( 1, 0, 2, 1 ) );
__m256 x = _mm256_shuffle_ps( m03, xy, _MM_SHUFFLE( 2, 0, 3, 0 ) );
__m256 y = _mm256_shuffle_ps( yz, xy, _MM_SHUFFLE( 3, 1, 2, 0 ) );
__m256 z = _mm256_shuffle_ps( yz, m25, _MM_SHUFFLE( 3, 0, 3, 1 ) );
// Now we have 3 SIMD vectors with gathered x/y/z fields of 8 source 3D vectors
// Compute squares
x = _mm256_mul_ps( x, x );
y = _mm256_mul_ps( y, y );
z = _mm256_mul_ps( z, z );
// Add squares
x = _mm256_add_ps( x, y );
x = _mm256_add_ps( x, z );
// Store 8 values
_mm256_storeu_ps( rdi, x );
rdi += 8;
}
// Handle the remainder
for( ; rsi < rsiEndMinusOne; rsi += 3, rdi++ )
{
__m128 v = _mm_loadu_ps( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
if( rsi < rsiEnd )
{
__m128 v = loadFloat3( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
}
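For completeness, here's a hypothetical way to call it on your data layout. This assumes your vector3<float> is tightly packed, i.e. exactly 12 bytes with no padding; the wrapper name is illustrative:
void computeDotProducts( const vector3_array<float>& varray, std::vector<float>& dot_products )
{
    dot_products.resize( varray.size() );
    const float* const src = reinterpret_cast<const float*>( varray.data() );
    dotProductTransposed( dot_products.data(), varray.size(), src );
}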