c++, performance, assembly, x86, pcm

Efficiently calculate horizontal pair averages in a stream of int16s


Given a stream of int16_t pairs, where the first item in each pair is the left sound-channel sample and the second is the right, I want to convert them to mono: mono = (left + right) / 2, without losing even the least significant bit.
The following program does what I want (I am pretty sure):

#include <type_traits>
#include <cstdint>
#include <cstddef>

#include <fmt/format.h>
#include <fmt/ranges.h>

#include <x86intrin.h>

int main()
{
    constexpr auto step = sizeof(__m128i) / sizeof(uint16_t);
    alignas(__m128i) uint16_t input[4 * step];
    uint16_t i = 0;
    for (uint16_t & x : input) {
        x = 1 + 2 * i++;
    }
    alignas(__m256i) uint16_t result[std::extent_v<decltype(input)> / 2];
    for (size_t i = 0; i < std::extent_v<decltype(input)>; i += 4 * step) {
        // Sign-extend two groups of 8 int16 samples to 32-bit lanes.
        __m256i vec0 = _mm256_cvtepi16_epi32(_mm_load_si128((const __m128i *)(input + i + 0 * step)));
        __m256i vec1 = _mm256_cvtepi16_epi32(_mm_load_si128((const __m128i *)(input + i + 1 * step)));
        // Horizontally add adjacent pairs, restore element order across lanes, then halve.
        __m256i sum01 = _mm256_hadd_epi32(vec0, vec1);
        __m256i mean01 = _mm256_srai_epi32(_mm256_permute4x64_epi64(sum01, _MM_SHUFFLE(3, 1, 2, 0)), 1);

        __m256i vec2 = _mm256_cvtepi16_epi32(_mm_load_si128((const __m128i *)(input + i + 2 * step)));
        __m256i vec3 = _mm256_cvtepi16_epi32(_mm_load_si128((const __m128i *)(input + i + 3 * step)));
        __m256i sum23 = _mm256_hadd_epi32(vec2, vec3);
        __m256i mean23 = _mm256_srai_epi32(_mm256_permute4x64_epi64(sum23, _MM_SHUFFLE(3, 1, 2, 0)), 1);

        // Pack the 16 means back down to int16 and fix up the cross-lane order.
        _mm256_store_si256((__m256i *)(result + i / 2), _mm256_permute4x64_epi64(_mm256_packs_epi32(mean01, mean23), _MM_SHUFFLE(3, 1, 2, 0)));
    }
    fmt::println("{}", fmt::join(result, ", "));
}

But the code generated by clang trunk (with -mavx2) seems overloaded with movs: https://godbolt.org/z/cc9v1846n

Is this normal, and does it affect performance notably? How much performance improvement could I expect if I rewrote it in, e.g., inline assembly with manual register management?


Solution

  • First of all, you need to compile with optimization enabled; otherwise the compiler-generated asm is a total disaster, especially with intrinsics, which are inline wrapper functions for builtins and need optimization to have their argument and return-value variables optimized away, even after force_inline.
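
    For example, something like this (just an illustration; mono.cpp stands in for your source file, and -O2 or -O3 is the point; on Godbolt you'd put the flags in the compiler-options box):

        clang++ -O2 -mavx2 mono.cpp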


    You could use pmaddwd (_mm256_madd_epi16) with a constant multiplier of set1_epi16(1) to get 32-bit sums of horizontal pairs with a single uop, instead of 2 converts and a 3-uop hadd instruction (2 shuffles plus a vertical add uop: https://uops.info/).

    That gives you the __m256i sum01 variable from your version (from one 256-bit load and _mm256_madd_epi16(v, _mm256_set1_epi16(1))), except with the elements in order, instead of the in-lane behaviour of 256-bit hadd. So packing back down to 16-bit elements after shifting can't just use vpackssdw; you still need a lane-crossing shuffle.
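
    A minimal sketch of that idea (my own illustrative helper, not the question's code: the function name, the block size of 16 pairs per call, and the aligned-pointer assumption are all just for the example):

        #include <x86intrin.h>
        #include <cstdint>

        // Turn 16 stereo pairs (32 int16_t) into 16 mono samples, halving like srai does.
        static inline void mono_from_pairs_madd(const int16_t *in, int16_t *out)
        {
            const __m256i ones = _mm256_set1_epi16(1);

            __m256i v0 = _mm256_load_si256((const __m256i *)(in + 0));   // pairs 0..7
            __m256i v1 = _mm256_load_si256((const __m256i *)(in + 16));  // pairs 8..15

            // vpmaddwd with all-ones: 32-bit sums of horizontal pairs, already in order.
            __m256i sum0 = _mm256_madd_epi16(v0, ones);
            __m256i sum1 = _mm256_madd_epi16(v1, ones);

            // Arithmetic shift right by 1, the same halving the question's srai does.
            __m256i mean0 = _mm256_srai_epi32(sum0, 1);
            __m256i mean1 = _mm256_srai_epi32(sum1, 1);

            // vpackssdw is in-lane, so one vpermq fixes the qword order afterwards.
            __m256i packed = _mm256_packs_epi32(mean0, mean1);
            packed = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));

            _mm256_store_si256((__m256i *)out, packed);
        }

    The pack still needs one lane-crossing vpermq at the end, but the two 3-uop hadds and the two vpermq fixups after them go away.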


    Another alternative:
    pavgw works vertically, but you can probably build its 2 inputs with less work than this widen-and-shuffle approach requires. However, _mm256_avg_epu16 works on unsigned 16-bit integers and you need signed. You can range-shift to unsigned by XORing with 0x8000 (i.e. subtracting INT16_MIN), and then do the same thing to the unsigned average to shift it back.

    pavgw computes (x + y + 1) >> 1, so it rounds to nearest (halves rounding up) instead of truncating the division by 2.
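
    A sketch of just the range-shift part, assuming the left and right samples have already been deinterleaved into two vectors (that deinterleave is the shuffle work still to be optimized, and the function name is only illustrative):

        #include <x86intrin.h>
        #include <cstdint>

        // Rounded average of signed 16-bit elements using the unsigned vpavgw.
        static inline __m256i avg_epi16_rounded(__m256i left, __m256i right)
        {
            const __m256i sign = _mm256_set1_epi16(INT16_MIN);  // 0x8000 in every element

            // XOR with 0x8000 maps the signed range [-32768, 32767] onto unsigned [0, 65535].
            __m256i ul = _mm256_xor_si256(left, sign);
            __m256i ur = _mm256_xor_si256(right, sign);

            // vpavgw: unsigned (x + y + 1) >> 1.
            __m256i uavg = _mm256_avg_epu16(ul, ur);

            // XOR again to shift the unsigned average back into the signed range.
            return _mm256_xor_si256(uavg, sign);
        }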


    Depending on what you need/want, I'm not sure which of vpmaddwd or vpavgw would end up being more efficient; the trick is optimizing the lane-crossing shuffles before and/or after.