Given a series of pairs of int16_t. The first item in each pair is the left sound channel sample, the second is the right. I want to make them mono: mono = (left + right) / 2, and I don't want to lose even the least significant bit.
The following program does what I want (I am pretty sure):
#include <type_traits>
#include <cstdint>
#include <fmt/format.h>
#include <fmt/ranges.h>
#include <x86intrin.h>
int main()
{
    constexpr auto step = sizeof(__m128i) / sizeof(uint16_t);
    alignas(__m128i) uint16_t input[4 * step];
    uint16_t i = 0;
    for (uint16_t & x : input) {
        x = 1 + 2 * i++;
    }
    alignas(__m256i) uint16_t result[std::extent_v<decltype(input)> / 2];
    for (size_t i = 0; i < std::extent_v<decltype(input)>; i += 4 * step) {
        __m256i vec0 = _mm256_cvtepi16_epi32(_mm_load_si128((const __m128i *)(input + i + 0 * step)));
        __m256i vec1 = _mm256_cvtepi16_epi32(_mm_load_si128((const __m128i *)(input + i + 1 * step)));
        __m256i sum01 = _mm256_hadd_epi32(vec0, vec1);
        __m256i mean01 = _mm256_srai_epi32(_mm256_permute4x64_epi64(sum01, _MM_SHUFFLE(3, 1, 2, 0)), 1);
        __m256i vec2 = _mm256_cvtepi16_epi32(_mm_load_si128((const __m128i *)(input + i + 2 * step)));
        __m256i vec3 = _mm256_cvtepi16_epi32(_mm_load_si128((const __m128i *)(input + i + 3 * step)));
        __m256i sum23 = _mm256_hadd_epi32(vec2, vec3);
        __m256i mean23 = _mm256_srai_epi32(_mm256_permute4x64_epi64(sum23, _MM_SHUFFLE(3, 1, 2, 0)), 1);
        _mm256_store_si256((__m256i *)(result + i / 2), _mm256_permute4x64_epi64(_mm256_packs_epi32(mean01, mean23), _MM_SHUFFLE(3, 1, 2, 0)));
    }
    fmt::println("{}", fmt::join(result, ", "));
}
But the code generated by clang from trunk (with -mavx2) seems overloaded with movs: https://godbolt.org/z/cc9v1846n
Is that normal, and does it affect performance noticeably? How much of a performance improvement can I expect if I rewrite it into, e.g., inline assembly with manual register management?
First of all, you need to compile with optimization enabled, otherwise the compiler-generated asm is a total disaster, especially with intrinsics: they're inline wrapper functions for builtins, so their args and return-value variables only get optimized away when optimization is enabled, even after forced inlining.
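For example (the source file name is just a placeholder, and I'm assuming {fmt} is available as a linkable library):

    clang++ -O2 -mavx2 stereo_to_mono.cpp -lfmt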
You could use pmaddwd (_mm256_madd_epi16) with a constant multiplier of set1_epi16(1) to get 32-bit sums of horizontal pairs with a single uop, instead of with 2 converts and a 3-uop hadd instruction (2 shuffles plus a vertical add uop: https://uops.info/).
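A minimal sketch of that step, reusing the input array from the question (variable names are mine; i is assumed to be a multiple of 16 elements so a whole __m256i of pairs is loaded):

    __m256i v = _mm256_load_si256((const __m256i *)(input + i));  // 8 interleaved L,R pairs
    __m256i sums = _mm256_madd_epi16(v, _mm256_set1_epi16(1));    // 1*L + 1*R per pair, widened to 32-bit, in order
    __m256i means = _mm256_srai_epi32(sums, 1);                   // same arithmetic >>1 as in the question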
That gives you the __m256i sum01 variable from your version (from one 256-bit load and _mm256_madd_epi16(v, _mm256_set1_epi16(1))), except with the elements in order, instead of the in-lane behaviour of 256-bit hadd. Packing it back down to 16-bit elements after shifting still can't be just a vpackssdw, though, because vpackssdw itself packs within 128-bit lanes.
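Putting that together, a sketch of how the question's inner loop could look with pmaddwd (untested; the final vpermq constant is the same lane fixup the original store already uses):

    const __m256i ones = _mm256_set1_epi16(1);
    for (size_t i = 0; i < std::extent_v<decltype(input)>; i += 4 * step) {
        __m256i v01 = _mm256_load_si256((const __m256i *)(input + i + 0 * step));
        __m256i v23 = _mm256_load_si256((const __m256i *)(input + i + 2 * step));
        __m256i mean01 = _mm256_srai_epi32(_mm256_madd_epi16(v01, ones), 1);  // means 0..7, in order
        __m256i mean23 = _mm256_srai_epi32(_mm256_madd_epi16(v23, ones), 1);  // means 8..15, in order
        // vpackssdw packs within 128-bit lanes, producing 64-bit chunks in the order
        // 0-3, 8-11, 4-7, 12-15; the vpermq puts them back in linear order.
        __m256i packed = _mm256_packs_epi32(mean01, mean23);
        _mm256_store_si256((__m256i *)(result + i / 2),
                           _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0)));
    }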
Another alternative: pavgw works vertically, but you can probably build its 2 inputs with less work than the widening and shuffling this approach requires. However, _mm256_avg_epu16 works on unsigned 16-bit integers and you need signed. You could range-shift to unsigned by XORing with 0x8000 (i.e. subtracting INT16_MIN), then do the same thing to the unsigned average to shift it back. Note that pavgw does (x + y + 1) >> 1, making it more like round-to-nearest instead of truncation in the division by 2.
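A sketch of that idea, again as a replacement for the question's inner loop (untested; the mask/pack choice at the end is just one way of getting the averaged elements back down to 16 bits):

    const __m256i bias = _mm256_set1_epi16((int16_t)0x8000);
    const __m256i lo16 = _mm256_set1_epi32(0x0000FFFF);
    for (size_t i = 0; i < std::extent_v<decltype(input)>; i += 4 * step) {
        __m256i a = _mm256_load_si256((const __m256i *)(input + i + 0 * step));
        __m256i b = _mm256_load_si256((const __m256i *)(input + i + 2 * step));
        a = _mm256_xor_si256(a, bias);  // range-shift signed -> unsigned
        b = _mm256_xor_si256(b, bias);
        // Shifting each 32-bit L,R pair right by 16 moves R down into L's slot, so the
        // even 16-bit elements of the pavgw result are avg(L, R); the odd ones are garbage.
        __m256i avg_a = _mm256_avg_epu16(a, _mm256_srli_epi32(a, 16));
        __m256i avg_b = _mm256_avg_epu16(b, _mm256_srli_epi32(b, 16));
        // Keep only the even 16-bit element of each 32-bit lane, pack 32 -> 16 (the values fit,
        // so vpackusdw's unsigned saturation never kicks in), then fix the in-lane pack order.
        __m256i packed = _mm256_packus_epi32(_mm256_and_si256(avg_a, lo16),
                                             _mm256_and_si256(avg_b, lo16));
        packed = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
        packed = _mm256_xor_si256(packed, bias);  // range-shift back to signed
        _mm256_store_si256((__m256i *)(result + i / 2), packed);
        // Note: this computes (L + R + 1) >> 1 (pavgw's rounding), not the question's (L + R) >> 1.
    }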
Depending on what you need / want, I'm not sure which of vpmaddwd or vpavgw would end up being more efficient; the trick would be in optimizing the lane-crossing shuffles before and/or after.