I have an array of 8-bit integers that I want to process with SIMD instructions. Since those integers will be used alongside single-precision floating-point numbers, I actually want to load them into 32-bit lanes instead of the more "natural" 8-bit lanes.
Assuming AVX512, if I have the following array:
std::array< std::uint8_t, 16 > i{ i0, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12, i13, i14, i15 };
I wish to end up with a __m512i
register filled with the following bytes:
[ 0, 0, 0, i0,
0, 0, 0, i1,
0, 0, 0, i2,
0, 0, 0, i3,
0, 0, 0, i4,
0, 0, 0, i5,
0, 0, 0, i6,
0, 0, 0, i7,
0, 0, 0, i8,
0, 0, 0, i9,
0, 0, 0, i10,
0, 0, 0, i11,
0, 0, 0, i12,
0, 0, 0, i13,
0, 0, 0, i14,
0, 0, 0, i15 ]
What is the best way to achieve that? I currently hand-roll it using the following (note that _mm512_set_epi32 takes its arguments from the highest lane down to the lowest, so i[0] goes last):
_mm512_set_epi32(
i[15], i[14], i[13], i[12],
i[11], i[10], i[9], i[8],
i[7], i[6], i[5], i[4],
i[3], i[2], i[1], i[0]);
Note: I used AVX512 as an example, ideally I would like a "generic" strategy that can be abstracted on several instruction sets using e.g. Google Highway.
It is possible to do this using Google Highway.
#include <concepts>
#include <span>

#include <hwy/highway.h>
namespace hn = hwy::HWY_NAMESPACE;
namespace ns::HWY_NAMESPACE {
template< typename FD, std::integral I >
requires(std::floating_point< hn::TFromD< FD > > &&
sizeof(I) <= sizeof(hn::TFromD< FD >))
auto loadAsFp(FD fd, std::span< const I > data)
{
using FP = hn::TFromD< FD >; // The target floating point type.
// A tag to select the proper vector type to load:
// It will have the same number of lanes as the vector of FP modeled by the tag FD.
// If it is not a full vector, highway will emulate as best as it can.
using ID = hn::Rebind< I, FD >;
auto ld = hn::LoadU(ID{}, data.data());
if constexpr (sizeof(I) < sizeof(FP))
{
// If the integer type is strictly smaller than the target floating type, we must first do a promotion.
// Note we target a signed type regardless:
// Highway will be smart enough to figure out if it can ZeroExtend instead of SignExtend.
using PromotedD = hn::Rebind< hwy::MakeSigned< hn::TFromD< FD > >, FD >;
return hn::ConvertTo(fd, hn::PromoteTo(PromotedD{}, ld));
}
else
return hn::ConvertTo(fd, ld);
}
} // ns::HWY_NAMESPACE
Then you add the usual Highway plumbing and compile this for the desired targets. Assuming we compiled for AVX512, it could be used as follows:
const std::vector< std::uint8_t > is{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
auto vec = ns::N_AVX3::loadAsFp(hwy::N_AVX3::ScalableTag< float >{}, std::span{is});
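The "plumbing" is Highway's per-target compilation scaffold; a minimal sketch of it follows (the macro and header names are Highway's documented foreach_target mechanism; the file name loadasfp.cc and the ns namespace are just the ones used in this answer). This is a compilation-structure fragment, not a standalone program:

```cpp
// loadasfp.cc
#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "loadasfp.cc"  // this file is re-included once per target
#include <hwy/foreach_target.h>           // must come before highway.h
#include <hwy/highway.h>

HWY_BEFORE_NAMESPACE();
namespace ns::HWY_NAMESPACE {
// ... loadAsFp from above ...
}  // namespace ns::HWY_NAMESPACE
HWY_AFTER_NAMESPACE();
```

With this in place, each enabled target gets its own copy of loadAsFp in its own namespace (ns::N_AVX3, ns::N_AVX2, ...), which is what makes the ns::N_AVX3::loadAsFp call above resolve; for runtime selection you would instead wrap it with HWY_EXPORT and call it through HWY_DYNAMIC_DISPATCH.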
For AVX512, this is equivalent to:
auto ld = _mm512_cvtepi32_ps(
    _mm512_cvtepu8_epi32(
        _mm_loadu_si128(reinterpret_cast< const __m128i* >(is.data()))));