c++simdavx512

How to load uint8_t "as" 32 bits integer efficiently into a SIMD register?


I have an array of 8 bit integers that I want to process through SIMD instructions. Since those integers will be used along single precision floating point numbers, I actually want to load them in 32 bit lanes instead of the more "natural" 8 bit lanes.

Assuming AVX512, if I have the following array:

std::array< std::uint8_t, 16 > i{ i0, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12, i13, i14, i15 };

I wish to end up with a __m512i register filled with the following bytes:

[ 0, 0, 0, i0,
    0, 0, 0, i2,
    0, 0, 0, i3,
    0, 0, 0, i4,
    0, 0, 0, i5,
    0, 0, 0, i6,
    0, 0, 0, i7,
    0, 0, 0, i8,
    0, 0, 0, i9,
    0, 0, 0, i10,
    0, 0, 0, i11,
    0, 0, 0, i12,
    0, 0, 0, i13,
    0, 0, 0, i14,
    0, 0, 0, i15 ]

What is the best way to achieve that? I currently handroll it using:

_mm512_set_epi32(
    a[0], a[1], a[2], a[3],
    a[4], a[5], a[6], a[7],
    a[8], a[9], a[10], a[11],
    a[12], a[13], a[14], a[15]);

Note: I used AVX512 as an example, ideally I would like a "generic" strategy that can be abstracted on several instruction sets using e.g. Google Highway.


Solution

  • It is possible to do this using Google Highway.

    #include <hwy/highway.h>
    
    namespace hn = hwy::HWY_NAMESPACE;
    
    namespace ns::HWY_NAMESPACE {
    
    template< typename FD, std::integral I >
        requires(std::floating_point< hn::TFromD< FD > > &&
            sizeof(I) <= sizeof(hn::TFromD< FD >))
    auto loadAsFp(FD fd, const std::span< const I >& data)
    {
        using FP = hn::TFromD< FD >; // The target floating point type.
        // A tag to select the proper vector type to load:
        // It will have the same number of lanes as the vector of FP modeled by the tag FD.
        // If it is not a full vector, highway will emulate as best as it can.
        using ID = hn::Rebind< I, FD >;
        auto ld = hn::LoadU(ID{}, data.data());
        if constexpr (sizeof(I) < sizeof(FP))
        {
            // If the integer type is strictly smaller than the target floating type, we must first do a promotion.
            // Note we target a signed type regardless:
            // Highway will be smart enough to figure out if it can ZeroExtend instead of SignExtend.
            using PromotedD = hn::Rebind< hwy::MakeSigned< hn::TFromD< FD > >, FD >;
            return hn::ConvertTo(fd, hn::PromoteTo(PromotedD{}, ld));
    
        }
        else
           return hn::ConvertTo(fd, ld);
    }
    
    } // ns::HWY_NAMESPACE
    

    Then you get the plumbing and compile this for the desired targets. Assuming we compiled for AVX512, this could be used as follows:

    const std::vector< std::uint8_t > is{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
    
    auto vec = ns::N_AVX3::loadAsFp(hwy::N_AVX3::ScalableTag< float >{}, std::span{is});
    

    For AVX512, this is equivalent to:

    auto ld = _mm512_cvtepi32_ps(
        _mm512_cvtepu8_epi32(
            _mm_loadu_epi8(is.data())));