c++ intrinsics low-level avx512

Multiply vectors of 32-bit integers, taking only the high 32 bits


I want to multiply two 512-bit __m512i vectors of 16 unsigned 32-bit integers together, and keep only the high 32 bits of each 64-bit multiplication result. Although the Intel intrinsics guide says _mm512_mulhi_epu32 exists, it wouldn't compile on my machine.

The answer here claims that _mm512_srli_epi64(_mm512_mul_epu32(a,b),32) would work, but it doesn't: the problem seems to be that _mm512_mul_epu32 only reads bits 0...31, 64...95, etc., ignoring the values at the odd 32-bit positions.

What is the quickest way to take the high 32 bits from the results of a 32-bit vector multiplication?


Solution

  • vpmuludq aka _mm512_mul_epu32 takes the even source 32-bit elements (0, 2, 4, etc.)¹. This lets it execute efficiently, feeding the low 32 bits of each 64-bit chunk of the inputs to the FP mantissa multipliers. It's a widening aka full multiply, not a high-half multiply, so of course it has to ignore some of the input (because no SIMD math instructions have two vector destinations).

    So you need to use it twice to get all the high-half results you want: once with the even elements, and once with the odd elements shifted down to even positions (right-shift both input vectors by 32). Then you need to interleave the high halves of those 64-bit results.

    The trick is doing that efficiently: AVX-512 vpermt2d, which picks 32-bit elements from 2 source vectors, gets the job done in a single uop. So that's great, especially in a loop that lets the compiler hoist the load of the shuffle-control vector constant.

    The other options include _mm512_mask_shuffle_epi32 (vpshufd with merge-masking) to copy the high halves down within one vector and merge in the other vector of results, given a merge-control in a k register. (One of the vpmuludq results has the high halves where you want them, because its inputs were right-shifted.) vmovshdup (_mm512_mask_movehdup_ps) does the same shuffle in 1 less byte of machine code, with no immediate needed. It's inconvenient with intrinsics because you'd need to cast __m512i to __m512 with _mm512_castsi512_ps, but it should have the same performance (a sketch appears near the end of this answer).

    Or even store twice, with masking for the 2nd store, but that's probably bad because one of the stores must be misaligned (and thus cache-line-crossing for 64-byte stores). Still, it does avoid any more ALU uops (also sketched near the end of this answer).

    The more "obvious" option (like you'd do with AVX2) would be to vpsrlq (_mm512_srli_epi64(v,32)) one of them, then vpblendd. But that costs 2 separate ALU uops, and using 512-bit vectors on current CPUs means there are only 2 vector-ALU execution ports that can handle them. Also, vpblendd has no AVX-512 version; there are only blends that take the control operand in a k register. (Using shift / AND / OR to merge would be even worse, and would still need a vector constant.) A sketch of the AVX2 version is at the end of this answer.

    #include <immintrin.h>

    __m512i mulhi_epu32_512(__m512i a, __m512i b)
    {
        __m512i evens = _mm512_mul_epu32(a,b);   // products of even elements; high halves land at odd 32-bit positions
        __m512i odds = _mm512_mul_epu32(_mm512_srli_epi64(a,32), _mm512_srli_epi64(b,32));  // products of odd elements; high halves already at the odd positions where they belong
        // Merge-masked vpshufd: for the even positions (mask 0x5555), copy down
        // evens' high halves; keep odds everywhere else.
        return _mm512_mask_shuffle_epi32(odds, 0x5555, evens, _MM_SHUFFLE(3,3,1,1));

        // _mm512_mask_movehdup_ps may be slightly more efficient, saving 1 byte of code size
        // (a sketch appears near the end of this answer).
    }
    

    For a stand-alone function, clang optimizes that merge-masked shuffle into a vpermi2d with a vector constant from memory, instead of mov eax, 0x5555 / kmovw k1, eax or whatever. That's fewer uops when the setup is included, but it could cache-miss on the constant. GCC compiles it as written. https://godbolt.org/z/v4M7PK shows both. For a loop body (with the setup hoisted), either way is a single uop, but merge-masked vpshufd has only 1 cycle of latency, vs. 3 for the lane-crossing vpermi2d / vpermt2d (https://uops.info/ and https://agner.org/optimize/).
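
    As mentioned above, the vmovshdup option does the same merge with 1 less byte of machine code, at the cost of some casting noise in the source. A minimal sketch (the function name is mine; it's the same algorithm with the FP-domain shuffle swapped in):

    #include <immintrin.h>

    __m512i mulhi_epu32_512_movshdup(__m512i a, __m512i b)
    {
        __m512i evens = _mm512_mul_epu32(a,b);
        __m512i odds  = _mm512_mul_epu32(_mm512_srli_epi64(a,32), _mm512_srli_epi64(b,32));
        // vmovshdup with merge-masking: for the even positions (mask 0x5555),
        // copy down the high halves of the even-element products; keep the
        // odd-element products' high halves (already in place) from odds.
        __m512 merged = _mm512_mask_movehdup_ps(_mm512_castsi512_ps(odds), 0x5555,
                                                _mm512_castsi512_ps(evens));
        return _mm512_castps_si512(merged);
    }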

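    A sketch of the store-twice idea, assuming the results are headed to memory anyway and dst points to a normal writable buffer of at least 16 uint32_t (the function name is mine; this relies on AVX-512 masked stores not faulting on masked-off elements, and it forms a pointer 4 bytes below dst, which is the usual caveat with this kind of trick):

    #include <immintrin.h>
    #include <stdint.h>

    void mulhi_epu32_512_store(uint32_t *dst, __m512i a, __m512i b)
    {
        __m512i evens = _mm512_mul_epu32(a,b);
        __m512i odds  = _mm512_mul_epu32(_mm512_srli_epi64(a,32), _mm512_srli_epi64(b,32));
        _mm512_storeu_si512(dst, odds);   // odd-element results land at dst[1], dst[3], ...
        // The 4-byte-lower masked store (misaligned) must come 2nd so it overwrites
        // the even positions: it drops evens' high halves (odd vector positions)
        // into dst[0], dst[2], ...; mask 0xAAAA keeps element 0 of evens from
        // being written below dst.
        _mm512_mask_storeu_epi32((char*)dst - 4, 0xAAAA, evens);
    }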

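    For comparison, a minimal sketch of the "obvious" AVX2-style version for 256-bit __m256i vectors (the function name is mine): the merge costs 2 ALU uops (vpsrlq + vpblendd) instead of 1 merge-masked shuffle.

    #include <immintrin.h>

    __m256i mulhi_epu32_256(__m256i a, __m256i b)
    {
        __m256i evens = _mm256_mul_epu32(a,b);
        __m256i odds  = _mm256_mul_epu32(_mm256_srli_epi64(a,32), _mm256_srli_epi64(b,32));
        // Shift evens' high halves down to the even positions, then take the odd
        // positions (odds' high halves, already in place) with an immediate blend.
        return _mm256_blend_epi32(_mm256_srli_epi64(evens, 32), odds, 0xAA);
    }
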
    Footnote 1: The Q&A you linked either doesn't fully describe the problem and/or solution, or truly only needs 2 numbers (in the bottom of a vector?), not 2 vectors of numbers.