c assembly x86 sse 8-bit

fast multiplication of int8 arrays by scalars


I wonder if there is a fast way of multiplying an int8 array by a scalar, i.e.

for(i = 0; i < n; ++i)
    z[i] = x * y[i];

I see that the Intel intrinsics guide lists several SIMD intrinsics, such as _mm_mulhi_epi16 and _mm_mullo_epi16, that do something like this for int16. Is there something similar for int8 that I'm missing?


Solution

  • Breaking the input into low and high bytes, one can multiply each half with _mm_mullo_epi16 and blend the two results:

    __m128i const kff00ff00 = _mm_set1_epi32(0xff00ff00);
    __m128i lo = _mm_mullo_epi16(y, x);
    __m128i hi = _mm_mullo_epi16(_mm_and_si128(y, kff00ff00), x);
    __m128i z = _mm_blendv_epi8(lo, hi, kff00ff00);
    

    AFAIK, when YYyy|YYyy|YYyy|YYyy is multiplied by 00xx|00xx|00xx|00xx, the high byte YY does not interfere with the low 8 bits ??ll of each 16-bit product. Likewise, the product YY00|YY00 * 00xx|00xx leaves the correct 8-bit product in the high byte, HH00. The two results are already at the correct alignment and only need to be blended. Note that _mm_blendv_epi8 requires SSE4.1.

    Here the inputs are set up as __m128i x = _mm_set1_epi16(scalar_x); and __m128i y = _mm_loadu_si128(...);
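
Putting the pieces together, a minimal sketch of a complete loop might look like the following. The function name mul_i8_scalar_sse41, the scalar tail loop, and the use of unaligned loads and stores are assumptions of this sketch and not part of the original answer. The target attribute is a GCC/Clang extension that lets the snippet build without -msse4.1; with other compilers, drop it and enable SSE4.1 through the compiler's own options.

```c
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_blendv_epi8 */

/* z[i] = low 8 bits of x * y[i], for i in [0, n). */
__attribute__((target("sse4.1")))
void mul_i8_scalar_sse41(int8_t *z, const int8_t *y, int8_t x, size_t n)
{
    /* 0xff00 in every 16-bit lane: selects the odd (high) bytes. */
    const __m128i odd_mask = _mm_set1_epi16(-256);
    const __m128i vx = _mm_set1_epi16(x);
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i vy = _mm_loadu_si128((const __m128i *)(y + i));
        /* Low byte of each lane holds the product for the even bytes. */
        __m128i lo = _mm_mullo_epi16(vy, vx);
        /* High byte of each lane holds the product for the odd bytes. */
        __m128i hi = _mm_mullo_epi16(_mm_and_si128(vy, odd_mask), vx);
        __m128i vz = _mm_blendv_epi8(lo, hi, odd_mask);
        _mm_storeu_si128((__m128i *)(z + i), vz);
    }

    /* Scalar tail for the remaining n % 16 elements. */
    for (; i < n; ++i)
        z[i] = (int8_t)(uint8_t)((unsigned)x * (unsigned)y[i]);
}
```

Because only the low 8 bits of each product are kept, the same code serves signed and unsigned 8-bit inputs.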

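    An alternative is to use pshufb (_mm_shuffle_epi8, SSSE3) to calculate LutLo[y & 15] + LutHi[y >> 4], where LutLo[i] holds x * i and LutHi[i] holds x * (i << 4), both modulo 256. Unfortunately there is no 8-bit shift, so the shift must be emulated with _mm_and_si128(_mm_srli_epi16(y, 4), _mm_set1_epi8(15)).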
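
The lookup-table variant could be sketched roughly as below. The function name mul_i8_scalar_lut, the way the two 16-entry tables are built, and the scalar tail loop are assumptions made for illustration. As with any GCC/Clang-specific code, the target attribute is an extension; elsewhere, remove it and enable SSSE3 via compiler options.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* z[i] = low 8 bits of x * y[i], using two 16-entry nibble tables. */
__attribute__((target("ssse3")))
void mul_i8_scalar_lut(int8_t *z, const int8_t *y, int8_t x, size_t n)
{
    uint8_t lo_tab[16], hi_tab[16];
    for (unsigned k = 0; k < 16; ++k) {
        lo_tab[k] = (uint8_t)((unsigned)x * k);         /* x * low nibble */
        hi_tab[k] = (uint8_t)((unsigned)x * (k << 4));  /* x * (high nibble * 16) */
    }
    const __m128i lut_lo = _mm_loadu_si128((const __m128i *)lo_tab);
    const __m128i lut_hi = _mm_loadu_si128((const __m128i *)hi_tab);
    const __m128i nibble = _mm_set1_epi8(15);
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i vy = _mm_loadu_si128((const __m128i *)(y + i));
        __m128i yl = _mm_and_si128(vy, nibble);
        /* No 8-bit shift exists: shift 16-bit lanes, then mask each byte. */
        __m128i yh = _mm_and_si128(_mm_srli_epi16(vy, 4), nibble);
        __m128i vz = _mm_add_epi8(_mm_shuffle_epi8(lut_lo, yl),
                                  _mm_shuffle_epi8(lut_hi, yh));
        _mm_storeu_si128((__m128i *)(z + i), vz);
    }

    /* Scalar tail for the remaining n % 16 elements. */
    for (; i < n; ++i)
        z[i] = (int8_t)(uint8_t)((unsigned)x * (unsigned)y[i]);
}
```

The tables depend only on x, so building them once per call is cheap relative to the loop for large n; whether this beats the _mm_mullo_epi16 version would have to be measured on the target CPU.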