I implemented 8-bit integer multiplication for int8 matrix multiplication.
(uint8_t or int8_t are the same since it's not widening.)
This is my code, but I think it's really slow.
#include <immintrin.h>

inline __m512i int8_mul(__m512i a, __m512i b) {
    // Widen both vectors from INT8 to INT16
    __m512i a_lo = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(a));
    __m512i a_hi = _mm512_cvtepi8_epi16(_mm512_extracti64x4_epi64(a, 1));
    __m512i b_lo = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(b));
    __m512i b_hi = _mm512_cvtepi8_epi16(_mm512_extracti64x4_epi64(b, 1));
    // Multiply in INT16
    __m512i mul_lo = _mm512_mullo_epi16(a_lo, b_lo);
    __m512i mul_hi = _mm512_mullo_epi16(a_hi, b_hi);
    // Narrow each 16-bit product back to 8 bits and combine the halves
    __m512i result = _mm512_setzero_si512();
    result = _mm512_inserti64x4(result, _mm512_cvtepi16_epi8(mul_lo), 0);
    result = _mm512_inserti64x4(result, _mm512_cvtepi16_epi8(mul_hi), 1);
    return result;
}
Is there another method that would get much better performance?
P.S. Why doesn't Intel support multiplying in epi8?
AMX has i8 dot-products (https://www.felixcloutier.com/x86/tdpbssd:tdpbsud:tdpbusd:tdpbuud). AVX-VNNI has vpdpbusd for dot-products of i8 x u8 (https://www.felixcloutier.com/x86/vpdpbusd); this is available even on the Alder Lake family without AVX-512. AVX-VNNI only seems to have i8 x u8 (like pmaddubsw) and i16 x i16, not one with both inputs being i8. And those are only widening dot-products, not a pure vertical c[i] = a[i] * b[i] without summing.
(Widening is why signed vs. unsigned matters; in your case, where you truncate the results back to 8 bits, it doesn't matter whether you sign-extend or zero-extend to 16 bits. The low 8 bits of the result don't depend on whether the MSB is interpreted as +2^(n-1) or -2^(n-1).)
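A scalar sketch of that point (my illustration, not part of the original code): for every byte pattern, the truncated product is the same whether you widen as signed or unsigned.

#include <stdint.h>

// Truncated 8-bit product, widening as signed first:
static uint8_t mul_low8_signed(int8_t a, int8_t b) {
    return (uint8_t)((int16_t)a * (int16_t)b);     // sign-extend, multiply, keep low 8 bits
}
// Same thing, widening as unsigned first:
static uint8_t mul_low8_unsigned(uint8_t a, uint8_t b) {
    return (uint8_t)((uint16_t)a * (uint16_t)b);   // zero-extend, multiply, keep low 8 bits
}
// For all 65536 input pairs these return identical bytes; only bits above
// bit 7 of the 16-bit product differ between the two interpretations.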
For a matmul, depending on your data layout, you potentially can use vpmaddubsw to sum horizontal pairs. Maybe you need to feed it with an unpack lo/hi. If you want 8-bit elements in your result, you can still truncate the result to 8-bit; otherwise its mixed signed / unsigned treatment of its inputs is a problem. If you need a horizontal sum, you can zero the high bytes of each i16 element and use psadbw against zero to sum the 8 bytes in each u64 (four of them holding useful data).
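Roughly what I mean, as an untested sketch (the function name and which input is treated as unsigned are my choices; vpmaddubsw treats its first source as unsigned and its second as signed):

#include <immintrin.h>

// Pairwise u8 x i8 products summed into i16 (vpmaddubsw), then the low byte
// of each i16 is kept and psadbw against zero sums groups of 8 bytes into
// each u64 lane. The saturation / truncation caveats from the text apply.
static __m512i pairsum_then_hsum_bytes(__m512i a_u8, __m512i b_i8)
{
    __m512i pair_sums = _mm512_maddubs_epi16(a_u8, b_i8);                       // i16 pair sums
    __m512i low_bytes = _mm512_and_si512(pair_sums, _mm512_set1_epi16(0x00ff)); // zero the high byte of each i16
    return _mm512_sad_epu8(low_bytes, _mm512_setzero_si512());                  // 8 u64 sums, 4 useful bytes each
}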
This is still a factor of 4 worse throughput than a hypothetical single-uop vpmullb would be, assuming a CPU with 2x 512-bit multiply units: it's 4 uops that all need to run on vector execution units. That assumes you're using this in code which bottlenecks on vector ALU throughput, not e.g. on memory, total front-end throughput, or latency; otherwise the potential speedup from a dedicated multiply instruction would be different (probably lower, especially for memory bottlenecks).
Beware that current clang (19) pessimizes this, using 3 instructions instead of a merge-masking vpshufb. It compiles as expected with GCC. I haven't looked at how it inlines into loops with either of those, or with MSVC.
This needs more constants, including a non-simple one that compilers will load from .rodata instead of mov-immediate + broadcast. So use the other version if this isn't inside a tight loop.
// UNTESTED
__m512i int8_mul_masked_pshufb(__m512i a, __m512i b)
{
    __m512i a_odd_hi = _mm512_and_si512(a, _mm512_set1_epi16(0xff00));
    // Multiply vectors in INT16
    __m512i mul_even = _mm512_mullo_epi16(a, b);           // with high garbage
    __m512i mul_odd  = _mm512_maddubs_epi16(a_odd_hi, b);  // product at the bottom of each i16, unlike the version further down
    // shift left by 1 byte, zeroing the low half of each word like vpsllw (although that doesn't matter)
    __m128i shuflane = _mm_set_epi8(14,-1, 12,-1, 10,-1, 8,-1,
                                     6,-1,  4,-1,  2,-1, 0,-1);
    __m512i shuf = _mm512_broadcast_i32x4(shuflane);
    // merge-masking: keep the even bytes of mul_even, replace the high byte of each word with the low byte of mul_odd's word
    __m512i result = _mm512_mask_shuffle_epi8(mul_even, 0xAAAAAAAAAAAAAAAA, mul_odd, shuf);
    return result;
}
See the text below for a simpler version and then further optimizations which led to this.
GCC compiles it to four instructions as expected (plus constant setup). I've indented the constant setup that can get hoisted out of loops more deeply; if this isn't inlined into a loop, that setup has to run every time. (Godbolt)
# GCC14.2 -O3
int8_mul_masked_pshufb:
        mov eax, -16711936 # 0xff00ff00
        vpbroadcastd zmm2, eax
        movabs rax, -6148914691236517206 # 0xAAAA...
    vpandd zmm2, zmm0, zmm2
    vpmullw zmm0, zmm0, zmm1
        kmovq k1, rax
    vpmaddubsw zmm2, zmm2, zmm1
        vbroadcasti32x4 zmm1, XMMWORD PTR .LC1[rip]
    vpshufb zmm0{k1}, zmm2, zmm1
    ret
If you can't use vpmaddubsw (i8 x u8, summing pairs into i16 with saturation), yes, unpacking to 16-bit is the way to go.
Unpacking to odd/even elements (srli_epi16(v, 8), and and(v, set1_epi16(0x00ff))) is cheaper than 3 shuffles per input, and lets you re-pack with shift / blend instead of 3 shuffles.
(You wrote it with 4 including _mm512_inserti64x4(zero, cvt(lo), 0). You could have started with __m512i result = _mm512_castsi256_si512(cvt(lo)). A good compiler might have optimized away that first _mm512_inserti64x4, but some compilers take intrinsics more literally than others.)
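For example, the re-pack of your original version could look like this (a sketch with a hypothetical helper name, keeping your cvtepi16 / inserti64x4 approach but starting from a cast instead of an insert into a zeroed vector):

#include <immintrin.h>

// Narrow two vectors of 16-bit products back to one vector of 64 bytes,
// without the first insert into a zeroed vector.
static __m512i repack_lo_hi(__m512i mul_lo, __m512i mul_hi)
{
    __m512i result = _mm512_castsi256_si512(_mm512_cvtepi16_epi8(mul_lo)); // low 32 bytes; upper half is don't-care
    return _mm512_inserti64x4(result, _mm512_cvtepi16_epi8(mul_hi), 1);    // fill the upper 32 bytes
}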
And since high garbage doesn't affect the result of a multiply (partial products are added, and carry only propagates low to high), we don't even have to isolate the even elements in the bottom halves of each i16 element. We can also save a shift at the end by producing the odd products in the high half of each i16, by computing (a_odd << 8) * b_odd; so actually we produce a_odd_shifted by just clearing the low bits, leaving the odd byte where it is.
// UNTESTED, let me know if there are any silly mistakes that need fixing
#include <immintrin.h>
#include <stdint.h>

//inline
__m512i int8_mul(__m512i a, __m512i b)
{
    __m512i a_even = a;  // _mm512_and_si512(a, _mm512_set1_epi16(0x00ff)); // not needed, high garbage is fine
    __m512i a_odd_shifted = _mm512_and_si512(a, _mm512_set1_epi16(0xff00));
    __m512i b_even = b;
    __m512i b_odd_lo = _mm512_srli_epi16(b, 8);
    // Multiply vectors in INT16
    __m512i mul_even = _mm512_mullo_epi16(a_even, b_even);
    __m512i mul_odd  = _mm512_mullo_epi16(a_odd_shifted, b_odd_lo);
    // Combine results:
    // blend using the same vector constant we already needed, instead of a k mask.
    // The first source operand is a variable not needed later, so it can be overwritten.
    __m512i result = _mm512_ternarylogic_epi32(mul_even, _mm512_set1_epi16(0xff00), mul_odd, 0xB8); // 0xB8: B ? C : A
    return result;

    // alternate version using a mask:
    // __m512i result = _mm512_mask_blend_epi8(0xAAAAAAAAAAAAAAAA, mul_even, mul_odd);
    // another alternative:
    // __m512i result = _mm512_mask_mov_epi8(mul_even, 0xAAAAAAAAAAAAAAAA, mul_odd);
}
(Variable naming: I also considered a_odd_hi to reflect the fact that we left the bits we want in the high half of the word. a_odd_shifted is supposed to mean that a_odd is the u8 or i8 value we want, and it's in a u16 as a_odd << 8. But it's not great because we didn't actually shift to get it there. It's a tiny piece of code all wrapped up in a function, so it's basically fine, but I'm still not happy with any of my ideas for variable names. a_oddx256 is another option that seems even clunkier.)
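Since the function is untested, here's a quick check I'd run against a scalar truncating multiply (the test patterns and the main function are my own sketch, assuming the int8_mul above is in scope):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int8_t a[64], b[64], out[64];
    for (int i = 0; i < 64; i++) {
        a[i] = (int8_t)(i * 37 + 11);    // arbitrary test patterns
        b[i] = (int8_t)(i * -53 + 5);
    }
    __m512i vr = int8_mul(_mm512_loadu_si512(a), _mm512_loadu_si512(b));
    _mm512_storeu_si512(out, vr);
    for (int i = 0; i < 64; i++) {
        int8_t expect = (int8_t)(a[i] * b[i]);   // low 8 bits of the scalar product
        if (out[i] != expect) {
            printf("mismatch at %d: got %d, expected %d\n", i, out[i], expect);
            return 1;
        }
    }
    puts("ok");
    return 0;
}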
GCC and Clang both do pretty reasonably:
# clang19 -O3 -march=x86-64-v4
.LCPI0_1:
.long 4278255360 # in .rodata
# in .text
int8_mul:
vpsrlw zmm2, zmm1, 8
vpmullw zmm1, zmm1, zmm0
vpandd zmm0, zmm0, dword ptr [rip + .LCPI0_1]{1to16}
vpmullw zmm0, zmm2, zmm0
vpternlogd zmm0, zmm1, dword ptr [rip + .LCPI0_1]{1to16}, 228
ret
GCC materializes the constant with mov-immediate + vpbroadcastd zmm1, eax (after wasting an instruction to mov one input to a different vector reg); clang chooses to broadcast-load it twice. When inlining, both should hoist the constant setup out of the loop and just use register source operands.
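For instance, in a caller like this hypothetical array loop (my sketch, same translation unit as the int8_mul above, n assumed to be a multiple of 64), the broadcast of set1_epi16(0xff00) should end up outside the loop and only the two multiplies, the AND, the shift, and the vpternlog run per iteration:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

void int8_mul_arrays(int8_t *dst, const int8_t *a, const int8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        _mm512_storeu_si512(dst + i, int8_mul(va, vb));
    }
}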
I could have used a mask constant for both the output blending and the input masking, using a_odd_shifted = _mm512_maskz_mov_epi8(0xAAAAAAAAAAAAAAAA, a). That would still be only one constant to set up, but being 64-bit it would need movabs rcx, imm64 / kmovq k1, rcx or something, vs. mov ecx, imm32 / vpbroadcastd zmm2, ecx. Masked vmovdqu8 still takes a vector execution unit, same as vpandd.
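Spelled out (untested sketch, hypothetical name; same structure as the int8_mul above but with the k mask reused for both jobs):

#include <immintrin.h>

__m512i int8_mul_one_mask(__m512i a, __m512i b)
{
    const __mmask64 odd = 0xAAAAAAAAAAAAAAAAULL;             // one constant, used twice
    __m512i a_odd_shifted = _mm512_maskz_mov_epi8(odd, a);   // keep the odd bytes of a, zero the even ones
    __m512i b_odd_lo      = _mm512_srli_epi16(b, 8);         // odd bytes of b at the bottom of each word
    __m512i mul_even = _mm512_mullo_epi16(a, b);                     // even products in the low byte, high garbage
    __m512i mul_odd  = _mm512_mullo_epi16(a_odd_shifted, b_odd_lo);  // odd products land in the high byte
    return _mm512_mask_mov_epi8(mul_even, odd, mul_odd);             // take the odd bytes from mul_odd
}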
GCC compiles your original as written, with 6 shuffles on the 2 inputs and 3 shuffles to re-pack. (It does optimize away the insert into a zeroed vector.) But clang does something completely different, re-vectorizing it with vpmaddubsw, a lot more like my version.
# clang19 for the original version!!
.LCPI1_1:
.short 255
int8_mul_shuffle:
vpbroadcastw zmm2, word ptr [rip + .LCPI1_1] # set1 (0x00ff)
vpandq zmm3, zmm2, zmm0 # a_even
vpmaddubsw zmm3, zmm1, zmm3 # b_even * a_even (plus 0 * b_odd = 0)
vpandnq zmm0, zmm2, zmm0 # a_odd
vpmaddubsw zmm0, zmm1, zmm0 # b_odd * a_odd (plus 0 * b_even = 0)
vpsllw zmm0, zmm0, 8 # result_odd <<= 8
vpternlogq zmm0, zmm3, zmm2, 248 # blend
ret
Not counting the constant setup and ret, this is 6 instructions. My version was 5, and the first multiply could start right away, without either input having to go through an AND first. Also, mine has slightly better critical-path latency: from a being ready, clang's version does vpandq / vpmaddubsw / vpsllw / vpternlogq all in a dependency chain. Mine lets the shift and the AND for the odd elements run in parallel.
vpmaddubsw is a fun way of handling the odd/even, though, since it does (x_even * y_even) + (x_odd * y_odd) in each pair, zero-extending the first input and sign-extending the second. So zeroing the odd or even elements of one input means the corresponding elements are multiplied by 0. IDK if it tends to use any less power than vpmullw, being only an 8-bit widening multiply instead of a 16-bit non-widening one.
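Written back as intrinsics, clang's approach looks roughly like this (my untested transcription of its asm above; the function name is made up):

#include <immintrin.h>

__m512i int8_mul_clang_style(__m512i a, __m512i b)
{
    __m512i mask   = _mm512_set1_epi16(0x00ff);
    __m512i a_even = _mm512_and_si512(a, mask);        // odd bytes of a zeroed
    __m512i a_odd  = _mm512_andnot_si512(mask, a);     // even bytes of a zeroed
    __m512i even   = _mm512_maddubs_epi16(b, a_even);  // a_even * b_even in each i16 (the odd pair contributes 0)
    __m512i odd    = _mm512_maddubs_epi16(b, a_odd);   // a_odd * b_odd in each i16 (the even pair contributes 0)
    odd = _mm512_slli_epi16(odd, 8);                   // move each odd product's low byte up to the odd byte position
    // keep the low byte of 'even' and the high byte of 'odd': A | (B & C)
    return _mm512_ternarylogic_epi32(odd, even, mask, 0xF8);
}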
Clang's version could use vpmullw instead of vpandq / vpmaddubsw for the even x even product. That would be a drop-in replacement, since vpmaddubsw also produces high garbage in each word, so they have to blend like I do, not just OR. ANDing one input and shifting the other before the second vpmullw doesn't save instructions, but would shorten latency (assuming both inputs are ready at the same time) and is never(?) worse.
I also considered using AVX-512VBMI vpermt2b to select bytes from both product vectors, but that needs a non-simple vector constant and is multiple uops on Intel CPUs (unlike Zen 4). vpermt2b would be directly usable after and / vpmullw / vpmaddubsw, and would be efficient on AMD but not Intel. It would need a 64-byte constant in .rodata. A sketch of that idea is below.
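Something like this (untested sketch; the name and the way I build the index vector are mine — in real code the 64-byte shuffle control would be a .rodata constant):

#include <immintrin.h>
#include <stdint.h>

// Requires AVX-512VBMI. The even and odd products each end up in the low byte
// of an i16 element; one vpermt2b interleaves those low bytes from both vectors.
__m512i int8_mul_vpermt2b(__m512i a, __m512i b)
{
    __m512i mul_even = _mm512_mullo_epi16(a, b);                        // high garbage is fine
    __m512i mul_odd  = _mm512_mullo_epi16(_mm512_srli_epi16(a, 8),
                                          _mm512_srli_epi16(b, 8));
    int8_t idx[64];
    for (int i = 0; i < 64; i += 2) {
        idx[i]     = (int8_t)i;         // even result byte: low byte of word i/2 of mul_even
        idx[i + 1] = (int8_t)(64 + i);  // odd result byte: low byte of word i/2 of mul_odd (bit 6 selects the second source)
    }
    __m512i shuf = _mm512_loadu_si512(idx);
    return _mm512_permutex2var_epi8(mul_even, shuf, mul_odd);
}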
Ice Lake and later have vpermb, which they run as a single uop (https://uops.info/), so a merge-masked vpermb of the odd elements, merging with the vector holding the even elements, could also get the job done efficiently on Intel as well as AMD. Combined with clang's trick of using vpmaddubsw with less pre-processing of inputs, that actually gets us down to 4 uops.
But with odd/even it doesn't need to be a lane-crossing shuffle; just AVX-512BW vpshufb zmm{k}, zmm, zmm is fine. (vpslldq zmm, the in-lane byte shift, doesn't support masking.) So our shuffle control can just be broadcast from a 16-byte constant. See int8_mul_masked_pshufb near the start of this answer, and in the Godbolt link.
Merge-masking makes vpshufb 3-cycle latency on Intel (vs. 1 without masking), but that's still better than vpermb at 5 or 3 cycles. Both are limited to the same execution ports, the ones with shuffle units.
Zen 4 runs vpshufb with 2-cycle latency with or without masking, even for the XMM version, at least according to testing by https://uops.info/ using vandpd or vpand to create a dependency chain. https://agner.org/optimize/ confirms latency=2 for that and a couple of other in-lane shuffles, despite most still being 1.
Zen 5 latencies from InstLat showed all SIMD instructions having at least 2-cycle latency on that later uarch, even bitwise booleans. (But 2/clock throughput for ZMM multiplies, and an impressive 4/clock for vpandq and even some ZMM shuffles, although not vpshufb. Unlike Intel, where only two execution ports, p0 and p5, have any 512-bit execution units.) Update: that InstLat result has since been taken down, presumably because it was wrong somehow. Another one shows 1-cycle latency for instructions like vpaddb and vpor on Zen 5, which is much less surprising. Still 2/clock multiply throughput, with 3-cycle latency.