What is the appropriate way to load data for the recent AVX-VNNI and Arm Neon MMLA instructions?
For example, the description of SMMLA is:
Signed 8-bit integer matrix multiply-accumulate. This instruction multiplies the 2x8 matrix of signed 8-bit integer values in the first source vector by the 8x2 matrix of signed 8-bit integer values in the second source vector. The resulting 2x2 32-bit integer matrix [...]
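To make the operand layout concrete, here is a scalar sketch of the SMMLA semantics (the function name `smmla_ref` is my own, not an intrinsic): the first source holds a 2x8 matrix row-major, the second holds an 8x2 matrix column-major, and the 2x2 products accumulate into 32-bit lanes.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference for SMMLA: a[ ] is a 2x8 matrix stored row-major,
 * b[ ] is an 8x2 matrix stored column-major (i.e. each 8-byte half of
 * the register is one column), acc[ ] is the 2x2 accumulator row-major. */
static void smmla_ref(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 2; i++)             /* output row   */
        for (int j = 0; j < 2; j++)         /* output column */
            for (int k = 0; k < 8; k++)     /* reduction     */
                acc[2 * i + j] += (int32_t)a[8 * i + k] * (int32_t)b[8 * j + k];
}
```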
Similarly, the description for _mm256_dpbusd_epi32 is:
Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate signed 16-bit results. Sum these 4 results with the corresponding 32-bit integer in src, and store the packed 32-bit results in dst.
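As a scalar sketch of what one 32-bit lane of that intrinsic computes (the name `dpbusd_lane_ref` is mine; the real intrinsic processes eight such lanes in a `__m256i`): each lane is a dot product of 4 unsigned bytes with 4 signed bytes, added to the accumulator. Since each u8*s8 product fits in 16 bits and the sum of four fits in 32 bits, computing directly in 32-bit arithmetic matches the non-saturating instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference for one 32-bit lane of _mm256_dpbusd_epi32:
 * sum four u8*s8 products into the 32-bit accumulator src. */
static int32_t dpbusd_lane_ref(int32_t src, const uint8_t a[4], const int8_t b[4]) {
    int32_t sum = src;
    for (int k = 0; k < 4; k++)
        sum += (int32_t)a[k] * (int32_t)b[k];
    return sum;
}
```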
It seems that they all require inputs of the form 2x8 and 8x2 (or 4 groups of 4 in the VNNI case) and produce outputs of the form 2x2 (or 4x1 per lane group). How can I efficiently load and store data for these instructions?
I see three broad possibilities for using these instructions, none of them appealing.
Example code for the inner loop (reduction over K) with a small 4xK input matrix A (row-major) and a Kx4 matrix B (column-major) is as follows:
for (size_t k = 0; k < 64; k += 8) {
    // Rows 0 and 1 of A, combined into one 2x8 MMLA operand.
    uint8x8_t low = vld1_u8(row0);
    uint8x8_t high = vld1_u8(row1);
    uint8x16_t row01x01234567 = vcombine_u8(low, high);
    row0 += 8;
    row1 += 8;
    // Rows 2 and 3 of A.
    low = vld1_u8(row2);
    high = vld1_u8(row3);
    uint8x16_t row23x01234567 = vcombine_u8(low, high);
    row2 += 8;
    row3 += 8;
    // Columns 0 and 1 of B (column-major, so the loads are contiguous).
    low = vld1_u8(col0);
    high = vld1_u8(col1);
    uint8x16_t col01x01234567 = vcombine_u8(low, high);
    col0 += 8;
    col1 += 8;
    // Columns 2 and 3 of B.
    low = vld1_u8(col2);
    high = vld1_u8(col3);
    uint8x16_t col23x01234567 = vcombine_u8(low, high);
    col2 += 8;
    col3 += 8;
    // Four 2x2 UMMLA accumulations cover the full 4x4 output tile.
    out01x01 = vmmlaq_u32(out01x01, row01x01234567, col01x01234567);
    out01x23 = vmmlaq_u32(out01x23, row01x01234567, col23x01234567);
    out23x01 = vmmlaq_u32(out23x01, row23x01234567, col01x01234567);
    out23x23 = vmmlaq_u32(out23x23, row23x01234567, col23x01234567);
}
The result is correct, but it seems terribly inefficient: every MMLA operand costs two loads plus a vcombine. The code above is just an example; in practice I would use larger tile sizes to maximize register usage.
Packing the matrices A and B is indeed necessary.
For a short outline, consider the PowerPC documentation (the "red book"), https://www.redbooks.ibm.com/abstracts/redp5612.html (page 35). PowerPC has blocked matrix-multiply instructions similar to VNNI and Arm Neon MMLA.
I have written such a packing function within the matrix-multiply code; the packing did no measurable harm to the throughput.
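As a sketch of what such packing might look like for the A side (the function name and layout are my own, assuming K is a multiple of 8): rearrange the 4xK row-major matrix into contiguous 2x8 tiles, so that in the inner loop a single 16-byte vld1q_u8 yields a complete MMLA operand instead of two vld1_u8 loads plus a vcombine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical packing sketch: rearrange a 4xK row-major uint8 matrix A
 * into 2x8 tiles. Each 16-byte tile holds rows r and r+1, columns
 * k..k+7, concatenated - exactly the layout one MMLA operand expects.
 * Assumes K is a multiple of 8; dst must hold 4*K bytes. */
static void pack_a_2x8(uint8_t *dst, const uint8_t *a, size_t K) {
    for (size_t r = 0; r < 4; r += 2)           /* row pairs       */
        for (size_t k = 0; k < K; k += 8) {     /* 8-wide K blocks */
            memcpy(dst, a + r * K + k, 8);          /* row r       */
            memcpy(dst + 8, a + (r + 1) * K + k, 8);/* row r + 1   */
            dst += 16;
        }
}
```

The same idea applies to B: interleave pairs of columns into 8x2 tiles so the second operand is also a single contiguous load.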