Below is an implementation of a matrix multiply using AVX2. The machine I am using only supports AVX, so I am trying to implement the same configuration with AVX.
However, I am having trouble deciphering what the differences really are, and what would need to be changed. What in this implementation is specific to AVX2 that would not work on a machine that can only process AVX?
This is a link to all the intrinsics for AVX as well as AVX2: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX
Thank you for any insight at all!
for (uint64_t i = 0; i < M; i++)
{
    for (uint64_t j = 0; j < N; j++)
    {
        __m256 X = _mm256_setzero_ps();
        for (uint64_t k = 0; k < L; k += 8) {
            const __m256 AV = _mm256_load_ps(A + i*L + k);
            const __m256 BV = _mm256_load_ps(B + j*L + k);
            X = _mm256_fmadd_ps(AV, BV, X);
        }
        C[i*N + j] = hsum_avx(X);
    }
}
Your code uses AVX1 + FMA instructions, not AVX2. It would run OK on an AMD Piledriver, for example (assuming the hsum is implemented in a sane way: extracting the high half and then using 128-bit shuffles).
If your AVX-only CPU doesn't have FMA either, you'd need to use _mm256_mul_ps and _mm256_add_ps.
For Intel, AVX2 and FMA were introduced in the same generation, Haswell, but those are different extensions. FMA is available in some CPUs without AVX2.
There is unfortunately even a VIA CPU with AVX2 but not FMA, otherwise AVX2 implies FMA unless you're in a VM or emulator that intentionally has a combination of extensions that real HW doesn't.
MSVC /arch:AVX2 and GCC / clang -march=x86-64-v3 both imply a Haswell feature level: AVX2+FMA+BMI1/2.
(There was an FMA4 extension in some AMD CPUs, with 4 operands (3 inputs and a separate output), Bulldozer through Zen1, after Intel pulled a switcheroo on AMD too late for them to change their Bulldozer design to support FMA3. That's why there's an AMD-only FMA4, and why it wasn't until Piledriver that AMD supported an FMA extension compatible with Intel. But that's part of the dust pile of history now, so usually we just say FMA to reference the extension that's technically called FMA3. See Agner Fog's 2009 blog Stop the instruction set war, and How do I know if I can compile with FMA instruction sets?)
AVX1 only provides 256-bit versions of FP instructions, not integer (except vptest, although FP in this case does include bitwise instructions like vxorps ymm). Shuffles are only in-lane (e.g. vshufps ymm or the new vpermilps) or with 128-bit granularity (vperm2f128 or vinsertf128 / vextractf128). AVX1 also provides VEX encodings of all SSE1..4 instructions including integer, with 3-operand non-destructive forms, e.g. vpsubb xmm0, xmm1, [rdi].
AVX2 adds 256-bit versions of integer instructions, plus lane-crossing shuffles: vpermps / vpermd and vpermq / pd, and vbroadcastss/sd ymm, xmm with a register source (AVX1 only had vbroadcastss ymm, [mem]). Also an efficient vpblendd immediate integer blend instruction, like vblendps.
FMA provides vfmadd213ps x/ymm, x/ymm, x/ymm/mem and so on (and pd and scalar ss/sd versions). Also fmsub.. (subtract the 3rd operand), fnmadd.. (negate the product), and even fmaddsub...ps. _mm256_fmadd_ps will compile to some form of vfmadd...ps, depending on which input operand the compiler wants to overwrite, and which operand it wants to use as the memory operand.

This order of introduction explains the bad choice of intrinsic naming, e.g. _mm256_permute_ps
(immediate) and _mm256_permutevar_ps(data, idx) (vector control) are AVX1 vpermilps in-lane permutes, with AVX2 getting saddled with _mm256_permutevar8x32_ps(data, idx) for vpermps dst, idx, data, which only exists in YMM operand-size. AVX-512 gets _mm256_permutexvar_ps(idx, data), available in YMM and ZMM sizes. (The asm reference manual only mentions the latter intrinsic, but GCC rejects it unless AVX-512 is enabled.)
So confusingly the intrinsic has an x for lane-crossing, while the asm mnemonic is just plain.
And note that the AVX2 intrinsic (_mm256_permutevar8x32_ps) has its operands backwards vs. Intel-syntax assembly, but the AVX-512 intrinsics (_mm256_permutexvar_ps and _mm512_permutexvar_ps) have them in the order you'd expect from the asm instruction.
The situation is the same with vpermd and vpermq: the asm manual only mentions the AVX-512 intrinsics (_mm256_permutexvar_epi32(idx, data), _mm256_permutexvar_epi64(idx, data) and _mm256_permutex_epi64(data, imm)), but the intrinsics guide also shows AVX2 8x32 and 4x64 intrinsics with operands in the reversed order. Again, GCC requires using the AVX2 intrinsic if you don't enable AVX-512.