x86matrix-multiplicationsimdavxavx2

Differences between AVX and AVX2


below is an implementation of a matrix multiply in AVX2. The machine I am using only supports AVX so I am trying to implement the same configuration with AVX.

However, I am having trouble deciphering really what the differences are, and what would needed to be changed! What in this implementation is specific to AVX2 that would not work with a machine only able to process AVX?

This is a link to all the commands for AVX as well as AVX2 https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX

Thank you for any insight at all!

 for (uint64_t i = 0; i < M; i++)
     {
         for (uint64_t j = 0; j < N; j++)
         {
             __m256 X = _mm256_setzero_ps();
             for (uint64_t k = 0; k < L; k+= 8) {
                 const __m256 AV = _mm256_load_ps(A+i*L+k);
                 const __m256 BV = _mm256_load_ps(B+j*L+k);
                 X = _mm256_fmadd_ps(AV,BV,X);
             }
             C[i*N+j] = hsum_avx(X);
         }
     }


Solution

  • Your code uses AVX1 + FMA instructions, not AVX2. It would run ok on an AMD Piledriver, for example. (Assuming the hsum is implemented in a sane way, extracting the high half and then using 128-bit shuffles.).

    If your AVX-only CPU doesn't have FMA either, you'd need to use _mm256_mul_ps and _mm256_add_ps.


    For Intel, AVX2 and FMA were introduced in the same generation, Haswell, but those are different extensions. FMA is available in some CPUs without AVX2.

    There is unfortunately even a VIA CPU with AVX2 but not FMA, otherwise AVX2 implies FMA unless you're in a VM or emulator that intentionally has a combination of extensions that real HW doesn't.

    MSVC /arch:AVX2 and GCC / clang -march=x86-64-v3 both imply a Haswell feature level, AVX2+FMA+BMI1/2.

    (There was an FMA4 extension in some AMD CPUs, with 4 operands (3 inputs and a separate output), Bulldozer through Zen1, after Intel pulled a switcheroo on AMD too late for them to change their Bulldozer design to support FMA3. That's why there's an AMD-only FMA4, and why it wasn't until Piledriver that AMD supported an FMA extension compatible with Intel. But that's part of the dust pile of history now, so usually we just say FMA to reference the extension that's technically called FMA3. See Agner Fog's 2009 blog Stop the instruction set war, and How do I know if I can compile with FMA instruction sets?)


    This order of introduction explains the bad choice of intrinsic naming, e.g. _mm256_permute_ps (immediate) and _mm256_permutevar_ps (vector control) are AVX1 vpermilps in-lane permute, with AVX2 getting saddled with _mm256_permutexvar_ps. So confusingly the intrinsic has an x for lane-crossing, while the asm mnemonic is just plain.