Tags: assembly, simd, microbenchmark, avx512

Why does adding a vmovapd instruction make SIMD-vectorized code run faster?


I am playing with vectorization for some high-performance numerical code, and I noticed that the performance of SIMD vectorization using Intel's SSE, AVX and AVX512 instructions does not scale with the length of the vector registers on my laptop, which has a Tiger Lake CPU. I was hoping AVX would be around twice as fast as SSE, and AVX512 around twice as fast as AVX. Here is a toy example in x86-64 assembly, similar to the code I am developing, with one instruction commented out:

AVX512test.s

.globl _start
.text
_start:  
  xor %edx, %edx          # loop counter
loop:
  vmovapd   %zmm17, %zmm18
  vfmadd213pd   %zmm5, %zmm6, %zmm18  
  vfmadd213pd   %zmm4, %zmm17, %zmm18 
  vfmadd213pd   %zmm3, %zmm17, %zmm18 
  vfmadd213pd   %zmm2, %zmm17, %zmm18 
  vfmadd213pd   %zmm1, %zmm17, %zmm18 
  vfmadd213pd   %zmm0, %zmm17, %zmm18 
#  vmovapd  %zmm17, %zmm19
  vfmadd213pd   %zmm15, %zmm16, %zmm19
  vfmadd213pd   %zmm14, %zmm17, %zmm19
  vfmadd213pd   %zmm13, %zmm17, %zmm19
  vfmadd213pd   %zmm12, %zmm17, %zmm19
  vfmadd213pd   %zmm11, %zmm17, %zmm19
  vfmadd213pd   %zmm10, %zmm17, %zmm19
  vfmadd213pd   %zmm9, %zmm17, %zmm19 
  vfmadd213pd   %zmm8, %zmm17, %zmm19 
  vfmadd213pd   %zmm7, %zmm17, %zmm19 
  vdivpd    %zmm19, %zmm18, %zmm18
  inc %edx
  cmp $10000000, %edx
  jne loop
  movq $60, %rax          # exit via Linux syscall 60
  syscall

I have the same code for AVX, where zmm is replaced with ymm and the loop count is set to 20000000, and for SSE, where zmm is replaced with xmm and the loop count is set to 40000000. If I uncomment the vmovapd instruction, assemble it with as -o AVX512test.o AVX512test.s , link it with ld -o AVX512test.x AVX512test.o and run it with time ./AVX512test.x , I get:

SSE

real    0m0.090s
user    0m0.089s
sys 0m0.000s

AVX

real    0m0.050s
user    0m0.049s
sys 0m0.000s

AVX512

real    0m0.058s
user    0m0.058s
sys 0m0.000s

So going from SSE to AVX scales well, but going from AVX to AVX512 does not.

If I instead keep the vmovapd instruction commented out, I get:

SSE


real    0m0.351s
user    0m0.351s
sys 0m0.000s

AVX

real    0m0.189s
user    0m0.189s
sys 0m0.000s

AVX512

real    0m0.109s
user    0m0.109s
sys 0m0.000s

So without the second vmovapd instruction the calculation becomes significantly slower, but on the other hand its performance scales with vector register length as I would expect.

My question is: why does removing an instruction make the code run slower, and why isn't AVX512 faster than AVX when the vmovapd instruction is present?

I tried modifying the code in different ways, such as using fewer or more instructions inside the loop body and making all the instructions the same. But the only thing I have found that significantly affects the performance and its scaling with vector register length is whether or not there is a vmovapd instruction among the others. I am a bit clueless, but I wonder whether it has something to do with out-of-order execution. If possible, I want the code vectorized over 512-bit registers to be around twice as fast as the code vectorized over 256-bit registers, with the vmovapd instruction uncommented.


Solution

  • 100 ms is way too little for a good benchmark; make it at least a second or two. That said, with the vmovapd you commented out left in, the instructions targeting zmm19 can run in parallel with the preceding instructions; without it, they cannot. This explains the difference in total duration.

    AVX-512 can run on fewer ports than AVX and SSE on many microarchitectures. Specifically, AVX and SSE instructions can run on ports p0, p1, and p5, while AVX-512 instructions use ports p0 and p1 together (another, non-SIMD instruction can still run on p1 while p0+p1 are occupied by an AVX-512 instruction) or p5.

    FMA instructions run on ports p0 and p1 on Tiger Lake client. This means that with SSE and AVX, two FMAs can run per cycle, while only one can with AVX-512. As AVX-512 has twice the vector width, both cases deliver the same number of FLOPs per cycle, as long as your numerical kernel admits at least two FMA operations in parallel.

    This is the case with your code if you keep the move instruction, as then multiple iterations of the loop are independent. As you observe, performance for AVX2 and AVX-512 is then very similar. But if you comment out the instruction, the iterations are coupled through a dependency chain and must execute sequentially. If your code is limited to at most one FMA per cycle, AVX-512 will be faster than AVX2.

    My recommendation: optimise your code for higher instruction-level parallelism (ILP). AVX-512 is a good choice for this application, and you should keep using it. When you execute this code on a server CPU (Xeon Gold class), both p01 and p5 can execute FMA instructions, and the code will be faster with AVX-512 once you have rewritten it to permit higher ILP (see the sketch below).
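
    As an illustration of the ILP point (an editorial sketch, not code from the question or the answer): one way to expose more ILP is to keep at least two independent FMA chains in flight per iteration, for example by evaluating the kernel on two independent input vectors with separate accumulator registers. The file name, the shortened four-FMA chains, the loop count and the register choices (zmm20 as a hypothetical second input, zmm21 as its accumulator, zmm0-zmm3 as coefficients) are arbitrary placeholders.

    AVX512test_ilp.s

    .globl _start
    .text
    _start:
      xor %edx, %edx                      # loop counter
    loop:
      # chain A: input in zmm17, accumulator zmm18
      vmovapd       %zmm17, %zmm18
      vfmadd213pd   %zmm3, %zmm17, %zmm18
      vfmadd213pd   %zmm2, %zmm17, %zmm18
      vfmadd213pd   %zmm1, %zmm17, %zmm18
      vfmadd213pd   %zmm0, %zmm17, %zmm18
      # chain B: independent input in zmm20, accumulator zmm21
      vmovapd       %zmm20, %zmm21
      vfmadd213pd   %zmm3, %zmm20, %zmm21
      vfmadd213pd   %zmm2, %zmm20, %zmm21
      vfmadd213pd   %zmm1, %zmm20, %zmm21
      vfmadd213pd   %zmm0, %zmm20, %zmm21
      inc %edx
      cmp $10000000, %edx
      jne loop
      movq $60, %rax                      # exit via Linux syscall 60
      syscall

    It can be assembled, linked and timed the same way as AVX512test.s. Because chain A and chain B never read each other's accumulators, the out-of-order core can interleave them: with SSE and AVX this keeps both FMA ports busy, with AVX-512 on Tiger Lake client it keeps the fused 512-bit FMA pipe busy, and on a server part where p5 also executes FMA instructions this structure is what lets AVX-512 pull ahead.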