Trying to follow a course on Coursera, I tried to optimise some sample C++ code for my Intel i5-8259U CPU, which I believe supports the AVX2 SIMD instruction set. Now, AVX2 supplies 16 registers per core (called YMM0, YMM1, ..., YMM15) which are 256 bits wide, meaning that each can process up to 4 double-precision floating point numbers simultaneously. Taking advantage of AVX2 SIMD instructions should therefore optimise my code to run up to 4 times faster compared to scalar instructions.
In the linked course, you can try running the same code for numerical integration on an Intel Xeon Phi 7210 (Knights Landing) processor, which supports AVX512 with 512-bit wide registers. That means we should expect double-precision operations to speed up by a factor of 8. Indeed, the code used by the instructor obtains speedups up to a factor of 14, which is 175% of 8. The additional speedup is due to OpenMP.
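The heart of the computation is a reduction loop of roughly this shape (my paraphrase to show the structure, not the exact course code):

```c++
// Midpoint-rule numerical integration of f(x) = x*x over [a, b].
// Iterations are independent apart from the running sum, so the compiler
// can vectorize the loop, keeping partial sums in SIMD registers.
double integrate(double a, double b, long n) {
    const double h = (b - a) / n;
    double sum = 0.0;
    for (long i = 0; i < n; ++i) {
        double x = a + (i + 0.5) * h;  // midpoint of subinterval i
        sum += x * x;                  // one mul + one add per element
    }
    return sum * h;
}
```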
In order to run the same code on my CPU, the only thing I changed was the optimisation flag passed to the Intel compiler: instead of -xMIC-AVX512, I used -xCORE-AVX2. The speedup I obtained is only a factor of 2, a measly 50% of the speedup expected from SIMD vectorisation alone on 256-bit registers. Compare this 50% to the 175% obtained on the Intel Xeon Phi processor.
Why do I see this drastic loss in performance just by moving from AVX512 to AVX2? Surely, something other than SIMD optimisation is at play here. What am I missing?
P.S. You can find the referenced code in the folder integral/solutions/1-simd/ here.
TL;DR: KNL (Knights Landing) is only good at running code specifically compiled for it, and thus gets a much bigger speedup because it stumbles badly running "generic" code.
Coffee Lake only gets a speedup of 2 from 128-bit SSE2 to 256-bit AVX, running both "generic" and targeted code optimally.
Mainstream CPUs like Coffee Lake are one of the targets that "generic" tuning in modern compilers cares about, and they don't have many weaknesses in general. But KNL isn't; ICC without any options doesn't care about KNL.
You're assuming that the baseline for your speedups is scalar. But without any options like -march=native or -xCORE-AVX2, Intel's compiler (ICC) will still auto-vectorize with SSE2, because that's the baseline for x86-64.
-xCORE-AVX2 doesn't enable auto-vectorization, it just gives auto-vectorization even more instructions to play with. Optimization level (including auto-vectorization) is controlled by -O0 / -O2 / -O3, and for FP by strict vs. fast fp-model. Intel's compiler defaults to full optimization with -fp-model fast=1 (one level below fast=2), so it's something like gcc -O3 -ffast-math.
But without extra options, it can only use the baseline instruction-set, which for x86-64 is SSE2. That's still better than scalar.
SSE2 uses 128-bit XMM registers for packed double math, with the same instruction throughput as AVX (on your i5 Coffee Lake) but half the amount of work per instruction. (And it doesn't have FMA, so the compiler couldn't contract any mul+add operations in your source into FMA instructions the way it could with AVX+FMA.)
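To illustrate (a toy loop of mine, not the course code), the same source vectorizes both ways; without any -x option ICC targets 128-bit XMM registers, while -xCORE-AVX2 gets 256-bit YMM registers plus FMA contraction:

```c++
// axpy-style loop: one mul + one add per element.
// icpc -O3:              SSE2 -> mulpd + addpd on xmm regs (2 doubles each)
// icpc -O3 -xCORE-AVX2:  vfmadd213pd etc. on ymm regs (4 doubles, fused)
void axpy(double a, const double *x, double *y, long n) {
    for (long i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   // FMA-contractible mul+add pattern
}
```

(-qopt-report=2 will show you what the vectorizer chose in each case.)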
So a factor of 2 speedup on your Coffee Lake CPU is exactly what you should expect for a simple problem that purely bottlenecks on vector mul/add/FMA SIMD throughput (not memory / cache or anything else).
Speedup depends on what your code is doing. If you bottleneck on memory or cache bandwidth, wider registers only help a bit to better utilize memory parallelism and keep it saturated.
And AVX + AVX2 add more powerful shuffles and blends and other cool stuff, but for simple problems with pure vertical SIMD that doesn't help.
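For example (a standard idiom, not taken from the course code), the shuffle/extract toolbox mostly matters outside the hot loop, such as reducing a vector accumulator to a scalar at the end:

```c++
#include <immintrin.h>

// Horizontal sum of a __m256d: add the high and low 128-bit halves,
// then the last two doubles. This runs once after the loop, so it
// doesn't affect the main loop's vertical-SIMD throughput.
double hsum256d(__m256d v) {
    __m128d lo = _mm256_castpd256_pd128(v);     // low half (no instruction)
    __m128d hi = _mm256_extractf128_pd(v, 1);   // high half
    lo = _mm_add_pd(lo, hi);                    // two pairwise sums
    __m128d high64 = _mm_unpackhi_pd(lo, lo);   // move element 1 to bottom
    return _mm_cvtsd_f64(_mm_add_sd(lo, high64));
}
```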
So the real question is: why does AVX512 help by more than 4x on KNL? 8 double elements per AVX512 SIMD instruction on Knights Landing, up from 2 with SSE2, would give an expected speedup of 4x if instruction throughput were the same, assuming total instruction count was identical with AVX512. (Which isn't the case: for the same loop unroll, the amount of vector work per unit of loop overhead grows with wider vectors. For example, a loop body of one FMA plus a pointer increment and a compare/branch does 2 useful doubles per ~3 instructions with SSE2 but 8 per ~3 with AVX512. Plus other factors.)
Hard to say for sure without knowing what source code you were compiling. AVX512 adds some features that may help save instructions, like broadcast memory-source operands instead of requiring a separate broadcast load into a register.
If your problem involves any division, KNL has extremely slow full-precision FP division, and should usually use an AVX512ER approximation instruction (28-bit precision) plus a Newton-Raphson iteration (a couple of FMAs + a mul) to double that precision, giving close to full double precision (53-bit significand, including 1 implicit bit). -xMIC-AVX512 enables AVX512ER, and sets tuning options so ICC will actually choose to use it.
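Schematically, the pattern ICC generates looks like this hand-written intrinsics sketch (for illustration only; you'd normally let the compiler do this):

```c++
#include <immintrin.h>

// a/b on KNL via AVX512ER: 28-bit-accurate reciprocal approximation plus
// one Newton-Raphson step, instead of the very slow full-precision vdivpd.
__m512d knl_div(__m512d a, __m512d b) {
    __m512d x = _mm512_rcp28_pd(b);                           // x ~ 1/b, ~2^-28 error
    // One N-R iteration roughly doubles the precision: x += x*(1 - b*x).
    __m512d e = _mm512_fnmadd_pd(b, x, _mm512_set1_pd(1.0));  // e = 1 - b*x
    x = _mm512_fmadd_pd(x, e, x);                             // now ~2^-56 error
    return _mm512_mul_pd(a, x);   // a * (1/b); close to, not exactly, a/b
}
```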
(By contrast, Coffee Lake's 256-bit AVX division throughput isn't any better than its 128-bit division throughput in doubles per cycle, but without AVX512ER there isn't an efficient way to use Newton-Raphson for double.) See Floating point division vs floating point multiplication - the Skylake numbers apply to your Coffee Lake.
AVX / AVX512 can avoid extra movaps instructions to copy registers, which helps a lot on KNL (every instruction that isn't a mul/add/FMA costs FP throughput, because it has 2-per-clock FMA but only 2-per-clock max instruction throughput). (https://agner.org/optimize/)
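Roughly (a minimal example of mine, shown scalar for clarity), the difference looks like this when both inputs stay live past an operation:

```c++
// x and y are both needed twice, so 2-operand SSE2 must copy one of them
// before overwriting it:
//     movapd xmm2, xmm0       ; copy x -- costs an issue slot on KNL
//     mulsd  xmm2, xmm1       ; xmm2 = x*y
//     addsd  xmm0, xmm1       ; xmm0 = x+y
//     addsd  xmm0, xmm2       ; result
// AVX's 3-operand non-destructive VEX encoding needs no copy:
//     vmulsd xmm2, xmm0, xmm1
//     vaddsd xmm0, xmm0, xmm1
//     vaddsd xmm0, xmm0, xmm2
double f(double x, double y) {
    return (x * y) + (x + y);
}
```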
KNL is based on the Silvermont low-power core (that's how they fit so many cores onto one chip).
By contrast, Coffee Lake has a much more capable front-end and back-end execution throughput: it still has 2-per-clock FMA/mul/add, but 4-per-clock total instruction throughput, so there's room to run some non-FMA instructions without taking away from FMA throughput.
KNL is built specifically to run AVX512 code. They didn't waste transistors making it efficient running legacy code that wasn't compiled specifically for it (with -xMIC-AVX512 or -march=knl).
But your Coffee Lake is a mainstream desktop/laptop core that has to be fast running any past or future binaries, including code that only uses "legacy" SSE2 encodings of instructions, not AVX.
SSE2 instructions that write an XMM register leave the upper elements of the corresponding YMM/ZMM register unmodified. (An XMM reg is the low 128 bits of the full vector reg.) This would in theory create a false dependency when running legacy SSE2 instructions on a CPU that supports wider vectors. (Mainstream Intel CPUs like Sandybridge-family avoid this with mode transitions, or on Skylake with actual false dependencies if you don't use vzeroupper properly. See Why is this SSE code 6 times slower without VZEROUPPER on Skylake? for a comparison of the 2 strategies.)
KNL does apparently have a way to avoid false dependencies: according to Agner Fog's testing (in his microarch guide), it's like the partial-register renaming that P6-family does when you write integer registers like AL: you only get a partial-register stall when you read the full register. If that's accurate, then SSE2 code should run OK on KNL, because there's no AVX code reading the YMM or ZMM registers.
(But if there were false dependencies, a movaps xmm0, [rdi] in a loop might have to wait until the last instruction to write xmm0 in the previous iteration finished. That would defeat KNL's modest out-of-order execution ability to overlap independent work across loop iterations and hide load + FP latency.)
There's also the possibility of decode stalls on KNL when running legacy SSE/SSE2 instructions: it stalls on instructions with more than 3 prefixes, including 0F escape bytes. So for example any SSSE3 or SSE4.x instruction with a REX prefix to access r8..r15 or xmm8..xmm15 will cause a decode stall of 5 to 6 cycles.
But you won't have that if you omitted all -x / -march options, because SSE1/SSE2 + REX is still fine. Just (optional REX) + 2 other prefixes for instructions like 66 0F 58 addpd.
See Agner Fog's microarch guide, in the KNL chapter: 16.2 instruction fetch and decoding.
OpenMP - if you're looking at OpenMP to use multiple threads, obviously KNL has many more cores (64 on the Xeon Phi 7210 vs. 4 on your i5-8259U).
But even within one physical core, KNL has 4-way hyperthreading as another way (besides out-of-order exec) to hide the high-ish latency of its SIMD instructions. For example, FMA/add/sub latency is 6 cycles on KNL vs. 4 on Skylake/Coffee Lake.
So breaking a problem up into multiple threads can sometimes significantly increase utilization of each individual core on KNL. But on a mainstream big-core CPU like Coffee Lake, its massive out-of-order execution capabilities can already find and exploit all the instruction-level parallelism in many loops, even if the loop body does a chain of things with each independent input.
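A sketch of how that typically looks for this kind of reduction (a generic pattern, not the course's exact solution): the composite construct splits iterations across threads and vectorizes each thread's chunk.

```c++
#include <omp.h>

// Threaded + vectorized midpoint-rule integration.
// On KNL, 2-4 threads per core help hide the 6-cycle FMA latency;
// on Coffee Lake, out-of-order exec already keeps one thread's FMAs busy.
double integrate_omp(double a, double b, long n) {
    const double h = (b - a) / n;
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        double x = a + (i + 0.5) * h;   // midpoint of subinterval i
        sum += x * x;                   // stand-in integrand
    }
    return sum * h;
}
```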