Background:
I am writing an SGEMM kernel and the performance was bad. Looking at the disassembly and the `perf` output, I found instructions like this:
...
| Time | Instruction
|------|-------------------------------------------------
| 0.46 | vmovaps ymm3,YMMWORD PTR [rdx]
| 2.13 | vmovaps ymm0,YMMWORD PTR [rdx+0x20]
| 0.16 | add rax,0x40
| 0.16 | add rdx,0x40
| 0.76 | vbroadcastss ymm2,DWORD PTR [rsi]
| 0.36 | vbroadcastss ymm1,DWORD PTR [r15]
| 0.56 | vbroadcastss ymm7,DWORD PTR [r14]
| 0.10 | vmovaps ymm5,ymm3
| 0.52 | vmovaps ymm13,ymm3
| 0.22 | vbroadcastss ymm4,DWORD PTR [r13+0x0]
| 2.42 | vfmadd213ps ymm13,ymm1,YMMWORD PTR [rax+0x1fc0]
| 0.40 | vfmadd213ps ymm5,ymm2,YMMWORD PTR [rax-0x40]
| 0.42 | vmovaps ymm9,ymm3
| 17.12| vfmadd213ps ymm2,ymm0,YMMWORD PTR [rax-0x20] <--
| 16.73| vfmadd213ps ymm1,ymm0,YMMWORD PTR [rax+0x1fe0] <--
...
Note that the last two `fma` instructions take up 30+% of the running time. My assumption is that these two instructions are slow because of their memory operands; the other `fma` instructions in the assembly are much faster.
I've tried different ways of writing this program. Sometimes the compiler generates `fma` with all three operands being `ymm` registers, while other times it generates `fma` with a memory operand.
Is there any way to reason about the code generation, or is it essentially unpredictable? Also, is there any way to force the compiler to generate code with three `ymm` registers without writing inline assembly? Intrinsics don't seem to help with that.
CPU: i5-6200u (should probably get a better one)
Compiler: gcc version 14.0.1 20240413 (experimental) (GCC)
Optimization flags: `g++ -O3 -march=native -ffast-math gemm.cpp -std=gnu++2a`
UPD 0: Added compiler information etc.
UPD 1: Fixed an error in a link.
Generally any good compiler will try to fold a load into a memory source operand for an ALU instruction if the load result is only used once. This saves front-end (decode and issue/rename) bandwidth since the load + FMA uops can stay micro-fused in most of the pipeline other than the scheduler and execution units.
If you're cache-blocking so the same vectors of row data get re-used with multiple broadcasted columns, you'd expect the compiler to load once into a register.
If there aren't enough registers available, it might favour reloading this way, instead of spilling/reloading a variable needed later which would increase the length of a dependency chain.
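For example, a register-blocked micro-kernel that visibly reuses each loaded vector of B against several broadcasted elements of A gives the compiler a reason to keep those vectors in `ymm` registers rather than reload them as memory operands. A minimal sketch with hypothetical names (not from your gemm.cpp), assuming row-major storage and AVX2+FMA:

```c++
#include <immintrin.h>

// 2x16 micro-kernel sketch: each B vector is loaded once per k-step and
// reused for every broadcasted A element, so keeping it in a register
// is clearly profitable.
void microkernel_2x16(const float* A, const float* B, float* C,
                      int k, int lda, int ldb, int ldc)
{
    __m256 c00 = _mm256_loadu_ps(C + 0*ldc + 0);
    __m256 c01 = _mm256_loadu_ps(C + 0*ldc + 8);
    __m256 c10 = _mm256_loadu_ps(C + 1*ldc + 0);
    __m256 c11 = _mm256_loadu_ps(C + 1*ldc + 8);

    for (int p = 0; p < k; ++p) {
        __m256 b0 = _mm256_loadu_ps(B + p*ldb + 0);   // loaded once...
        __m256 b1 = _mm256_loadu_ps(B + p*ldb + 8);

        __m256 a0 = _mm256_broadcast_ss(A + 0*lda + p);
        __m256 a1 = _mm256_broadcast_ss(A + 1*lda + p);

        c00 = _mm256_fmadd_ps(a0, b0, c00);           // ...reused here
        c01 = _mm256_fmadd_ps(a0, b1, c01);
        c10 = _mm256_fmadd_ps(a1, b0, c10);
        c11 = _mm256_fmadd_ps(a1, b1, c11);
    }

    _mm256_storeu_ps(C + 0*ldc + 0, c00);
    _mm256_storeu_ps(C + 0*ldc + 8, c01);
    _mm256_storeu_ps(C + 1*ldc + 0, c10);
    _mm256_storeu_ps(C + 1*ldc + 8, c11);
}
```

With more accumulators (a taller or wider tile) register pressure goes up, and that's when the compiler starts re-loading some operands from memory instead of keeping them in `ymm` registers.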
You haven't mentioned a language, but in GNU C you could use `asm("" : "+x"(foo))` as a black box for the optimizer to force the `__m256` variable `foo` to be in a register, with the `"+"` telling the compiler that it's modified by your asm template (which is the empty string, i.e. zero asm instructions).
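A minimal sketch of that trick (the `keep_in_reg` helper and the surrounding function are hypothetical names, not anything from your code):

```c++
#include <immintrin.h>

// Empty asm statement as an optimization barrier: GCC must materialize v
// in a ymm register here, so it can't fold the load into a memory operand
// of the following vfmadd*.
static inline __m256 keep_in_reg(__m256 v) {
    asm("" : "+x"(v));   // zero instructions; "+x" = read and written in an SSE/AVX register
    return v;
}

__m256 fma_reg_only(const float* bptr, __m256 a, __m256 acc) {
    __m256 b0 = _mm256_loadu_ps(bptr);
    b0 = keep_in_reg(b0);                  // force a separate vmovups load
    return _mm256_fmadd_ps(a, b0, acc);    // should compile to vfmadd with three ymm operands
}
```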
But even if you force the compiler to do a separate `vmovups` load, your `vfmadd` instructions will still get the blame from hardware performance counters like `cpu_clk_unhalted.thread` on my Skylake (which Linux `perf` uses for the `cycles` event on CPUs that have it). The HW PMU typically "blames" the instruction that was waiting for the slow result, not the instruction that was slow to produce it. (It has to pick one out of many that were in flight when the event counter rolled over.)
A well-tuned SGEMM should bottleneck on FMA execution-unit throughput, not memory, even for large matrices that don't all fit in L1d or L2 cache. This isn't easy to achieve.
So I guess you're trying to see whether it's the load or the actual FMA that's slow? Check how close you're coming to 2 FMA instructions per clock, or how much of the time the appropriate ports are busy vs. idle: on Intel, `uops_dispatched_port.port_0` and `uops_dispatched_port.port_1` for 256-bit FMAs. Check https://uops.info/ for port-number details.