Why does the compiler think it's profitable to load eight elements from one buffer, load 32 elements from the other buffer, do some calculations on eight elements of each buffer, and then load eight elements from the first buffer again instead of loading 32 elements from both buffers upfront?
Consider the following code snippet, which uses a SIMD library (Google Highway):
#include <hwy/highway.h>
namespace hn = hwy::HWY_NAMESPACE;
using T = float;
void MulAddLoop(const T* HWY_RESTRICT mul_array,
                const T* HWY_RESTRICT add_array,
                const size_t size, T* HWY_RESTRICT x_array) {
  const hn::ScalableTag<T> d;
  for (size_t i = 0; i < size; i += hn::Lanes(d)) {
    const auto mul = hn::Load(d, mul_array + i);
    const auto add = hn::Load(d, add_array + i);
    auto x = hn::Load(d, x_array + i);
    x = hn::MulAdd(mul, x, add);
    hn::Store(x, d, x_array + i);
  }
}
The assembly code associated with the main loop is:
vmovaps ymm0, ymmword ptr [rdi + 4*rdx]
vmovaps ymm1, ymmword ptr [rcx + 4*rdx]
vmovaps ymm2, ymmword ptr [rcx + 4*rdx + 32]
vmovaps ymm3, ymmword ptr [rcx + 4*rdx + 64]
vmovaps ymm4, ymmword ptr [rcx + 4*rdx + 96]
vfmadd213ps ymm1, ymm0, ymmword ptr [rsi + 4*rdx] # ymm1 = (ymm0 * ymm1) + mem
vmovaps ymmword ptr [rcx + 4*rdx], ymm1
vmovaps ymm0, ymmword ptr [rdi + 4*rdx + 32]
vfmadd213ps ymm0, ymm2, ymmword ptr [rsi + 4*rdx + 32] # ymm0 = (ymm2 * ymm0) + mem
vmovaps ymmword ptr [rcx + 4*rdx + 32], ymm0
vmovaps ymm0, ymmword ptr [rdi + 4*rdx + 64]
vfmadd213ps ymm0, ymm3, ymmword ptr [rsi + 4*rdx + 64] # ymm0 = (ymm3 * ymm0) + mem
vmovaps ymmword ptr [rcx + 4*rdx + 64], ymm0
vmovaps ymm0, ymmword ptr [rdi + 4*rdx + 96]
vfmadd213ps ymm0, ymm4, ymmword ptr [rsi + 4*rdx + 96] # ymm0 = (ymm4 * ymm0) + mem
vmovaps ymmword ptr [rcx + 4*rdx + 96], ymm0
add rdx, 32
add r8, -4
jne .LBB0_8
See the entire example on Godbolt (taken from the highway project).
It seems to me that the compiler decided to unroll four iterations of the loop and to schedule them as described above: one vector from mul_array, then four vectors from x_array, then the FMAs, stores and remaining mul_array loads interleaved.
If I had done the loop unrolling, I would have loaded 32 elements from each of mul_array and add_array. Why did the compiler conclude that its schedule was optimal? Are there too few registers for that? I find this particularly intriguing, as the Intel Intrinsics Guide states a reciprocal throughput (CPI) of 0.5 or 0.33 for the memory loads, so I would have tried to load two or three vectors from one buffer at a time.
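For example, something like this (a hand-written sketch, not compiler output, reusing the compiler's register assignments of rdi = mul_array, rsi = add_array, rcx = x_array, and folding the x loads into the FMAs as memory operands):
# hypothetical schedule, not from the compiler: all mul_array and add_array loads first
vmovaps ymm0, ymmword ptr [rdi + 4*rdx]          # four vectors (32 floats) from mul_array
vmovaps ymm1, ymmword ptr [rdi + 4*rdx + 32]
vmovaps ymm2, ymmword ptr [rdi + 4*rdx + 64]
vmovaps ymm3, ymmword ptr [rdi + 4*rdx + 96]
vmovaps ymm4, ymmword ptr [rsi + 4*rdx]          # four vectors from add_array
vmovaps ymm5, ymmword ptr [rsi + 4*rdx + 32]
vmovaps ymm6, ymmword ptr [rsi + 4*rdx + 64]
vmovaps ymm7, ymmword ptr [rsi + 4*rdx + 96]
vfmadd231ps ymm4, ymm0, ymmword ptr [rcx + 4*rdx]       # ymm4 = (ymm0 * mem) + ymm4
vfmadd231ps ymm5, ymm1, ymmword ptr [rcx + 4*rdx + 32]  # ymm5 = (ymm1 * mem) + ymm5
vfmadd231ps ymm6, ymm2, ymmword ptr [rcx + 4*rdx + 64]  # ymm6 = (ymm2 * mem) + ymm6
vfmadd231ps ymm7, ymm3, ymmword ptr [rcx + 4*rdx + 96]  # ymm7 = (ymm3 * mem) + ymm7
vmovaps ymmword ptr [rcx + 4*rdx], ymm4
vmovaps ymmword ptr [rcx + 4*rdx + 32], ymm5
vmovaps ymmword ptr [rcx + 4*rdx + 64], ymm6
vmovaps ymmword ptr [rcx + 4*rdx + 96], ymm7
add rdx, 32
add r8, -4
jne .LBB0_8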
Seems like reasonable instruction scheduling to me. Getting at least one load from each of the two read-only arrays started as early as possible gets a head start on resolving cache and TLB misses on them.
Scheduling the first FMA (and the load from x) before doing all the loads means there's some work available to start on if the ROB (reorder buffer) and/or the scheduler (which holds un-executed uops) is nearly full of stalled work independent of this loop.
Intel P-cores use a mostly-unified scheduler, but E-cores, and AMD, use separate scheduling queues for each execution port. This is probably not significant, since even putting all the pure loads from one iteration first, before all the FMAs+loads and then all the stores, would still only be a small fraction of the sizes of the scheduling and non-scheduling queues in the CPUs -mtune=generic is tuning for. And this is -march=haswell code-gen anyway, where the scheduler was fully unified; any kind of uop can go in any entry.
The memory source operands for the FMAs are also the first access to add_array. IDK if clang is intentionally delaying that until after some of the loads from the other arrays (perhaps to reduce cache conflict misses?), or placing it just before the stores back to x, or if that's just how it worked out.
But there are a few cycles of load latency between dispatching a load uop to an execution unit and its result being ready for use, so scheduling the loads several instructions ahead of the instruction that consumes their result is generally good: it means out-of-order exec doesn't have to work as hard (or has more capacity to spare to hide actual cache misses and to overlap this work with surrounding code). With 3 loads + 1 store per FMA, memory access is going to be the bottleneck even in ideal conditions, even more so than for a dot product (2 loads per FMA, with dependency chains between the FMAs).
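To put rough numbers on that (my arithmetic, assuming Haswell/Skylake-style port counts: 2 load ports, 2 FMA ports, 1 store-data port):
3 loads / 2 load ports       = 1.5 cycles minimum per FMA, just for the loads
1 store / 1 store-data port  = 1.0 cycle per FMA
1 FMA   / 2 FMA ports        = 0.5 cycles per FMA
And if the store-address uop can't use port 7, it competes with the loads on the two load/AGU ports, pushing the memory side toward 2 cycles per FMA. Either way the memory pipes, not the FMA units, set the floor.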
As @harold points out, clang is only using YMM0..5 here. It's compiling for x86-64 System V, not Windows x64, so it might just be a coincidence that it stays out of XMM6-15, which are call-preserved in Windows x64 (the low 16 bytes of any of them would have to be saved/restored on the stack if used). Or its heuristics might favour economizing on the number of registers, since that can be useful if you need others for constants, or if you are compiling for Windows.
So yeah, if writing by hand, I'd probably be inclined to start with a pure load from each of two of the three arrays (since FMA can use its memory source operand as a multiplicand or as the addend; that's what the 132 vs. 213 vs. 231 stuff is about). Then four memory-source FMAs, then four stores (see the sketch near the end of this answer).
Doing loads before stores means memory disambiguation doesn't have to work as hard, trying to predict whether a load is reloading a recent store or not. (There is dynamic prediction for this; a misprediction leads to a machine_clears.memory_ordering event.)
Also, I'd have avoided an indexed addressing mode for the FMA memory source and the store; that stops Intel P-cores from micro-fusing the load+FMA into a single uop for the front-end (Micro fusion and addressing modes), and it stops Skylake and earlier from using the port-7 store AGU, limiting them to 2 load + store-address uops executed per clock. (https://agner.org/optimize/ and https://uops.info/)
Clang and GCC seem to be totally unaware of un-lamination (indexed addressing modes not staying micro-fused after issue/rename), doing big unrolls while still spending 2 front-end uops per instruction instead of using the unroll to amortize the cost of an extra pointer increment.
MSVC knows the trick of subtracting pointers so one array can use a non-indexed addressing mode, but doesn't know how to take advantage of it: it often uses the non-indexed addressing mode for a pure load, which is always a single uop anyway. The trick is sub rsi, rcx / sub rdi, rcx before the loop, and pointer-incrementing rcx inside the loop with add rcx, 128 (or sub rcx, -128 to allow an imm8). Then x elements can be referenced as [rcx] or [rcx + 32] or whatever, and mul_array or add_array elements as [rsi + rcx] or [rdi + rcx].
This allows the stores and FMA source operands to be one-register addressing modes, saving code-size for both of them, and more importantly saving front-end uops on Intel and allowing use of the port-7 AGU on Haswell through Skylake. (Ice Lake and later have two full-featured store AGU ports, and store-address uops never run on load ports, but un-lamination is still a thing. https://uica.uops.info/ correctly models it.)
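Putting that together, a hand-written sketch of the unrolled body (untested; register choices are mine, mirroring the compiler's rdi = mul_array, rsi = add_array, rcx = x_array): pure loads from the two read-only arrays, then memory-source FMAs reading x through non-indexed addressing, then one-register stores:
# hand-written sketch, not compiler output
# before the loop: rcx = x_array, rdi = mul_array, rsi = add_array, r8 = trip count (counts down by 4)
sub rdi, rcx                                   # rdi = mul_array - x_array
sub rsi, rcx                                   # rsi = add_array - x_array
.loop:
vmovaps ymm0, ymmword ptr [rdi + rcx]          # pure loads are 1 uop each even with an indexed mode
vmovaps ymm1, ymmword ptr [rdi + rcx + 32]
vmovaps ymm2, ymmword ptr [rdi + rcx + 64]
vmovaps ymm3, ymmword ptr [rdi + rcx + 96]
vmovaps ymm4, ymmword ptr [rsi + rcx]
vmovaps ymm5, ymmword ptr [rsi + rcx + 32]
vmovaps ymm6, ymmword ptr [rsi + rcx + 64]
vmovaps ymm7, ymmword ptr [rsi + rcx + 96]
vfmadd231ps ymm4, ymm0, ymmword ptr [rcx]      # ymm4 = (ymm0 * mem) + ymm4; non-indexed source stays micro-fused
vfmadd231ps ymm5, ymm1, ymmword ptr [rcx + 32]
vfmadd231ps ymm6, ymm2, ymmword ptr [rcx + 64]
vfmadd231ps ymm7, ymm3, ymmword ptr [rcx + 96]
vmovaps ymmword ptr [rcx], ymm4                # one-register stores can use the port-7 AGU on HSW-SKL
vmovaps ymmword ptr [rcx + 32], ymm5
vmovaps ymmword ptr [rcx + 64], ymm6
vmovaps ymmword ptr [rcx + 96], ymm7
sub rcx, -128                                  # advance the x pointer by 4 vectors of floats; -128 fits in an imm8
add r8, -4
jne .loop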
So yes there's a significant missed-optimization bug here, but not in the instruction scheduling (ordering).