c++ assembly clang x86-64 simd

Understanding Clang's SIMD optimization for multiplying a float by an int loop counter


Given the following function:

void foo(float* result, int size, float y, float delta) {
    for (int t = 0; t < size; ++t) {
        result[t] = y + delta * t;
    }
}

Clang with -O2 generates the following x86-64 assembly:

.LCPI0_0:
        .long   0
        .long   1
        .long   2
        .long   3
.LCPI0_1:
        .long   4
        .long   4
        .long   4
        .long   4
.LCPI0_2:
        .long   65535
        .long   65535
        .long   65535
        .long   65535
.LCPI0_3:
        .long   1258291200
        .long   1258291200
        .long   1258291200
        .long   1258291200
.LCPI0_4:
        .long   1392508928
        .long   1392508928
        .long   1392508928
        .long   1392508928
.LCPI0_5:
        .long   0x53000080
        .long   0x53000080
        .long   0x53000080
        .long   0x53000080
.LCPI0_6:
        .long   8
        .long   8
        .long   8
        .long   8
foo(float*, int, float, float):
        test    esi, esi
        jle     .LBB0_7
        mov     eax, esi
        cmp     esi, 7
        ja      .LBB0_3
        xor     ecx, ecx
        jmp     .LBB0_6
.LBB0_3:
        mov     ecx, eax
        and     ecx, 2147483640
        movaps  xmm2, xmm1
        shufps  xmm2, xmm1, 0
        movaps  xmm3, xmm0
        shufps  xmm3, xmm0, 0
        mov     edx, eax
        shr     edx, 3
        and     edx, 268435455
        shl     rdx, 5
        movdqa  xmm4, xmmword ptr [rip + .LCPI0_0]
        xor     esi, esi
        movdqa  xmm5, xmmword ptr [rip + .LCPI0_1]
        movdqa  xmm6, xmmword ptr [rip + .LCPI0_2]
        movdqa  xmm7, xmmword ptr [rip + .LCPI0_3]
        movdqa  xmm8, xmmword ptr [rip + .LCPI0_4]
        movaps  xmm9, xmmword ptr [rip + .LCPI0_5]
        movdqa  xmm10, xmmword ptr [rip + .LCPI0_6]
.LBB0_4:
        movdqa  xmm11, xmm4
        paddd   xmm11, xmm5
        movdqa  xmm12, xmm4
        pand    xmm12, xmm6
        por     xmm12, xmm7
        movdqa  xmm13, xmm4
        psrld   xmm13, 16
        por     xmm13, xmm8
        subps   xmm13, xmm9
        addps   xmm13, xmm12
        movdqa  xmm12, xmm11
        pand    xmm12, xmm6
        por     xmm12, xmm7
        psrld   xmm11, 16
        por     xmm11, xmm8
        subps   xmm11, xmm9
        addps   xmm11, xmm12
        mulps   xmm13, xmm2
        addps   xmm13, xmm3
        mulps   xmm11, xmm2
        addps   xmm11, xmm3
        movups  xmmword ptr [rdi + rsi], xmm13
        movups  xmmword ptr [rdi + rsi + 16], xmm11
        paddd   xmm4, xmm10
        add     rsi, 32
        cmp     rdx, rsi
        jne     .LBB0_4
        cmp     ecx, eax
        je      .LBB0_7
.LBB0_6:
        xorps   xmm2, xmm2
        cvtsi2ss        xmm2, ecx
        mulss   xmm2, xmm1
        addss   xmm2, xmm0
        movss   dword ptr [rdi + 4*rcx], xmm2
        inc     rcx
        cmp     rax, rcx
        jne     .LBB0_6
.LBB0_7:
        ret

I'm trying to understand what is going on here. It seems like .LBB0_4 is a loop that covers 8 iterations of the original loop per iteration (there are 2 mulps instructions, each covering 4 floats, and rsi is incremented by 32). The code at the end is probably there to cover the case where size is not divisible by 8. What I'm having trouble with is the rest of the code: what are all these other instructions inside the .LBB0_4 loop and the constants at the beginning doing?

Is there a tool or a compiler argument that can help me understand the result of the SIMD vectorization? Maybe something that turns this back into C++ with SIMD intrinsics?

Also, if I change the code to this:

void foo(float* result, int size, float y, float delta) {
    for (int t = 0; t < size; ++t) {
        result[t] = y;
        y += delta;
    }
}

Clang generates much less assembly and loops over 16 values at once.

Edit: I just realized that this version is not vectorized at all, which is why it's smaller, and it's probably slower.

What is the fastest way to write this code?


Solution

  • As @user555045 pointed out, this is a regression in Clang 19. Clang 18 and earlier auto-vectorize in the obvious way, with the main loop using cvtdq2ps to convert 4 int32_t to 4 float.
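
    In intrinsics, that "obvious way" looks roughly like the sketch below (my own code with made-up names, not Clang 18's exact output; it also covers the "turn it back into C++ with intrinsics" part of the question):

    #include <immintrin.h>

    // Hand-written SSE2 version: signed int32 -> float with cvtdq2ps
    // (_mm_cvtepi32_ps), 4 elements per iteration, scalar cleanup at the end.
    void foo_sse2(float* result, int size, float y, float delta) {
        __m128  vy     = _mm_set1_ps(y);
        __m128  vdelta = _mm_set1_ps(delta);
        __m128i vt     = _mm_setr_epi32(0, 1, 2, 3);  // counters t, t+1, t+2, t+3
        __m128i vstep  = _mm_set1_epi32(4);
        int t = 0;
        for (; t + 4 <= size; t += 4) {
            __m128 tf = _mm_cvtepi32_ps(vt);          // cvtdq2ps
            __m128 v  = _mm_add_ps(vy, _mm_mul_ps(vdelta, tf));
            _mm_storeu_ps(result + t, v);
            vt = _mm_add_epi32(vt, vstep);
        }
        for (; t < size; ++t)                         // leftover elements
            result[t] = y + delta * t;
    }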

    If we look at Clang 19 output for other SIMD ISAs, like AArch64 and AVX-512 (Godbolt), in both cases it's using unsigned-to-float conversions, like AArch64 ucvtf or AVX-512 vcvtudq2ps.

    The bithack stuff is how clang vectorizes u32-to-float conversion for ISAs like SSE2 and AVX2, which only provide signed int to/from float.
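
    In scalar C++, that bithack is roughly the following (my reconstruction from the constants; bit_as_float and u32_to_float_bithack are names I made up):

    #include <cstdint>
    #include <cstring>

    // Reinterpret the bits of a uint32_t as a float. (In the asm this is free:
    // the vector registers are simply used as floats after the integer ops.)
    static float bit_as_float(uint32_t bits) {
        float f;
        std::memcpy(&f, &bits, sizeof f);
        return f;
    }

    float u32_to_float_bithack(uint32_t x) {
        uint32_t lo = x & 0xFFFF;                    // pand with .LCPI0_2
        uint32_t hi = x >> 16;                       // psrld by 16
        // 0x4B000000 is 2^23 as a float; OR-ing lo into the mantissa gives 2^23 + lo.
        float f_lo = bit_as_float(lo | 0x4B000000u); // por with .LCPI0_3 (1258291200)
        // 0x53000000 is 2^39; OR-ing hi into the mantissa gives 2^39 + hi * 2^16.
        float f_hi = bit_as_float(hi | 0x53000000u); // por with .LCPI0_4 (1392508928)
        // Subtract 2^39 + 2^23 (0x53000080) to cancel both bias terms, then recombine.
        return (f_hi - bit_as_float(0x53000080u)) + f_lo;  // subps/addps with .LCPI0_5
    }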

    So it shot itself in the foot (for AVX2 and earlier) by proving that int t was always non-negative and replacing it with an unsigned temporary, forgetting that a value known to be non-negative can just as well be converted as signed.

    This is a bug you should report on https://github.com/llvm/llvm-project/issues/. Feel free to quote and/or link to this answer. Your C++ source is a good MCVE test-case for a bug report. I don't see an existing duplicate when searching for "vectorization unsigned float" and similar terms.


    The loop would run zero iterations if size were negative, and it's a < compare, so it always leaves the loop without encountering signed-overflow UB. Signed overflow is UB anyway, so the compiler can just assume it doesn't happen, even with a t <= size condition or something like that. This is what allows promoting int loop counters to pointer width so they can be used as array indices without redoing sign-extension every time.
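
    In effect, the optimizer is allowed to treat the loop as if you had written it with a pointer-width counter, something like the following (an illustration of the reasoning, not Clang's actual IR; foo_widened is a made-up name):

    #include <cstdint>

    // Since signed overflow of t would be UB, t can never wrap, so it can be
    // widened to 64 bits up front and used directly as an array index,
    // with no per-iteration sign-extension.
    void foo_widened(float* result, int size, float y, float delta) {
        for (int64_t t = 0; t < size; ++t)
            result[t] = y + delta * static_cast<float>(t);
    }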


    What is the fastest way to write this code?

    See Why does this code execute more slowly after strength-reducing multiplications to loop-carried additions? - avoiding the convert and multiply (or FMA) every iteration can be a win, but only if you unroll enough to hide FP latency with multiple vector accumulators, e.g. float y[UNROLL] = ...; then y[0] += UNROLL * delta; y[1] += UNROLL * delta; and so on every iteration (see the sketch below).
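
    A sketch of that idea (my names: the y[UNROLL] array from above is called acc here to avoid clashing with the y parameter, and UNROLL should be tuned so the accumulators span roughly 6 to 10 SIMD vectors, as discussed below):

    // Strength-reduced version: one FP add per element instead of convert + mul/FMA.
    void foo_strength_reduced(float* result, int size, float y, float delta) {
        constexpr int UNROLL = 64;             // e.g. 8 YMM vectors of 8 floats with AVX
        float acc[UNROLL];
        for (int i = 0; i < UNROLL; ++i)
            acc[i] = y + delta * i;            // staggered starting values
        const float step = delta * UNROLL;     // each accumulator advances by UNROLL*delta

        int t = 0;
        for (; t + UNROLL <= size; t += UNROLL) {
            for (int i = 0; i < UNROLL; ++i) { // intended to vectorize to pure adds + stores
                result[t + i] = acc[i];
                acc[i] += step;
            }
        }
        for (; t < size; ++t)                  // scalar cleanup for the remainder
            result[t] = y + delta * t;
    }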

    With enough unrolling that the compiler uses 6, 8, or 10 vectors for elements of y[], that's barely enough for the latency × throughput product of vaddps on modern x86: typically 3 to 4 cycle latency with 2/clock throughput, so 6 to 8 vectors in flight at once. Some slack allows for imperfect scheduling.

    On Ice Lake or later Intel, and maybe Zen 3 or later AMD, this could achieve 2x 256-bit vaddps and 2x 256-bit vector stores per clock. (Skylake and earlier can only do one store per clock: https://uops.info/. The linked Q&A was about a second-order polynomial, so with the method of differences it took 2 adds per store; my Skylake was able to do 2 adds and 1 store per clock with my code, and Ice Lake would have store bandwidth to spare with it. This of course assumes the destination is hot in L1d cache; otherwise you most likely just have a memory/cache bandwidth bottleneck, even for L2.)

    On Intel E-cores (128-bit wide SIMD), the add to store throughput ratio is also 1:1, so again you'd want the unrolled version that just does SIMD FP adds.

    Integer add, convert, and FMA is only 3 SIMD vector ALU uops per clock, so could actually run at 1 vector result per clock, which is sufficient to keep up with the store throughput on Skylake without strength-reduction to just adds. (If you compile with -march=x86-64-v3 of course, to allow FMA). But then the front-end is a bottleneck since its only 4 uops wide on SKL and earlier. So loop overhead (at least one macro-fused add or sub/jcc, probably also a pointer increment) eats into throughput, and doesn't let out-of-order exec get ahead.