Given the following function
void foo(float* result, int size, float y, float delta) {
    for (int t = 0; t < size; ++t) {
        result[t] = y + delta * t;
    }
}
Clang with -O2 generates the following x86-64 assembly:
.LCPI0_0:
.long 0
.long 1
.long 2
.long 3
.LCPI0_1:
.long 4
.long 4
.long 4
.long 4
.LCPI0_2:
.long 65535
.long 65535
.long 65535
.long 65535
.LCPI0_3:
.long 1258291200
.long 1258291200
.long 1258291200
.long 1258291200
.LCPI0_4:
.long 1392508928
.long 1392508928
.long 1392508928
.long 1392508928
.LCPI0_5:
.long 0x53000080
.long 0x53000080
.long 0x53000080
.long 0x53000080
.LCPI0_6:
.long 8
.long 8
.long 8
.long 8
foo(float*, int, float, float):
test esi, esi
jle .LBB0_7
mov eax, esi
cmp esi, 7
ja .LBB0_3
xor ecx, ecx
jmp .LBB0_6
.LBB0_3:
mov ecx, eax
and ecx, 2147483640
movaps xmm2, xmm1
shufps xmm2, xmm1, 0
movaps xmm3, xmm0
shufps xmm3, xmm0, 0
mov edx, eax
shr edx, 3
and edx, 268435455
shl rdx, 5
movdqa xmm4, xmmword ptr [rip + .LCPI0_0]
xor esi, esi
movdqa xmm5, xmmword ptr [rip + .LCPI0_1]
movdqa xmm6, xmmword ptr [rip + .LCPI0_2]
movdqa xmm7, xmmword ptr [rip + .LCPI0_3]
movdqa xmm8, xmmword ptr [rip + .LCPI0_4]
movaps xmm9, xmmword ptr [rip + .LCPI0_5]
movdqa xmm10, xmmword ptr [rip + .LCPI0_6]
.LBB0_4:
movdqa xmm11, xmm4
paddd xmm11, xmm5
movdqa xmm12, xmm4
pand xmm12, xmm6
por xmm12, xmm7
movdqa xmm13, xmm4
psrld xmm13, 16
por xmm13, xmm8
subps xmm13, xmm9
addps xmm13, xmm12
movdqa xmm12, xmm11
pand xmm12, xmm6
por xmm12, xmm7
psrld xmm11, 16
por xmm11, xmm8
subps xmm11, xmm9
addps xmm11, xmm12
mulps xmm13, xmm2
addps xmm13, xmm3
mulps xmm11, xmm2
addps xmm11, xmm3
movups xmmword ptr [rdi + rsi], xmm13
movups xmmword ptr [rdi + rsi + 16], xmm11
paddd xmm4, xmm10
add rsi, 32
cmp rdx, rsi
jne .LBB0_4
cmp ecx, eax
je .LBB0_7
.LBB0_6:
xorps xmm2, xmm2
cvtsi2ss xmm2, ecx
mulss xmm2, xmm1
addss xmm2, xmm0
movss dword ptr [rdi + 4*rcx], xmm2
inc rcx
cmp rax, rcx
jne .LBB0_6
.LBB0_7:
ret
I'm trying to understand what is going on here. It seems like .LBB0_4 is a loop that covers 8 iterations of the original loop per pass (there are 2 mulps instructions, each covering 4 floats, and rsi is incremented by 32 bytes). The code at the end is probably there to cover the case where size is not divisible by 8. What I'm having trouble with is the rest of the code: what are all these other instructions inside the .LBB0_4 loop and the constants at the beginning doing? Is there a tool or a compiler argument that can help me understand the result of the SIMD vectorization? Maybe something that turns this back into C++ with SIMD intrinsics?
Also, if I change the code to this:
void foo(float* result, int size, float y, float delta) {
    for (int t = 0; t < size; ++t) {
        result[t] = y;
        y += delta;
    }
}
Clang generates much less assembly and loops over 16 values at once.
Edit: I just realized that this version is not vectorized at all, which is why it's smaller, and it's probably slower.
What is the fastest way to write this code?
As @user555045 pointed out, this is a regression in Clang 19. Clang 18 and earlier auto-vectorize in the obvious way, with the main loop using cvtdq2ps to convert 4 int32_t to 4 float.
If we look at Clang 19 output for other SIMD ISAs, like AArch64 and AVX-512 (Godbolt), in both cases it's using unsigned-to-float conversions, like AArch64 ucvtf or AVX-512 vcvtudq2ps.
The bithack stuff is how Clang vectorizes u32-to-float conversion for ISAs like SSE2 and AVX2, which only provide signed int to/from float conversions.
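To make the bithack concrete, here's my hand-translation of the u32-to-float part of one 4-lane half of the .LBB0_4 body back into SSE2 intrinsics (the function name is mine; the real loop does two of these per iteration and then the mulps/addps by delta and y):

#include <immintrin.h>

__m128 u32x4_to_f32x4(__m128i x) {
    const __m128i lo_mask = _mm_set1_epi32(0xFFFF);      // .LCPI0_2
    const __m128i lo_exp  = _mm_set1_epi32(0x4B000000);  // .LCPI0_3 = 1258291200, the float 2^23
    const __m128i hi_exp  = _mm_set1_epi32(0x53000000);  // .LCPI0_4 = 1392508928, the float 2^39
    const __m128  hi_bias = _mm_castsi128_ps(_mm_set1_epi32(0x53000080)); // .LCPI0_5 = 2^39 + 2^23

    // low 16 bits ORed into the mantissa of 2^23:  float value 2^23 + (x & 0xFFFF)
    __m128 lo = _mm_castsi128_ps(_mm_or_si128(_mm_and_si128(x, lo_mask), lo_exp));
    // high 16 bits ORed into the mantissa of 2^39: float value 2^39 + (x >> 16) * 2^16
    __m128 hi = _mm_castsi128_ps(_mm_or_si128(_mm_srli_epi32(x, 16), hi_exp));
    // (2^39 + hi*2^16) - (2^39 + 2^23) + (2^23 + lo) = hi*2^16 + lo = x, correctly rounded
    return _mm_add_ps(_mm_sub_ps(hi, hi_bias), lo);
}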
So Clang shot itself in the foot (for AVX2 and earlier) by proving that int t was always non-negative and replacing it with an unsigned temporary, forgetting that the conversion could also work as signed.
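In source terms, the transformed loop behaves roughly like the following (a hypothetical paraphrase of the transform, not actual Clang IR; the function name is mine):

void foo_as_if_unsigned(float* result, int size, float y, float delta) {
    unsigned n = size > 0 ? (unsigned)size : 0;
    for (unsigned t = 0; t < n; ++t)
        result[t] = y + delta * (float)t;   // u32 -> float: needs the bithack on SSE2/AVX2
}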
This is a bug you should report at https://github.com/llvm/llvm-project/issues/. Feel free to quote and/or link to this answer. Your C++ source is a good MCVE test case for a bug report. I don't see an existing duplicate when searching for "vectorization unsigned float" and similar terms.
The loop would run zero iterations if size were negative, and it's a < compare, so the loop always exits without encountering signed-overflow UB. (Signed overflow is UB anyway, so the compiler can just assume it doesn't happen, even with a t <= size condition or something like it.) This is what allows promoting int loop counters to pointer width so they can be used as array indices without redoing sign extension every time.
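In source terms, that widening amounts to something like this (a sketch of the assumed-equivalent form; the function name is mine):

#include <cstdint>

void foo_widened(float* result, int size, float y, float delta) {
    for (int64_t t = 0, n = size; t < n; ++t)   // counter widened to pointer width once, up front
        result[t] = y + delta * (float)t;       // no per-iteration sign extension for the index
}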
What is the fastest way to write this code?
See Why does this code execute more slowly after strength-reducing multiplications to loop-carried additions? - avoiding the convert and multiply (or FMA) every iteration can be a win, but only if you unroll enough to hide FP latency with multiple vector accumulators, like float y[UNROLL] = ...; with y[0] += UNROLL * delta; y[1] += UNROLL * delta; and so on each iteration, as in the sketch below.
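Fleshed out, that might look something like the following. This is a sketch under my own assumptions: the name foo_unrolled, UNROLL = 64 (aiming for 8 YMM vectors of accumulators with AVX2), and the tail handling are mine, so check the generated asm and benchmark.

void foo_unrolled(float* result, int size, float y, float delta) {
    constexpr int UNROLL = 64;              // tune: enough accumulators to hide FP-add latency
    float ys[UNROLL];
    for (int i = 0; i < UNROLL; ++i)
        ys[i] = y + delta * i;              // staggered start values
    const float step = delta * UNROLL;      // each slot advances UNROLL elements per pass

    int t = 0;
    for (; t + UNROLL <= size; t += UNROLL) {
        for (int i = 0; i < UNROLL; ++i) {  // independent FP-add dependency chains
            result[t + i] = ys[i];
            ys[i] += step;
        }
    }
    for (; t < size; ++t)                   // scalar cleanup for the remainder
        result[t] = y + delta * t;
}

(Repeated adds accumulate a bit more rounding error than recomputing y + delta * t each time; whether that matters depends on your use case.)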
With enough unrolling that the compiler uses 6, 8, or 10 vectors for elements of y[], that's barely enough for the latency × throughput product of vaddps on modern x86: typically 3 to 4 cycle latency with 2/clock throughput, so 6 to 8 adds in flight at once. Some slack allows for imperfect scheduling.
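As a back-of-the-envelope check (numbers assumed from the figures above; see https://uops.info/ for your exact CPU):

constexpr int vaddps_latency_cycles = 4;  // 3 on some cores
constexpr int vaddps_per_clock      = 2;  // execution throughput
constexpr int vectors_in_flight     = vaddps_latency_cycles * vaddps_per_clock;  // = 8 accumulators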
On Ice Lake or later Intel, and maybe Zen 3 or later AMD, this could achieve 2x 256-bit vaddps and 2x 256-bit vector stores per clock. (Skylake and earlier can only do one store per clock; see https://uops.info/. The linked Q&A was about a second-order polynomial, so it took 2 adds per store with the method of differences; my Skylake was able to do 2 adds and 1 store per clock with my code, and Ice Lake would have store bandwidth to spare with it. This of course assumes the destination is hot in L1d cache; otherwise you most likely just have a memory/cache bandwidth bottleneck, even for L2.)
On Intel E-cores (128-bit wide SIMD), the add to store throughput ratio is also 1:1, so again you'd want the unrolled version that just does SIMD FP adds.
Integer add, convert, and FMA is only 3 SIMD vector ALU uops per vector of results, so it could actually run at 1 vector result per clock, which is sufficient to keep up with store throughput on Skylake without strength reduction to just adds (if you compile with -march=x86-64-v3, of course, to allow FMA). But then the front end is a bottleneck, since it's only 4 uops wide on SKL and earlier, so loop overhead (at least one macro-fused add or sub/jcc, probably also a pointer increment) eats into throughput and doesn't let out-of-order exec get ahead.