I'm using Google Benchmark to benchmark the following functions:
int sumElements(int *arr, int count)
{
    int sum = 0;
    for (int i = 0; i < count; ++i) {
        sum += arr[i];
    }
    return sum;
}

int sumElementsUnrolled(int *arr, int count)
{
    int sumA = 0;
    int sumB = 0;
    for (int i = 0; i < count; i += 2) {
        sumA += arr[i];
        sumB += arr[i + 1];
    }
    return sumA + sumB;
}
Benchmark results:
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
sumElements                965 ns          963 ns       723250
sumElementsUnrolled        641 ns          641 ns      1091406
I benchmarked these functions with Google Benchmark and got a noticeable speedup from the unrolled version at -O0 and -O1. I learned in several places that this is likely due to breaking dependency chains, which, if I understand correctly, leads to fewer pipeline stalls while the CPU is executing my instructions. Looking at the generated assembly at -O1, this makes sense: the unrolled version uses different registers for each sum variable, so the dependency chain is broken.
Questions: perf stat shows a much higher rate of stalled-cycles-backend for the unrolled version, but I don't really know if that's related.

The benchmarking code is very simple:
#define SIZE 4096

int *initArray(int size)
{
    int *arr = (int*)malloc(size * sizeof(int));
    for (int i = 0; i < size; ++i) {
        arr[i] = rand();
    }
    return arr;
}

void doBenchmark(benchmark::State &s)
{
    int *arr = initArray(SIZE);
    for (auto _ : s) {
        int sum = sumElements(arr, SIZE);
        benchmark::DoNotOptimize(sum);
    }
    free(arr);
}
BENCHMARK(doBenchmark);
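A matching benchmark for the unrolled version is registered the same way; roughly (the name doBenchmarkUnrolled here is just illustrative):

void doBenchmarkUnrolled(benchmark::State &s)
{
    int *arr = initArray(SIZE);
    for (auto _ : s) {
        int sum = sumElementsUnrolled(arr, SIZE);
        benchmark::DoNotOptimize(sum);
    }
    free(arr);
}
BENCHMARK(doBenchmarkUnrolled);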
I tried identifying the cause of the speedup using perf stat, but I am not sure how to interpret the results.
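The kind of invocation I mean is roughly the following (the binary name sum_bench is just a placeholder; the filter restricts the run to one benchmark at a time):

perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend \
    ./sum_bench --benchmark_filter=doBenchmarkUnrolled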
TL:DR: as 273K said, at -O1 you did an optimization manually that the compiler wasn't doing for you, and at -O2 and -O3 your manually unrolled loop gets in the way of the compiler's own job (mainly because of auto-vectorization).
In the actual compiler-generated asm (not the C++ source), one of the tradeoffs of loop unrolling is a larger I-cache footprint. You can't measure that cost with a microbenchmark where only the unrolled loop runs; microbenchmarks in repeat loops make complicated implementations of loops like this or memcpy look good. The actual cost depends on I-cache pressure in whatever program this is part of.
Another cost is that you might have more cleanup work to do for odd count, if your function needs to be correct for all possible count values (unlike your test case). e.g. an unroll of 16 makes the worst-case cleanup 15 elements; depending on how you do the cleanup, that could be worse for smallish problems that also only run a few iterations of the unrolled loop.
Yes, at -O0, where store/reload (store-forwarding latency) bottlenecks are a huge problem, manual unrolling helps for that reason. (Benchmarking at -O0 is normally pointless; the bottlenecks are different from in normal cases. In this case I think you realize that, although you didn't use register int for any of your variables to let a -O0 build still keep them in registers.)
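For reference, a sketch of that register int variant (keep in mind register was removed as a storage-class specifier in C++17, so this only applies when compiling as C or an older C++ standard; the function name is mine):

// Sketch: the 'register' hint lets a -O0 build keep these in registers
// instead of storing/reloading them through the stack every iteration
// (GCC honors it at -O0).
int sumElementsUnrolledReg(int *arr, int count)
{
    register int sumA = 0;
    register int sumB = 0;
    for (register int i = 0; i < count; i += 2) {
        sumA += arr[i];
        sumB += arr[i + 1];
    }
    return sumA + sumB;
}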
And at -O1, it maybe allows two scalar adds per clock instead of 1, or at least shifts the bottleneck to front-end throughput instead of 1 loop iteration per clock bottlenecked on loop-branch throughput.
But at -O2 / -O3, when auto-vectorization is enabled for clang, manual unrolling usually makes compilers do worse when auto-vectorizing. e.g. they'll do stupid things like shuffle into vectors of odd/even elements to vectorize sumA and sumB separately. They're not smart, just complex pieces of machinery.
Optimal asm for this on x86-64 involves SSE2 paddd (or AVX2 vpaddd to do 32 bytes at once). Unrolling that loop by 2 can help if the data is hot in L1d cache, but that would mean a scalar unroll of 8 or 16. (And like I said, compilers aren't good at rolling up scalar code into SIMD, especially with multiple accumulators, although sometimes it works if the scalar unroll factor matches the vector width. I'd recommend keeping your loops simple.)
Unlike GCC, clang already does unroll tiny loops by default, including when auto-vectorizing. Sometimes with multiple accumulators.
Clang 16 -O2 (Godbolt) made this asm for the inner loop of the non-unrolled version. It's 6 uops (assuming macro-fusion of the cmp/jcc, which will happen on Intel and AMD CPUs, https://agner.org/optimize/), so Alder Lake can run it at one iteration per clock cycle. (Only doing 2 loads per clock out of the 3 it's capable of... https://uops.info/)
    ... compute a loop bound and zero some regs
.LBB0_5:                            # =>This Inner Loop Header: Depth=1
    movdqu  xmm2, xmmword ptr [rdi + rsi]
    paddd   xmm0, xmm2              # one vector accumulator
    movdqu  xmm2, xmmword ptr [rdi + rsi + 16]
    paddd   xmm1, xmm2              # second vector accumulator
    add     rsi, 32
    cmp     rax, rsi
    jne     .LBB0_5                 ; } while(i < max);

    # Then combine to one vector
    paddd   xmm1, xmm0
    # and horizontal sum down to scalar
    pshufd  xmm0, xmm1, 238         # xmm0 = xmm1[2,3,2,3]
    paddd   xmm0, xmm1
    pshufd  xmm1, xmm0, 85          # xmm1 = xmm0[1,1,1,1]
    paddd   xmm1, xmm0
    movd    eax, xmm1
    # sum in EAX for benchmark::DoNotOptimize(sum);
    cmp     rdx, rcx
    je      .LBB0_8                 ; benchmark repeat loop
Further unrolling, and/or using AVX with non-indexed addressing modes or aligned pointers to allow single-uop memory-source [v]paddd, could reduce overhead enough that we bottleneck on 2 or 3 loads + vector-adds per 4 or 5 uops, letting Haswell / Ice Lake / Zen bottleneck on L1d load throughput (SIMD vector integer add throughput is at least as high as vector load throughput in all those CPUs, and paddd xmm0, [rdi] is single-uop on all of them, but not vpaddd xmm0, xmm0, [rdi + rsi]).
But to get a compiler to generate further-unrolled asm, you'd want to use intrinsics like _mm_load_si128 (an aligned load, which can fold into a memory source operand for legacy-SSE paddd, unlike the _mm_loadu_si128 unaligned load). Or maybe use profile-guided optimization so the compiler knows this loop is truly hot and will optimize it even more aggressively.
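A hedged sketch of what that could look like with SSE2 intrinsics, assuming the pointer is 16-byte aligned and count is a multiple of 8 (the function name and the two-accumulator unroll are my own choices, not anything your compiler emitted):

#include <emmintrin.h>  // SSE2 intrinsics

// Sketch only: requires arr 16-byte aligned and count a multiple of 8.
int sumElementsSSE2(const int *arr, int count)
{
    __m128i sumA = _mm_setzero_si128();
    __m128i sumB = _mm_setzero_si128();
    for (int i = 0; i < count; i += 8) {
        // _mm_load_si128 (aligned) can fold into paddd's memory operand
        // when compiling without AVX; _mm_loadu_si128 can't.
        sumA = _mm_add_epi32(sumA, _mm_load_si128((const __m128i*)(arr + i)));
        sumB = _mm_add_epi32(sumB, _mm_load_si128((const __m128i*)(arr + i + 4)));
    }
    __m128i sum = _mm_add_epi32(sumA, sumB);
    // Horizontal sum of the 4 lanes down to one int.
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(1, 0, 3, 2)));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(sum);
}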
As I guessed, the asm for the manually-unrolled version involves shufps instructions. It loads 4 vectors and uses 4x shufps as a 2-input shuffle to get the odd vs. even elements from pairs of vectors, feeding 4x paddd instructions.
Also note that your unrolled version is only equivalent for even loop counts, since you didn't write cleanup code for the odd element. And i < count will read past the end of the array instead of stopping early. In the general case, i < (count - (unroll_factor-1)) will stop after the last "full" iteration, leaving some elements leftover instead of going too far. Notice how that expression simplifies to i < count for an unroll_factor of 1; that's helpful for remembering it.
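A minimal sketch of a version that's correct for any count, using that bound plus a scalar cleanup loop (the function name is mine):

int sumElementsUnrolledAnyCount(int *arr, int count)
{
    int sumA = 0;
    int sumB = 0;
    int i = 0;
    for (; i < count - 1; i += 2) {  // count - (unroll_factor - 1), with unroll_factor == 2
        sumA += arr[i];
        sumB += arr[i + 1];
    }
    for (; i < count; ++i) {         // cleanup: at most unroll_factor - 1 leftover elements
        sumA += arr[i];
    }
    return sumA + sumB;
}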
See also: How to remove "noise" from GCC/clang assembly output? and Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”