c++, performance, benchmarking, loop-unrolling, google-benchmark

Measuring the tradeoff of loop unrolling


I'm using Google Benchmark to compare the following functions:

int sumElements(int *arr, int count)
{
    int sum = 0;
    for (int i = 0; i < count; ++i) {
        sum += arr[i];
    }
    return sum;
}

int sumElementsUnrolled(int *arr, int count)
{
    int sumA = 0;
    int sumB = 0;
    for (int i = 0; i < count; i += 2) {
        sumA += arr[i];
        sumB += arr[i + 1];
    }
    return sumA + sumB;
}

Benchmark results:


--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
sumElements                965 ns          963 ns       723250
sumElementsUnrolled        641 ns          641 ns      1091406

I benchmarked these functions with Google Benchmark and got a noticeable speedup from unrolling at -O0 and -O1. I've read in several places that this is likely due to breaking the dependency chain, which, if I understand correctly, leads to fewer pipeline stalls while the CPU executes my instructions. Looking at the generated assembly at -O1, this makes sense: the unrolled version uses a different register for each sum variable, so the dependency chain is broken.

Questions:

  1. Is this explanation correct, or am I way off?
  2. If so, how do I measure pipeline stalls here? Running perf stat shows a much higher rate of stalled-cycles-backend for the unrolled version, but I don't really know if that's related.
  3. If not, then what is the cause of the speedup? What does breaking a dependency chain do at the CPU level?

The benchmarking code is very simple:

#include <benchmark/benchmark.h>
#include <cstdlib>   // malloc, free, rand

#define SIZE 4096

int *initArray(int size)
{
    int *arr = (int*)malloc(size * sizeof(int));
    for (int i = 0; i < size; ++i) {
        arr[i] = rand();
    }
    return arr;
}

void doBenchmark(benchmark::State &s)
{
    int *arr = initArray(SIZE);
    for (auto _ : s) {
        int sum = sumElements(arr, SIZE);
        benchmark::DoNotOptimize(sum);
    }
    free(arr);
}


BENCHMARK(doBenchmark);
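
The unrolled version is measured the same way, with an identical wrapper; roughly:

void doBenchmarkUnrolled(benchmark::State &s)
{
    int *arr = initArray(SIZE);
    for (auto _ : s) {
        int sum = sumElementsUnrolled(arr, SIZE);
        benchmark::DoNotOptimize(sum);
    }
    free(arr);
}

BENCHMARK(doBenchmarkUnrolled);

BENCHMARK_MAIN();   // provides main(); or link against the benchmark_main library instead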

I tried to identify the cause of the speedup using perf stat, but I'm not sure how to interpret the results.


Solution

  • TL;DR: as 273K said, your manual optimization helps at -O1, but the manually unrolled loop gets in the way of the compiler's job at -O2 and -O3. (Mainly because of auto-vectorization.)


    In the actual compiler-generated asm (not the C++ source), one of the tradeoffs of loop unrolling is larger I-cache footprint. You can't measure the cost of that with a microbenchmark where only the unrolled loop runs; microbenchmarks in repeat loops make complicated implementations of loops like this or memcpy look good. The actual cost depends on I-cache pressure in whatever program this is part of.

    Another cost is the extra cleanup work you might need when count isn't a multiple of the unroll factor, if your function needs to be correct for all possible count values (unlike your test case). e.g. an unroll of 16 makes the worst-case cleanup 15 elements; depending on how you do the cleanup, that could be worse for smallish problems that also only run a few iterations of the unrolled loop.


    Yes, at -O0, where store/reload (store-forwarding latency) bottlenecks are a huge problem, manual unrolling helps for that reason. (Benchmarking at -O0 is normally pointless; the bottlenecks are different from those in normal optimized builds. In this case I think you realize that, although you didn't use register int for any of your variables to let a -O0 build still keep them in registers.)
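
    (A sketch of that -O0-only experiment, for reference; this function isn't in the question, and register is only a hint that was deprecated and then removed in C++17, but it can let a -O0 build keep sum and i in registers:)

    // Hypothetical -O0 experiment only; pointless in optimized builds.
    int sumElementsRegHint(int *arr, int count)
    {
        register int sum = 0;   // avoid the store/reload of sum on every iteration at -O0
        for (register int i = 0; i < count; ++i) {
            sum += arr[i];
        }
        return sum;
    }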

    And at -O1, unrolling maybe allows two scalar adds per clock instead of 1, or at least shifts the bottleneck to front-end throughput instead of one loop iteration per clock bottlenecked on loop-branch throughput.

    But at -O2 / -O3 when auto-vectorization is enabled for clang, manual unrolling usually makes compilers do worse when auto-vectorizing. e.g. they'll do stupid things like shuffle into vectors of odd/even elements to vectorize sumA and sumB separately. They're not smart, just complex pieces of machinery.

    Optimal asm for this on x86-64 involves SSE2 paddd (or AVX2 vpaddd to do 32 bytes at once.) Unrolling that loop by 2 can help if the data is hot in L1d cache, but that would mean a scalar unroll of 8 or 16. (And like I said, compilers aren't good at rolling up scalar code into SIMD, especially with multiple accumulators, although sometimes it works if the scalar unroll factor matches the vector width. I'd recommend keeping your loops simple.)

    Unlike GCC, clang already does unroll tiny loops by default, including when auto-vectorizing. Sometimes with multiple accumulators.

    Clang 16 -O2 (Godbolt) made this asm for the inner loop of the non-unrolled version. It's 6 uops (assuming macro-fusion of the cmp/jcc, which will happen on Intel and AMD CPUs, https://agner.org/optimize/), so Alder Lake can run it at one iteration per clock cycle. (Only doing 2 loads per clock out of the 3 it's capable of... https://uops.info/)

      ... compute a loop bound and zero some regs
    .LBB0_5:                                # =>This Inner Loop Header: Depth=1
            movdqu  xmm2, xmmword ptr [rdi + rsi]
            paddd   xmm0, xmm2                         # one vector accumulator
            movdqu  xmm2, xmmword ptr [rdi + rsi + 16]
            paddd   xmm1, xmm2                         # second vector accumulator
            add     rsi, 32
            cmp     rax, rsi
            jne     .LBB0_5             ; } while(i < max);
    
    # Then combine to one vector
            paddd   xmm1, xmm0
    # and horizontal sum down to scalar
            pshufd  xmm0, xmm1, 238                 # xmm0 = xmm1[2,3,2,3]
            paddd   xmm0, xmm1
            pshufd  xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
            paddd   xmm1, xmm0
            movd    eax, xmm1
      # sum in EAX  for   benchmark::DoNotOptimize(sum);
    
            cmp     rdx, rcx
            je      .LBB0_8     ; benchmark repeat loop
    

    Further unrolling, and/or using AVX with non-indexed addressing modes or aligned pointers to allow single-uop memory-source [v]paddd, could reduce overhead enough that we bottleneck on 2 or 3 loads + vector-adds per 4 or 5 uops, letting Haswell / Ice Lake / Zen bottleneck on L1d load throughput (SIMD vector integer add throughput is at least as high as vector load throughput in all those CPUs, and paddd xmm0, [rdi] is single-uop on all of them, but not vpaddd xmm0, xmm0, [rdi + rsi]).

    See also What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

    But to get a compiler to generate further-unrolled asm, you'd want to use intrinsics like _mm_load_si128 (aligned load, can fold into a memory source operand for legacy-SSE paddd, unlike _mm_loadu_si128 unaligned load). Or maybe use profile-guided optimization so the compiler knows this loop is truly hot and will optimize it even more aggressively.
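
    For example, a further-unrolled intrinsics version might look something like this sketch (mine, not the code from the Godbolt link; it assumes count is a multiple of 16 and that arr is 16-byte aligned, and the function name is made up):

    #include <emmintrin.h>   // SSE2 intrinsics: _mm_load_si128, _mm_add_epi32, ...

    int sumElementsSse2(const int *arr, int count)
    {
        // Scalar unroll of 16 = four 16-byte vectors per iteration,
        // with four independent accumulators to hide paddd latency.
        __m128i s0 = _mm_setzero_si128();
        __m128i s1 = _mm_setzero_si128();
        __m128i s2 = _mm_setzero_si128();
        __m128i s3 = _mm_setzero_si128();
        for (int i = 0; i < count; i += 16) {
            s0 = _mm_add_epi32(s0, _mm_load_si128((const __m128i*)(arr + i)));
            s1 = _mm_add_epi32(s1, _mm_load_si128((const __m128i*)(arr + i + 4)));
            s2 = _mm_add_epi32(s2, _mm_load_si128((const __m128i*)(arr + i + 8)));
            s3 = _mm_add_epi32(s3, _mm_load_si128((const __m128i*)(arr + i + 12)));
        }
        // Reduce 4 accumulators to 1 vector, then horizontal-sum to scalar.
        __m128i s = _mm_add_epi32(_mm_add_epi32(s0, s1), _mm_add_epi32(s2, s3));
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));  // high half + low half
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));  // element 1 + element 0
        return _mm_cvtsi128_si32(s);
    }

    (glibc's malloc happens to return 16-byte-aligned memory on x86-64, but aligned_alloc or alignas makes that assumption explicit; and you'd still need scalar cleanup for counts that aren't a multiple of 16.)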


    As I guessed, the asm for the manually-unrolled version involves shufps instructions. It loads 4 vectors and uses 4x shufps as a 2-input shuffle to get the odd vs. even elements from pairs of vectors, feeding 4x paddd instructions.

    Also note that your unrolled version is only equivalent to the original for even loop counts, since you didn't write cleanup code for the odd element. And with i < count as the loop condition, an odd count will read past the end of the array instead of stopping early. In the general case, i < (count - (unroll_factor-1)) stops after the last "full" iteration, leaving some elements leftover instead of going too far. Notice how that expression simplifies to i < count for an unroll_factor of 1; that's helpful for remembering it.
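
    Applied to your unroll-by-2, a correct-for-any-count version would look something like this (a sketch, not code from the question):

    int sumElementsUnrolledSafe(int *arr, int count)   // hypothetical name
    {
        int sumA = 0;
        int sumB = 0;
        int i = 0;
        // Main loop: keep going while a full pair remains,
        // i.e. i < count - (unroll_factor - 1) with unroll_factor == 2.
        for (; i < count - 1; i += 2) {
            sumA += arr[i];
            sumB += arr[i + 1];
        }
        // Cleanup: at most unroll_factor - 1 == 1 leftover element.
        for (; i < count; ++i) {
            sumA += arr[i];
        }
        return sumA + sumB;
    }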


    See also: How to remove "noise" from GCC/clang assembly output? and Matt Godbolt's CppCon 2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”.