I wanted to benchmark the time needed to do a single addition on my Skylake (i5-6500) CPU. C is low-level enough for me, so I wrote the following code:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <x86intrin.h> // __rdtscp

int main(void) {
    // Initializing stuff
    int a = rand();
    int b = rand();
    const unsigned long loop_count = 1000000000;
    unsigned int ignored; // used for __rdtscp

    // Warming up whatever needs to be warmed up
    for (int i = 0; i < 100000; i++) {
        asm volatile("" : "+r" (a)); // prevents Clang from replacing the loop with a multiplication
        a += b;
    }

    // The actual measurement
    uint64_t timer = __rdtscp(&ignored);
    for (unsigned long i = 0; i < loop_count; i++) {
        asm volatile("" : "+r" (a)); // prevents Clang from replacing the loop with a multiplication
        a += b;
    }
    timer = __rdtscp(&ignored) - timer;

    printf("%.2f cycles/iteration\n", (double)timer / loop_count);
    return 0;
}
Compiling with Clang 7.0.0 -O3, I get the following assembly (for the loop only):
# %bb.2:
rdtscp
movq %rdx, %rdi
movl %ecx, 4(%rsp)
shlq $32, %rdi
orq %rax, %rdi
movl $1000000000, %eax # imm = 0x3B9ACA00
.p2align 4, 0x90
.LBB0_3: # =>This Inner Loop Header: Depth=1
#APP
#NO_APP
addl %esi, %ebx
addq $-1, %rax
jne .LBB0_3
# %bb.4:
rdtscp
And running this code outputs
0.94 cycles/iteration
(or a number pretty much always between 0.93 and 0.96)
I'm surprised that this loop can execute in less than 1 cycle/iteration, since there is a loop-carried data dependency on a that should prevent parallel execution of a += b: each addition needs the result of the previous one.
IACA also confirms that the expected throughput is 0.96 cycles per iteration. llvm-mca, on the other hand, predicts a total of 104 cycles to execute 100 iterations of the loop. (I can edit in the traces if needed; let me know.)
I observe a similar behavior when I use SSE registers rather than general purpose ones.
I can imagine that the CPU is smart enough to notice that b is constant and, since addition is commutative, it could unroll the loop and optimize the additions somehow. However, I've never heard nor read anything about this. Furthermore, if that were what was going on, I'd expect better performance (i.e. fewer cycles/iteration) than 0.94 cycles/iteration.
What is going on? How is this loop able to execute in less than 1 cycle per iteration?
Some background, for completeness. Ignore the rest of the question if you're not interested in why I'm trying to benchmark a single addition.
I know that there are tools (llvm-exegesis, for instance) designed to benchmark a single instruction and that I should use them instead (or just look at Agner Fog's docs). However, I'm actually trying to compare three different additions: one doing a single addition in a loop (the subject of my question); one doing 3 additions per loop on SSE registers, which should maximize port usage and not be limited by data dependencies (roughly sketched below); and one where the addition is implemented as a circuit in software. While the results are mostly as I expected, the 0.94 cycles/iteration for the version with a single addition in a loop left me puzzled.
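For reference, the inner loop of the second variant looks roughly like the sketch below. This is a simplification; the number of accumulators and their initialization are illustrative, and the warm-up/rdtscp timing code (omitted here) follows the same structure as above. Because the three accumulators are independent, the three paddd instructions don't form a single dependency chain and can use several vector ALU ports per cycle.

#include <emmintrin.h> // _mm_add_epi32, _mm_set1_epi32 (SSE2)
#include <stdlib.h>

int main(void) {
    // Three independent accumulators, plus the constant operand broadcast to a vector
    __m128i acc0 = _mm_set1_epi32(rand());
    __m128i acc1 = _mm_set1_epi32(rand());
    __m128i acc2 = _mm_set1_epi32(rand());
    __m128i b4   = _mm_set1_epi32(rand());
    const unsigned long loop_count = 1000000000;

    for (unsigned long i = 0; i < loop_count; i++) {
        asm volatile("" : "+x" (acc0), "+x" (acc1), "+x" (acc2)); // keep the loop alive
        acc0 = _mm_add_epi32(acc0, b4);
        acc1 = _mm_add_epi32(acc1, b4);
        acc2 = _mm_add_epi32(acc2, b4);
    }
    return 0;
}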
The core frequency and the TSC frequency can be different. Your loop is expected to run at 1 core cycle per iteration. If the core frequency happened to be twice the TSC frequency for the duration of the loop, the measured throughput would be 0.5 TSC cycles per iteration, which is equivalent to 1 core cycle per iteration.
In your case, it appears that the average core frequency was slightly higher than the TSC frequency. If you don't want to take dynamic frequency scaling into account in your experiments, it's easier to just fix the core frequency to be equal to the TSC frequency so that you don't have to convert the numbers. Otherwise, you'd have to measure the average core frequency as well and do a conversion like the one sketched below.
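For illustration, here is that conversion with placeholder numbers: I'm assuming the TSC ticks at the i5-6500's 3.2 GHz rated (base) frequency, and the 3.4 GHz average core frequency is just a guessed value; you would have to measure the actual average core frequency during the run yourself (e.g. with perf).

#include <stdio.h>

int main(void) {
    double tsc_cycles_per_iter = 0.94; /* what the rdtscp-based loop reported      */
    double tsc_freq_ghz        = 3.2;  /* assumed: TSC ticks at the rated frequency */
    double core_freq_ghz       = 3.4;  /* assumed: measured average core frequency  */

    /* Rescale from TSC cycles to core cycles */
    double core_cycles_per_iter = tsc_cycles_per_iter * core_freq_ghz / tsc_freq_ghz;
    printf("%.2f core cycles/iteration\n", core_cycles_per_iter); /* ~1.00 */
    return 0;
}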
On processors that support per-core frequency scaling, you have to either fix the frequency on all the cores or pin the experiment to a single core with a fixed frequency. Alternatively, instead of measuring in TSC cycles, you can use a tool like perf to measure the time in core cycles or seconds directly.
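For example, here is a minimal sketch (Linux only) that counts unhalted core cycles around the loop from within the program via perf_event_open, rather than running the whole binary under perf; error handling is kept to a minimum.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // Configure a hardware counter for core (unhalted) cycles, user-space only
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd == -1) { perror("perf_event_open"); return 1; }

    int a = rand(), b = rand();
    const unsigned long loop_count = 1000000000;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (unsigned long i = 0; i < loop_count; i++) {
        asm volatile("" : "+r" (a)); // same trick as in the question to keep the loop
        a += b;
    }
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0;
    if (read(fd, &cycles, sizeof(cycles)) != (ssize_t)sizeof(cycles)) { perror("read"); return 1; }
    printf("%.2f core cycles/iteration\n", (double)cycles / loop_count);
    close(fd);
    return 0;
}

Comparing this count against the rdtscp measurement from the same run also gives you the average core-to-TSC frequency ratio directly.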
See also: How to get the CPU cycle count in x86_64 from C++?.