c++performanceassemblycpu-architecturegoogle-benchmark

Why is ONE basic arithmetic operation in for loop body executed SLOWER THAN TWO arithmetic operations?


While I experimented with measuring time of execution of arithmetic operations, I came across very strange behavior. A code block containing a for loop with one arithmetic operation in the loop body was always executed slower than an identical code block, but with two arithmetic operations in the for loop body. Here is the code I ended up testing:

#include <iostream>
#include <chrono>

#define NUM_ITERATIONS 100000000

int main()
{
    // Block 1: one operation in loop body
    {
        int64_t x = 0, y = 0;
        auto start = std::chrono::high_resolution_clock::now();

        for (long i = 0; i < NUM_ITERATIONS; i++) {x+=31;}

        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end-start;
        std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
    }

    // Block 2: two operations in loop body
    {
        int64_t x = 0, y = 0;
        auto start = std::chrono::high_resolution_clock::now();

        for (long i = 0; i < NUM_ITERATIONS; i++) {x+=17; y-=37;}

        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end-start;
        std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
    }

    return 0;
}

I tested this with different levels of code optimization (-O0,-O1,-O2,-O3), with different online compilers (for example onlinegdb.com), on my work machine, on my hame PC and laptop, on RaspberryPi and on my colleague's computer. I rearranged these two code blocks, repeated them, changed constants, changed operations (+, -, <<, =, etc.), changed integer types. But I always got similar result: the block with one line in loop is SLOWER than block with two lines:

1.05681 seconds. x,y = 3100000000,0
0.90414 seconds. x,y = 1700000000,-3700000000

I checked the assembly output on https://godbolt.org/ but everything looked like I expected: second block just had one more operation in assembly output.

Three operations always behaved as expected: they are slower than one and faster than four. So why two operations produce such an anomaly?

Edit:

Let me repeat: I have such behaviour on all of my Windows and Unix machines with code not optimized. I looked at assembly I execute (Visual Studio, Windows) and I see the instructions I want to test there. Anyway if the loop is optimized away, there is nothing I ask about in the code which left. I added that optimizations notice in the question to avoid "do not measure not optimized code" answers because optimizations is not what I ask about. The question is actually why my computers execute two operations faster than one, first of all in code where these operations are not optimized away. The difference in time of execution is 5-25% on my tests (quite noticeable).


Solution

  • This effect only happens at -O0 (or with volatile), and is a result of the compiler keeping your variables in memory (not registers). You'd expect that to just introduce a fixed amount of extra latency into a loop-carried dependency chains through i, x, and y, but modern CPUs are not that simple.

    On Intel Sandybridge-family CPUs, store-forwarding latency is lower when the load uop runs some time after the store whose data it's reloading, not right away. So an empty loop with the loop counter in memory is the worst case. I don't understand what CPU design choices could lead to that micro-architectural quirk, but it's a real thing.

    This is basically a duplicate of Adding a redundant assignment speeds up code when compiled without optimization, at least for Intel Sandybridge-family CPUs.

    This is is one of the major reasons why you shouldn't benchmark at -O0: the bottlenecks are different than in realistically optimized code. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why compilers make such terrible asm on purpose.

    Micro-benchmarking is hard; you can only measure something properly if you can get compilers to emit realistically optimized asm loops for the thing you're trying to measure. (And even then you're only measuring throughput or latency, not both; those are separate things for single operations on out-of-order pipelined CPUs: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?)

    See @rcgldr's answer for measurement + explanation of what would happen with loops that keep variables in registers.

    With clang, benchmark::DoNotOptimize(x1 += 31) also de-optimizes into keeping x in memory, but with GCC it does just stay in a register. Unfortunately @SashaKnorre's answer used clang on QuickBench, not gcc, to get results similar to your -O0 asm. It does show the cost of lots of short-NOPs being hidden by the bottleneck through memory, and a slight speedup when those NOPs delay the reload next iteration just long enough for store-forwarding to hit the lower latency good case. (QuickBench I think runs on Intel Xeon server CPUs, with the same microarchitecture inside each CPU core as desktop version of the same generation.)


    Presumably all the x86 machines you tested on had Intel CPUs from the last 10 years, or else there's a similar effect on AMD. It's plausible there's a similar effect on whichever ARM CPU your RPi uses, if your measurements really were meaningful there. Otherwise, maybe another case of seeing what you expected (confirmation bias), especially if you tested with optimization enabled there.


    I tested this with different levels of code optimization (-O0,-O1,-O2,-O3) [...] But I always got similar result

    I added that optimizations notice in the question to avoid "do not measure not optimized code" answers because optimizations is not what I ask about.

    (later from comments) About optimizations: yes, I reproduced that with different optimization levels, but as the loops were optimized away, the execution time was too fast to say for sure.

    So actually you didn't reproduce this effect for -O1 or higher, you just saw what you wanted to see (confirmation bias) and mostly made up the claim that the effect was the same. If you'd accurately reported your data (measurable effect at -O0, empty timed region at -O1 and higher), I could have answered right away.

    See Idiomatic way of performance evaluation? - if your times don't increase linearly with increasing repeat count, you aren't measuring what you think you're measuring. Also, startup effects (like cold caches, soft page faults, lazy dynamic linking, and dynamic CPU frequency) can easily lead to the first empty timed region being slower than the second.

    I assume you only swapped the loops around when testing at -O0, otherwise you would have ruled out there being any effect at -O1 or higher with that test code.


    The loop with optimization enabled:

    As you can see on Godbolt, gcc fully removes the loop with optimization enabled. Sometimes GCC leaves empty loops alone, like maybe it thinks the delay was intentional, but here it doesn't even loop at all. Time doesn't scale with anything, and both timed regions look the same like this:

    orig_main:
       ...
            call    std::chrono::_V2::system_clock::now()       # demangled C++ symbol name
            mov     rbp, rax                                    # save the return value = start
            call    std::chrono::_V2::system_clock::now()
            # end in RAX
    

    So the only instruction in the timed region is saving start to a call-preserved register. You're measuring literally nothing about your source code.

    With Google Benchmark, we can get asm that doesn't optimize the work away, but which doesn't store/reload to introduce new bottlenecks:

    #include <benchmark/benchmark.h>
    
    static void TargetFunc(benchmark::State& state) {
       uint64_t x2 = 0, y2 = 0;
      // Code inside this loop is measured repeatedly
      for (auto _ : state) {
        benchmark::DoNotOptimize(x2 += 31);
        benchmark::DoNotOptimize(y2 += 31);
      }
    }
    // Register the function as a benchmark
    BENCHMARK(TargetFunc);
    
    # just the main loop, from gcc10.1 -O3 
    .L7:                         # do{
            add     rax, 31        # x2 += 31
            add     rdx, 31        # y2 += 31
            sub     rbx, 1
            jne     .L7          # }while(--count != 0)
    

    I assume benchmark::DoNotOptimize is something like asm volatile("" : "+rm"(x) ) (GNU C inline asm) to make the compiler materialize x in a register or memory, and to assume the lvalue has been modified by that empty asm statement. (i.e. forget anything it knew about the value, blocking constant-propagation, CSE, and whatever.) That would explain why clang stores/reloads to memory while GCC picks a register: this is a longstanding missed-optimization bug with clang's inline asm support. It likes to pick memory when given the choice, which you can sometimes work around with multi-alternative constraints like "+r,m". But not here; I had to just drop the memory alternative; we don't want the compiler to spill/reload to memory anyway.

    For GNU C compatible compilers, we can use asm volatile manually with only "+r" register constraints to get clang to make good scalar asm (Godbolt), like GCC. We get an essentially identical inner loop, with 3 add instructions, the last one being an add rbx, -1 / jnz that can macro-fuse.

    static void TargetFunc(benchmark::State& state) {
       uint64_t x2 = 0, y2 = 0;
      // Code inside this loop is measured repeatedly
      for (auto _ : state) {
          x2 += 16;
          y2 += 17;
        asm volatile("" : "+r"(x2), "+r"(y2));
      }
    }
    

    All of these should run at 1 clock cycle per iteration on modern Intel and AMD CPUs, again see @rcgldr's answer.

    Of course this also disables auto-vectorization with SIMD, which compilers would do in many real use cases. Or if you used the result at all outside the loop, it might optimize the repeated increment into a single multiply.

    You can't measure the cost of the + operator in C++ - it can compile very differently depending on context / surrounding code. Even without considering loop-invariant stuff that hoists work. e.g. x + (y<<2) + 4 can compile to a single LEA instruction for x86.


    The question is actually why my computers execute two operations faster than one, first of all in code where these operations are not optimized away

    TL:DR: it's not the operations, it's the loop-carried dependency chain through memory that stops the CPU from running the loop at 1 clock cycle per iteration, doing all 3 adds in parallel on separate execution ports.

    Note that the loop counter increment is just as much of an operation as what you're doing with x (and sometimes y).