C++ operator overload performance issue

Consider the following setup. We have 3 files:

main.cpp:

#include <cstdio>
#include <ctime>
#include "class.h"

int main() {
    clock_t begin = clock();
    int a = 0;
    for (int i = 0; i < 1000000000; ++i) {
        a += i;
    }
    clock_t end = clock();
    printf("Number: %d, Elapsed time: %f\n",
            a, double(end - begin) / CLOCKS_PER_SEC);

    begin = clock();
    C b(0);
    for (int i = 0; i < 1000000000; ++i) {
        b += C(i);
    }
    end = clock();
    // Print the result accumulated in b, not the earlier int a.
    printf("Number: %d, Elapsed time: %f\n",
            b.m_number, double(end - begin) / CLOCKS_PER_SEC);
    return 0;
}

class.h:

#include <iostream>
struct C {
public:
    int m_number;
    C(int number);
    void operator+=(const C & rhs);
};

class.cpp:

#include "class.h"

C::C(int number)
: m_number(number)
{
}
void 
C::operator+=(const C & rhs) {
    m_number += rhs.m_number;
}

Files are compiled using clang++ with flags -std=c++11 -O3.

I expected very similar performance from both loops, since I thought the compiler would optimize the operators so that they are not called as functions. The reality was a bit different; here is the result:

Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 5.375751

I played around a bit and found that if I paste all of the code from class.h and class.cpp into main.cpp, the speed improves dramatically and the two results become very similar:

Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 0.000003

Then I realized that this behavior is probably caused by the fact that main.cpp and class.cpp are compiled completely separately, so the compiler is unable to perform adequate optimizations.

My question: Is there any way of keeping the 3-file scheme and still achieving the same optimization level as if the files were merged into one and then compiled? I have read something about 'unity builds', but that seems like overkill.

Solution

Short answer

What you want is link-time optimization (LTO). Following the answer to a related question, try:

clang++ -O4 -emit-llvm main.cpp -c -o main.bc 
clang++ -O4 -emit-llvm class.cpp -c -o class.bc 
llvm-link main.bc class.bc -o all.bc
opt -std-compile-opts -std-link-opts -O3 all.bc -o optimized.bc
clang++ optimized.bc -o yourExecutable

You should see performance match what you had when pasting everything into main.cpp.

Long answer

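The problem is that, with separate compilation, the compiler cannot inline your overloaded operator. While compiling main.cpp it sees only the declaration from class.h, and at link time the definition exists only as machine code, which cannot be inlined. Thus, the operator call in main.cpp stays a real function call to the function defined in class.cpp. A function call is very expensive in comparison to a simple inlined addition, which can be optimized further (e.g., vectorized). The opaque call also keeps the optimizer from simplifying the loop as a whole, which is why the plain int loop finishes in almost no time while the operator loop takes seconds.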

When you enable link-time optimization, the compiler is able to inline across files. As the commands above show, you first compile each source file to LLVM bitcode (the .bc files) instead of machine code. You then link these files into a new .bc file, which still contains LLVM bitcode rather than machine code. In contrast to machine code, bitcode can still be inlined and optimized. opt is the LLVM optimizer (be sure to install LLVM), which performs the inlining and further link-time optimizations. Finally, clang++ is called one more time to generate executable machine code from the optimized bitcode.
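
If LTO is not an option, a common alternative is to make the definition visible to every translation unit by defining the small members in the header. The sketch below is a hypothetical variant of class.h, not the code from the question: the members are defined inside the class body (which makes them implicitly inline), operator+= returns C& instead of void, and sum_below is an illustrative helper added only to exercise the operator.

```cpp
// class.h (hypothetical header-only variant of the struct C from the question)
#ifndef CLASS_H
#define CLASS_H

struct C {
    int m_number;

    // Defined in the class body, so implicitly inline: every translation
    // unit that includes this header sees the full definition.
    C(int number) : m_number(number) {}

    // Assumption: returning C& is a change from the original void return;
    // it is the conventional signature and allows chaining.
    C& operator+=(const C& rhs) {
        m_number += rhs.m_number;
        return *this;
    }
};

// Illustrative helper (not in the original): sums 0, 1, ..., n - 1 through
// the overloaded operator, mirroring the second loop in main.cpp.
inline int sum_below(int n) {
    C total(0);
    for (int i = 0; i < n; ++i) {
        total += C(i);
    }
    return total.m_number;
}

#endif // CLASS_H
```

With such a header, main.cpp compiles unchanged and class.cpp is no longer needed for these two members. The trade-off is that any change to the definitions recompiles every file that includes the header.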

For People with GCC

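The commands above are specific to clang and the LLVM tools. GCC (g++) users must pass the -flto flag both when compiling and when linking to enable link-time optimization. This is simpler than the manual LLVM pipeline above: simply add -flto everywhere. (Recent clang++ releases also accept -flto in the same way, provided the linker supports LLVM LTO.)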

      g++ -c -O2 -flto main.cpp
      g++ -c -O2 -flto class.cpp
      g++ -o myprog -flto -O2 main.o class.o