Tags: c++, templates, lambda, inline, std-function

Understanding the overhead from std::function and capturing synchronous lambdas


Consider the following trivial example:

#include <QCoreApplication>
#include <QDebug>
#include <QElapsedTimer>
#include <QRandomGenerator>
#include <QVector>
#include <functional>

inline void reader(int * d, int c, std::function<void(int *)> foo) {
  while (c--) foo(d++);
}

template <class FOO>
inline void readerT(int * d, int c, FOO foo) {
  while (c--) foo(d++);
}

inline int summer(int * d, int c) {
  int s = 0;
  reader(d, c, [&s](int * dd){ s += *dd; });
  return s;
}

inline int summerT(int * d, int c) {
  int s = 0;
  readerT(d, c, [&s](int * dd){ s += *dd; });
  return s;
}

int main(int argc, char* argv[]) {
  QCoreApplication a(argc, argv);
  int tcount = 100000000;

  QVector<int> vec;
  vec.resize(tcount);
  QRandomGenerator gen = QRandomGenerator::securelySeeded();
  gen.generate(vec.begin(), vec.end()); // fill the vector with random data

  QElapsedTimer t;
  t.restart();
  qDebug() << summer(vec.data(), tcount) << t.elapsed(); // 110 ms
  t.restart();
  qDebug() << summerT(vec.data(), tcount) << t.elapsed(); // 15 ms

  t.restart();
  int sum = 0;
  for (const auto & i : vec) { sum += i; }
  qDebug() << sum << t.elapsed(); // 15 ms

  return a.exec();
}

The test makes it evident that the non-template solution is considerably slower, while the template solution is as fast as the plain loop, i.e. it has zero overhead.

Why is the compiler not able to reach the same level of optimization for the non-template version?

Note: compiled with GCC 11 at -O3.


Solution

  • The problem boils down to the fact that std::function's call mechanism is too complex to be reliably inlined. Among the things necessary for an efficient std::function implementation are:

    • type erasure, so that callables of arbitrary type can be stored behind a single interface (a minimal sketch of this follows below);
    • small-buffer storage, with a heap allocation as the fallback for captures that do not fit;
    • dispatch through a stored function pointer (or an equivalent vtable).

    For the compiler to undo all of these things, it would need to apply

    • heap elision,
    • devirtualization of the stored function pointers, and
    • constant propagation and inlining through the resulting indirect calls

    ... pretty much perfectly. Modern compilers are still quite bad at optimizing anything to do with function pointers. You're simply expecting too much.
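
    Here is the minimal sketch of that type erasure; all names are invented purely for illustration, and a real std::function additionally manages copies, destruction, and the small buffer:

    #include <utility>

    // Grossly simplified, non-owning type-erased callable: the target is
    // stored behind a void* and invoked through a function pointer.
    template <class R, class... Args>
    class toy_function {
        void* target_ = nullptr;
        R (*invoke_)(void*, Args...) = nullptr;
    public:
        template <class F>
        toy_function(F& f)
            : target_(&f),
              invoke_([](void* t, Args... a) -> R {
                  return (*static_cast<F*>(t))(std::forward<Args>(a)...);
              }) {}

        R operator()(Args... a) const {
            // An indirect call through a runtime function pointer: this is
            // exactly the call the optimizer would have to see through.
            return invoke_(target_, std::forward<Args>(a)...);
        }
    };

    int sum_with_toy(int* d, int c) {
        int s = 0;
        auto body = [&s](int* p) { s += *p; };
        toy_function<void, int*> fn(body);
        while (c--) fn(d++);
        return s;
    }

    Unless the optimizer can prove which function invoke_ points to, the lambda body stays opaque and the loop around it cannot be unrolled or vectorized.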

    On the other hand, the template version is very easily inlined, as seen in the assembly output (https://godbolt.org/z/vvG3fnT5f). If you can inline reader, that enables partial loop unrolling, auto-vectorization, data flow analysis, and other optimizations which improve performance by an order of magnitude. Inlining makes or breaks it all.
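
    A Qt-free reduction of the question's code, handy for pasting straight into Compiler Explorer, reproduces the effect (function names mirror the original):

    #include <functional>

    void reader(int* d, int c, std::function<void(int*)> foo) {
        while (c--) foo(d++);
    }

    template <class FOO>
    void readerT(int* d, int c, FOO foo) {
        while (c--) foo(d++);
    }

    int summer(int* d, int c) {   // indirect dispatch: the loop stays opaque
        int s = 0;
        reader(d, c, [&s](int* dd) { s += *dd; });
        return s;
    }

    int summerT(int* d, int c) {  // inlined, unrolled, and auto-vectorized
        int s = 0;
        readerT(d, c, [&s](int* dd) { s += *dd; });
        return s;
    }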

    Example of Missed Optimizations

    Just to illustrate how bad modern compilers are in this field:

    int fls() { return 0; }
    int tru() { return 1; }
    
    int get(int i) {
        static constexpr int(*f[])() = { fls, tru };
        return f[i]();
    }
    

    Neither Clang nor GCC inline this (https://godbolt.org/z/zqjeG5xff), and both emit something along the lines of:

    get(int):
            movsx   rdi, edi
            jmp     [QWORD PTR get(int)::f[0+rdi*8]]
    

    The theoretical optimum in this very trivial example would be:

    get(int):
            mov eax, edi
            ret
    

    If even this trivial example is asking too much, then inlining the whole of a std::function call dispatch perfectly is asking far more.
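
    By contrast, make the index a compile-time constant and the call target becomes known, at which point both compilers do inline it (get_one here is an added illustration, not part of the original example):

    int fls() { return 0; }
    int tru() { return 1; }

    static constexpr int (*f[])() = { fls, tru };

    // Known target: the call is inlined and folds to "mov eax, 1; ret".
    int get_one() { return f[1](); }

    // Runtime index: the compiler emits the indirect jump shown above.
    int get(int i) { return f[i](); }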

    Heap elision and function-pointer inlining are generally not worth spending much effort on: if a developer is using function pointers and dynamic allocations, they probably need them, and you won't be able to simply optimize them out.
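
    The allocation half of that claim is easy to observe: a capture larger than the implementation's small buffer forces std::function onto the heap, which is one more thing the optimizer would have to elide. A minimal sketch (the exact buffer size is implementation-defined, so the threshold varies):

    #include <cstdio>
    #include <cstdlib>
    #include <functional>
    #include <new>

    static int allocations = 0;

    // Count every dynamic allocation so we can see when std::function
    // spills its target onto the heap.
    void* operator new(std::size_t n) {
        ++allocations;
        if (void* p = std::malloc(n)) return p;
        throw std::bad_alloc{};
    }
    void operator delete(void* p) noexcept { std::free(p); }
    void operator delete(void* p, std::size_t) noexcept { std::free(p); }

    int main() {
        int a = 1;
        std::function<int()> small = [a] { return a; };
        std::printf("small capture: %d allocation(s)\n", allocations);

        allocations = 0;
        int big[16] = {};
        std::function<int()> large = [big] { return big[0]; };
        std::printf("large capture: %d allocation(s)\n", allocations);
        // Typical output: 0 for the small capture, 1 for the large one.
    }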