Tags: c++, templates, lambda, inline, std-function

Understanding the overhead from std::function and capturing synchronous lambdas


Consider the following trivial example:

#include <QCoreApplication>
#include <QDebug>
#include <QElapsedTimer>
#include <QRandomGenerator>
#include <QVector>
#include <functional>

inline void reader(int * d, int c, std::function<void(int *)> foo) {
  while (c--) foo(d++);
}

template <class FOO>
inline void readerT(int * d, int c, FOO foo) {
  while (c--) foo(d++);
}

inline int summer(int * d, int c) {
  int s = 0;
  reader(d, c, [&s](int * dd){ s += *dd; });
  return s;
}

inline int summerT(int * d, int c) {
  int s = 0;
  readerT(d, c, [&s](int * dd){ s += *dd; });
  return s;
}

int main(int argc, char* argv[]) {
  QCoreApplication a(argc, argv);
  int tcount = 100000000;

  QVector<int> vec;
  vec.resize(tcount);
  QRandomGenerator gen = QRandomGenerator::securelySeeded();
  gen.generate(vec.begin(), vec.end()); // fill the vector with random data

  QElapsedTimer t;
  t.restart();
  qDebug() << summer(vec.data(), tcount) << t.elapsed(); // 110 ms
  t.restart();
  qDebug() << summerT(vec.data(), tcount) << t.elapsed(); // 15 ms

  t.restart();
  int sum = 0;
  for (const auto & i : vec) { sum += i; }
  qDebug() << sum << t.elapsed(); // 15 ms

  return a.exec();
}

The test makes it evident that the non-template solution is considerably slower, while the template solution is as fast as the plain loop, i.e. it has zero overhead.

Why is the compiler not able to reach the same level of optimization for the non-template version?

Note: compiled with GCC 11 at -O3.


Solution

  • The problem boils down to the fact that std::function's call mechanism is too complex to be reliably inlined. Among the things necessary for an efficient std::function implementation are:

    • type erasure, so that callables of arbitrary type can be stored behind a single interface (a minimal sketch of this follows below);
    • small-buffer storage, with a heap allocation as the fallback for captures that do not fit;
    • dispatch through a stored function pointer (or an equivalent vtable).

    For the compiler to undo all of these things, it would need to apply

    • heap elision,
    • devirtualization of the stored function pointers, and
    • constant propagation and inlining through the resulting indirect calls

    ... pretty much perfectly. Modern compilers are still quite bad at optimizing anything to do with function pointers. You're simply expecting too much.
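
    Here is the minimal sketch of that type erasure; all names are invented purely for illustration, and a real std::function additionally manages copies, destruction, and the small buffer:

    #include <utility>

    // Grossly simplified, non-owning type-erased callable: the target is
    // stored behind a void* and invoked through a function pointer.
    template <class R, class... Args>
    class toy_function {
        void* target_ = nullptr;
        R (*invoke_)(void*, Args...) = nullptr;
    public:
        template <class F>
        toy_function(F& f)
            : target_(&f),
              invoke_([](void* t, Args... a) -> R {
                  return (*static_cast<F*>(t))(std::forward<Args>(a)...);
              }) {}

        R operator()(Args... a) const {
            // An indirect call through a runtime function pointer: this is
            // exactly the call the optimizer would have to see through.
            return invoke_(target_, std::forward<Args>(a)...);
        }
    };

    int sum_with_toy(int* d, int c) {
        int s = 0;
        auto body = [&s](int* p) { s += *p; };
        toy_function<void, int*> fn(body);
        while (c--) fn(d++);
        return s;
    }

    Unless the optimizer can prove which function invoke_ points to, the lambda body stays opaque and the loop around it cannot be unrolled or vectorized.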

    On the other hand, the template version is very easily inlined, as seen in the assembly output (https://godbolt.org/z/vvG3fnT5f). If you can inline reader, that enables partial loop unrolling, auto-vectorization, data flow analysis, and other optimizations which improve performance by an order of magnitude. Inlining makes or breaks it all.
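
    A Qt-free reduction of the question's code, handy for pasting straight into Compiler Explorer, reproduces the effect (function names mirror the original):

    #include <functional>

    void reader(int* d, int c, std::function<void(int*)> foo) {
        while (c--) foo(d++);
    }

    template <class FOO>
    void readerT(int* d, int c, FOO foo) {
        while (c--) foo(d++);
    }

    int summer(int* d, int c) {   // indirect dispatch: the loop stays opaque
        int s = 0;
        reader(d, c, [&s](int* dd) { s += *dd; });
        return s;
    }

    int summerT(int* d, int c) {  // inlined, unrolled, and auto-vectorized
        int s = 0;
        readerT(d, c, [&s](int* dd) { s += *dd; });
        return s;
    }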

    Example of Missed Optimizations

    Just to illustrate how bad modern compilers are in this field:

    int fls() { return 0; }
    int tru() { return 1; }
    
    int get(int i) {
        static constexpr int(*f[])() = { fls, tru };
        return f[i]();
    }
    

    Neither Clang nor GCC inline this (https://godbolt.org/z/zqjeG5xff), and both emit something along the lines of:

    get(int):
            movsx   rdi, edi
            jmp     [QWORD PTR get(int)::f[0+rdi*8]]
    

    The theoretical optimum in this very trivial example would be:

    get(int):
            mov eax, edi
            ret
    

    If even this trivial example is asking too much, then inlining the whole of a std::function call dispatch perfectly is asking far more.
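
    By contrast, make the index a compile-time constant and the call target becomes known, at which point both compilers do inline it (get_one here is an added illustration, not part of the original example):

    int fls() { return 0; }
    int tru() { return 1; }

    static constexpr int (*f[])() = { fls, tru };

    // Known target: the call is inlined and folds to "mov eax, 1; ret".
    int get_one() { return f[1](); }

    // Runtime index: the compiler emits the indirect jump shown above.
    int get(int i) { return f[i](); }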

    Heap elision and function-pointer inlining are generally not worth spending much effort on: if a developer is using function pointers and dynamic allocations, they probably need them, and you won't be able to simply optimize them out.
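
    The allocation half of that claim is easy to observe: a capture larger than the implementation's small buffer forces std::function onto the heap, which is one more thing the optimizer would have to elide. A minimal sketch (the exact buffer size is implementation-defined, so the threshold varies):

    #include <cstdio>
    #include <cstdlib>
    #include <functional>
    #include <new>

    static int allocations = 0;

    // Count every dynamic allocation so we can see when std::function
    // spills its target onto the heap.
    void* operator new(std::size_t n) {
        ++allocations;
        if (void* p = std::malloc(n)) return p;
        throw std::bad_alloc{};
    }
    void operator delete(void* p) noexcept { std::free(p); }
    void operator delete(void* p, std::size_t) noexcept { std::free(p); }

    int main() {
        int a = 1;
        std::function<int()> small = [a] { return a; };
        std::printf("small capture: %d allocation(s)\n", allocations);

        allocations = 0;
        int big[16] = {};
        std::function<int()> large = [big] { return big[0]; };
        std::printf("large capture: %d allocation(s)\n", allocations);
        // Typical output: 0 for the small capture, 1 for the large one.
    }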