Consider the following trivial example:
#include <QCoreApplication>
#include <QDebug>
#include <QElapsedTimer>
#include <QRandomGenerator>
#include <QDateTime>
#include <QVector>
#include <functional>
inline void reader(int * d, int c, std::function<void(int *)> foo) {
    while (c--) foo(d++);
}

template <class FOO>
inline void readerT(int * d, int c, FOO foo) {
    while (c--) foo(d++);
}

inline int summer(int * d, int c) {
    int s = 0;
    reader(d, c, [&s](int * dd){ s += *dd; });
    return s;
}

inline int summerT(int * d, int c) {
    int s = 0;
    readerT(d, c, [&s](int * dd){ s += *dd; });
    return s;
}
int main(int argc, char* argv[]) {
    QCoreApplication a(argc, argv);
    int tcount = 100000000;
    QVector<int> vec; vec.resize(tcount);
    QRandomGenerator gen = QRandomGenerator::securelySeeded();
    gen.generate(vec.begin(), vec.end());
    QElapsedTimer t;
    t.restart();
    qDebug() << summer(vec.data(), tcount) << t.elapsed(); // 110 ms
    t.restart();
    qDebug() << summerT(vec.data(), tcount) << t.elapsed(); // 15 ms
    t.restart();
    int sum = 0;
    for (const auto & i : vec) { sum += i; }
    qDebug() << sum << t.elapsed(); // 15 ms
    return a.exec();
}
The test makes it evident that the non-template solution is considerably slower, while the template solution appears to be as fast as a plain loop, i.e. it has zero overhead.
Why is the compiler not able to reach the same level of optimization for the non-template version?
Notes: compiled with GCC 11, -O3.
The problem boils down to the fact that std::function's call mechanism is too complex to be reliably inlined. Among the things necessary for an efficient std::function implementation are type erasure (the concrete callable is hidden behind an internal function pointer) and a small-buffer optimization with a fallback heap allocation for the captured state. For the compiler to undo all of these things, it would need to apply function-pointer devirtualization, heap elision, and inlining pretty much perfectly. Modern compilers are still quite bad at optimizing anything to do with function pointers. You're simply expecting too much.
On the other hand, the template version is very easily inlined, as seen in the assembly output (https://godbolt.org/z/vvG3fnT5f). If the compiler can inline the reader call, that enables partial loop unrolling, auto-vectorization, data-flow analysis, and other optimizations which improve performance by an order of magnitude.
Inlining makes or breaks it all.
Just to illustrate how bad modern compilers are in this field:
int fls() { return 0; }
int tru() { return 1; }
int get(int i) {
    static constexpr int(*f[])() = { fls, tru };
    return f[i]();
}
Neither Clang nor GCC inline this (https://godbolt.org/z/zqjeG5xff), and both emit something along the lines of:
get(int):
        movsx   rdi, edi
        jmp     [QWORD PTR get(int)::f[0+rdi*8]]
The theoretical optimum in this very trivial example would be:
get(int):
        mov     eax, edi
        ret
Inlining even this trivial example is asking too much, and inlining a full std::function call dispatch perfectly is asking even more.
Heap elision and function pointer inlining are generally not worth spending much compiler effort on: if a developer is using function pointers and dynamic allocations, they probably need them, and the optimizer will rarely be able to just optimize them out.