TL;DR: many sources repeat the claim that overly aggressive inlining of functions may sometimes hurt application performance because of code bloat or other factors. Is there an actual example of a program that demonstrates this in a measurable manner?
Remember, a micro-benchmark’s mission in life is to magnify some aspect of your program’s performance. Because of this, anyone can readily generate a micro-benchmark that makes any given issue seem like a major problem. // Rico Mariani's Performance Tidbits
Many programmers I talk with have a notion that function inlining is unconditionally beneficial for application performance. C/C++ code that I often review has the inline (or equivalent) keyword gratuitously applied to functions, no matter their size, purpose, hotness or placement.
In many cases this strange habit (called "the inline disease" here) is harmless to overall performance: modern compilers have good judgement on what to actually inline, and very little code is hot enough for (non-)inlining to make any difference at all. Still, it is often hurtful to resulting software design: more stuff ends up in headers, files are no longer independently compilable, etc.
While it is rather easy to show that random inlining applied without continuous benchmarking does not make any measurable difference to final performance, I am looking for an extreme example where forcing the matter actively hurts performance.
A microbenchmark will suffice; while it will not prove anything about the effects of inlining on real-world applications, it should demonstrate that blindly enforcing it is not an unconditionally good idea. This is really the idea behind almost any code optimization pass: it may help, it may hurt, and sometimes it makes no difference.
Some requirements for such an example microbenchmark:
- The inline keyword or compiler-specific extensions (__forceinline, __attribute__((always_inline)) etc.) can be used to instruct the compiler to inline despite its own judgement.
- The slowdown should be observable in wall-clock time, e.g. via time a.out, or via internal calls to a timing facility around the affected code.
I do realize that performance is highly dependent on host parameters; what is slower on one machine may become as fast or faster on another. But I am looking for a worst-case scenario where unrestricted inlining is demonstrably counterproductive.
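For concreteness, this is a minimal sketch of the kind of forcing I mean; FORCE_INLINE is a hypothetical macro name of my own, wrapping the MSVC and GCC/Clang extensions mentioned above:

/* Hypothetical portability macro mapping to the compiler-specific
 * "inline no matter what" spellings mentioned above. */
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline  /* plain hint only; best effort */
#endif

/* Usage: the compiler is asked to inline this even where its own
 * heuristics would decline. */
static FORCE_INLINE int square(int x) { return x * x; }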
Ideally, other compiler backend options not affecting inlining (such as overall optimization level etc.) should be the same for two builds, in order to exclude the possibility that the observable difference is explained by them and not by the applied/skipped inlining.
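Something like this pair of builds is what I have in mind (prog.c is a placeholder name; -fno-inline is GCC's switch that suppresses inlining of everything not marked always_inline while leaving the rest of -O2 alone):

gcc -O2 -o prog-inlined prog.c
gcc -O2 -fno-inline -o prog-not-inlined prog.c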
I have an idea for a starting point for such a program, but I need further help with fleshing it out:
I've had a play and used GCC's loop unrolling to give me a lot of machine code from the following C code:
#include <stdint.h>

// from https://prng.di.unimi.it/xoshiro256starstar.c
static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

static uint64_t s[4];

// a fast PRNG that will get inlined to generate lots of code
static uint64_t next(void) {
    const uint64_t result = rotl(s[1] * 5, 7) * 9;
    const uint64_t t = s[1] << 17;

    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];

    s[2] ^= t;
    s[3] = rotl(s[3], 45);

    return result;
}

uint64_t benchmark() {
    uint64_t sum = 0;
#ifdef UNROLL
#pragma GCC unroll 65534
#endif
    for (int i = 0; i < 5000; i++) {
        // do something
        if (sum & 1) {
            sum += next() >> 60;
        } else {
            sum += 1;
        }
    }
    return sum;
}

#include <inttypes.h>
#include <stdio.h>
#include <sys/random.h>
#include <time.h>

int main() {
    struct timespec t0, t1;

    // initialise from urandom
    getrandom(s, sizeof(s), 0);

    // run test 5 times
    for (int i = 0; i < 5; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t res = benchmark();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double dt = t1.tv_sec - t0.tv_sec;
        dt += (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("took %.1f us to calculate %" PRIu64 "\n", dt * 1e6, res);
    }
}
Saving this to inlinecost.c
and compiling via:
gcc -Wall -Os -o inlinecost inlinecost.c
gcc -Wall -Os -o inlinecost-unrolled inlinecost.c -DUNROLL
giving me the following binaries:
-rwxr-xr-x 1 smason smason 15656 Apr 29 16:30 inlinecost
-rwxr-xr-x 1 smason smason 597288 Apr 29 16:30 inlinecost-unrolled
Showing that it's certainly generating more code.
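If you want to see where those extra bytes live, standard binutils tools should show it; I didn't record this output, and the exact numbers will vary by toolchain:

size inlinecost inlinecost-unrolled                  # compare the text sections
nm -S --size-sort inlinecost-unrolled | tail -n 3    # benchmark should be by far the largest symbol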
Running inlinecost
gives me:
took 12.4 us to calculate 26605
took 12.0 us to calculate 26265
took 12.3 us to calculate 26759
took 12.1 us to calculate 26487
took 12.3 us to calculate 26499
while inlinecost-unrolled
gives me:
took 167.4 us to calculate 27161
took 28.1 us to calculate 26685
took 24.8 us to calculate 26297
took 25.0 us to calculate 26388
took 24.2 us to calculate 26763
You can see that the non-unrolled code runs a lot more consistently, while the unrolled version takes more than ten times as long on the first run, when all that machine code has to be loaded from RAM into the cache, and then "only" about twice as long on subsequent runs.
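If you want to confirm that this really is instruction-cache/front-end pressure rather than something else, perf should show it directly. I haven't recorded these counters here, and the exact event names vary by CPU and kernel:

perf stat -e instructions,L1-icache-load-misses,branch-misses ./inlinecost
perf stat -e instructions,L1-icache-load-misses,branch-misses ./inlinecost-unrolled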
Having the loop in benchmark
generate more iterations (e.g. increasing 5000 to 10000) makes this difference even more visible at the cost of taking a long time to compile.
Here's a link to GodBolt with only 5 iterations unrolled (too many iterations causes the build to timeout because it's generating too much code) showing that it's inlining the PRNG.
Hope that's useful!
Update: tried changing benchmark
to do:
uint64_t benchmark() {
    uint64_t sum = next();
#ifdef UNROLL
#pragma GCC unroll 65534
#endif
    for (int i = 0; i < 30000; i++) {
        if (sum == 0) {
            uint64_t x = 0;
#ifdef UNROLL
#pragma GCC unroll 1024
#endif
            for (int j = 0; j < 4; j++) {
                x ^= next();
            }
            sum += x >> 60;
        } else {
            sum += 1;
        }
    }
    return sum;
}
this takes the unrolled version to ~400µs the first time around, then ~50µs on subsequent iterations, while the non-unrolled version seems to reliably take ~7µs. I was expecting the branch predictor to struggle with this much code, but my CPU at least is doing a remarkably good job with it (an AMD 9900X, i.e. Zen 5; not sure why I remembered Zen 4 in my comment below).
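A further variant I haven't tried, which should help separate how much of the slowdown comes from the thousands of inlined copies (the code bloat) and how much from the unrolling itself: keep the pragmas but forbid inlining of next(), so the unrolled loop emits out-of-line calls instead. A sketch using GCC's noinline attribute; NOINLINE_NEXT is just an illustrative macro name:

// Hypothetical variant: build with e.g.
//   gcc -Wall -Os -DUNROLL -DNOINLINE_NEXT -o inlinecost-unrolled-calls inlinecost.c
// so the unrolled loop contains calls to next() rather than inlined copies of it.
#ifdef NOINLINE_NEXT
__attribute__((noinline))
#endif
static uint64_t next(void) {
    const uint64_t result = rotl(s[1] * 5, 7) * 9;
    const uint64_t t = s[1] << 17;

    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];

    s[2] ^= t;
    s[3] = rotl(s[3], 45);

    return result;
}

Comparing that binary against inlinecost-unrolled should show how much of the extra size and the slowdown is due specifically to the inlining rather than the unrolling.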