Tags: performance, microbenchmark, inlining

Example of a microbenchmark to demonstrate that code inlining is not always beneficial to performance


TL;DR: many sources repeat the claim that overly aggressive inlining of functions may sometimes hurt application performance because of code bloat or other factors. Is there an actual example of a program that demonstrates this in a measurable manner?


Remember, a micro-benchmark’s mission in life is to magnify some aspect of your program's performance. Because of this, anyone can readily generate a micro-benchmark that makes any given issue seem like a major problem. // Rico Mariani's Performance Tidbits

Many programmers I talk with have a notion that function inlining is unconditionally beneficial for application performance. C/C++ code that I often review has the inline (or equivalent) keyword gratuitously applied to functions, no matter their size, purpose, hotness or placement.

In many cases this strange habit (called "the inline disease" here) is harmless to overall performance: modern compilers have good judgement about what to actually inline, and very little code is hot enough for (non-)inlining to make any difference at all. Still, it is often harmful to the resulting software design: more stuff ends up in headers, files are no longer independently compilable, etc.

While it is rather easy to show that random inlining applied without continuous benchmarking does not make any measurable difference to final performance, I am looking for an extreme example in which forcing the matter demonstrably hurts performance.

A microbenchmark will suffice; while it will not prove anything about the effects of inlining on real-world applications, it should provably demonstrate that blindly enforcing it is not an unconditionally good idea. This is really the idea behind almost any code optimization pass: it may help, it may hurt, and sometimes it makes no difference.

Some requirements for such an example microbenchmark:

  1. It should be a reasonably short program, preferably in C or C++; other languages where inlining can be enforced are also welcome.
  2. It does not have to be a program doing anything useful; it may do "silly" things just to load/stress the underlying hardware.
  3. It should be possible to compile it in two modes: with inlining enforced and with inlining disabled. Any technique to achieve that can be used: conditional compilation to redefine inlining annotations, compiler backend flags to control inlining, etc. (a sketch of the conditional-compilation approach follows this list).
  4. It should be well-formed and exhibit the same well-defined behavior regardless of which of the two modes it is compiled in.
  5. It should contain at least two functions, one calling another, with the intention of affecting the inlining of at least one of them.
  6. It may use any technique to guarantee/enforce function inlining. E.g., the standard inline keyword or compiler-specific extensions (__forceinline, __attribute__((always_inline)), etc.) can be used to instruct the compiler to inline despite its own judgement.
  7. When run, it should report its performance (latency, execution time, or a similar metric) in an easily observable way. This could be as simple as running it under time ./a.out, or internal calls to a timing facility around the affected code.
  8. Finally, when compiled by at least one specific compiler of a specific version and run on at least one target system, the two resulting variants of the program should exhibit a statistically significant difference, with the force-inlined build coming out slower than the non-inlined build.
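
For requirement 3, one common way to get the two build modes from a single source file is to route all inlining annotations through one macro whose definition the preprocessor switches. Below is only a minimal sketch of that toggle, assuming GCC/Clang-style attributes are available; MAYBE_INLINE and step() are made-up names for illustration, and this snippet by itself is not claimed to show any slowdown:

    #include <stdio.h>
    
    /* Build with -DFORCE_INLINE to enforce inlining, and without it to forbid it. */
    #ifdef FORCE_INLINE
      #define MAYBE_INLINE static inline __attribute__((always_inline))
    #else
      #define MAYBE_INLINE static __attribute__((noinline))
    #endif
    
    MAYBE_INLINE unsigned step(unsigned x) {
        return x * 2654435761u + 12345u;   /* arbitrary work for the callee */
    }
    
    int main(void) {
        unsigned acc = 1;
        for (long i = 0; i < 100000000L; i++)
            acc = step(acc);               /* the call site whose inlining is toggled */
        printf("%u\n", acc);               /* keep the result observable */
        return 0;
    }

Both variants can then be built with otherwise identical flags, e.g. gcc -O2 with and without -DFORCE_INLINE, which also satisfies the point below about keeping all other compiler options equal.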

I do realize that performance is highly dependent on host parameters; what is slower on one machine may become as fast or faster on another. But I am looking for a worst-case scenario in which unrestricted inlining is demonstrably counterproductive.

Ideally, other compiler backend options not affecting inlining (such as the overall optimization level) should be the same for the two builds, in order to exclude the possibility that the observable difference is explained by them rather than by the applied/skipped inlining.


I have an idea for a starting point for such a program, but I need further help with fleshing it out:

  1. An inner function that is huge, almost too big to fit into the CPU's instruction cache.
  2. An outer function that is big enough so that, if the inner one is forcibly inlined into it, the resulting code section becomes larger than the CPU's instruction cache.
  3. The program's control flow is organized in such a way that, when everything is inlined, it experiences a higher rate of instruction cache misses, cache flushes or similar events that would not happen if inlining is not enforced (a rough sketch of this shape follows the list).
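
To make that shape concrete, here is a rough sketch built on the same toggle idea; all of the names, repetition factors and iteration counts below are placeholders that would need tuning against a real instruction-cache size, so treat it as a starting point rather than a measured demonstration:

    #include <stdint.h>
    #include <stdio.h>
    
    /* Paste a statement many times to blow up the callee's machine code.
       The repetition factors are placeholders, to be sized so that the
       inlined copies push outer() past the CPU's instruction cache. */
    #define REP4(s)   s s s s
    #define REP16(s)  REP4(REP4(s))
    #define REP256(s) REP16(REP16(s))
    
    #ifdef FORCE_INLINE
      #define BIG_INLINE static inline __attribute__((always_inline))
    #else
      #define BIG_INLINE static __attribute__((noinline))
    #endif
    
    /* Inner function: a long run of straight-line code. */
    BIG_INLINE uint64_t inner(uint64_t x) {
        REP256(x = x * 6364136223846793005ULL + 1442695040888963407ULL;)
        return x;
    }
    
    /* Outer function: many call sites, so forced inlining multiplies the code size. */
    static uint64_t outer(uint64_t x, int iters) {
        for (int i = 0; i < iters; i++) {
            REP16(x = inner(x);)
        }
        return x;
    }
    
    int main(void) {
        printf("%llu\n", (unsigned long long)outer(1, 200000));
        return 0;
    }

Whether this particular sketch produces a measurable slowdown on a given machine would still need to be verified; the solution below takes a different route and uses GCC's loop unrolling to generate the code bloat instead.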

Solution

  • I've had a play and used GCC's loop unrolling to give me a lot of machine code from the following C code:

    #include <stdint.h>
    
    // from https://prng.di.unimi.it/xoshiro256starstar.c
    static inline uint64_t rotl(const uint64_t x, int k) {
        return (x << k) | (x >> (64 - k));
    }
    static uint64_t s[4];
    // a fast PRNG that will get inlined to generate lots of code
    static uint64_t next(void) {
        const uint64_t result = rotl(s[1] * 5, 7) * 9;
    
        const uint64_t t = s[1] << 17;
    
        s[2] ^= s[0];
        s[3] ^= s[1];
        s[1] ^= s[2];
        s[0] ^= s[3];
    
        s[2] ^= t;
    
        s[3] = rotl(s[3], 45);
    
        return result;
    }
    
    uint64_t benchmark() {
      uint64_t sum = 0;
    #ifdef UNROLL
      #pragma GCC unroll 65534
    #endif
      for (int i = 0; i < 5000; i++) {
    // do something
        if (sum & 1) {
          sum += next() >> 60;
        } else {
          sum += 1;
        }
      }
    
      return sum;
    }
    
    #include <inttypes.h>
    #include <stdio.h>
    #include <sys/random.h>
    #include <time.h>
    
    int main() {
      struct timespec t0, t1;
    
      // initialise from urandom
      getrandom(s, sizeof(s), 0);
    
      // run test 5 times
      for (int i = 0; i < 5; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
    
        uint64_t res = benchmark();
    
        clock_gettime(CLOCK_MONOTONIC, &t1);
    
        double dt = t1.tv_sec - t0.tv_sec;
        dt += (t1.tv_nsec - t0.tv_nsec) / 1e9;
    
        printf("took %.1f us to calculate %" PRIu64 "\n", dt * 1e6, res);
      }
    }
    

    Saving this to inlinecost.c and compiling via:

    gcc -Wall -Os -o inlinecost inlinecost.c
    gcc -Wall -Os -o inlinecost-unrolled inlinecost.c -DUNROLL
    

    giving me the following binaries:

    -rwxr-xr-x 1 smason smason  15656 Apr 29 16:30 inlinecost
    -rwxr-xr-x 1 smason smason 597288 Apr 29 16:30 inlinecost-unrolled
    

    Showing that it's certainly generating more code.

    Running inlinecost gives me:

    took 12.4 us to calculate 26605
    took 12.0 us to calculate 26265
    took 12.3 us to calculate 26759
    took 12.1 us to calculate 26487
    took 12.3 us to calculate 26499
    

    while inlinecost-unrolled gives me:

    took 167.4 us to calculate 27161
    took 28.1 us to calculate 26685
    took 24.8 us to calculate 26297
    took 25.0 us to calculate 26388
    took 24.2 us to calculate 26763
    

    You can see that the non-inlined code runs a lot more consistently, while the unrolled version takes more than ten times as long on its first run, when all that machine code has to be loaded from RAM into the cache, and then "only" takes about twice as long to execute on the subsequent runs.

    Having the loop in benchmark generate more iterations (e.g. increasing 5000 to 10000) makes this difference even more visible at the cost of taking a long time to compile.

    Here's a link to GodBolt with only 5 iterations unrolled (too many iterations cause the build to time out because it generates too much code), showing that it is inlining the PRNG.

    Hope that's useful!

    Update: I tried changing benchmark to:

    uint64_t benchmark() {
      uint64_t sum = next();
    #ifdef UNROLL
    #pragma GCC unroll 65534
    #endif
      for (int i = 0; i < 30000; i++) {
        if (sum == 0) {
          uint64_t x = 0;
    #ifdef UNROLL
    #pragma GCC unroll 1024
    #endif
          for (int j = 0; j < 4; j++) {
            x ^= next();
          }
          sum += x >> 60;
        } else {
          sum += 1;
        }
      }
    
      return sum;
    }
    

    this takes the unrolled version to ~400µs the first time around, then ~50µs on subsequent iterations, while the version that loops seems to reliably take ~7µs. I was expecting the branch predictor to struggle with this much code, but my CPU at least is doing a remarkably good job with this—an AMD 9900X, i.e. Zen5, not sure why I remembered Zen4 in my comment below.