Tags: performance, caching, language-agnostic, inlining

Can breaking into helper functions help performance?


Recently I was curious about why compilers don't always inline every function. One reason I found interesting after searching it up is that inlining every function would blow up the size of the executable and lead to bigger functions that might not fit into the cache.

But I'm curious whether the reverse would apply. If you had a massive function, there's a chance it wouldn't fit into the cache after compilation. Would breaking the function apart into helper functions actually be good for performance then? This strikes me as interesting because I usually hear the opposite: breaking code into helper functions incurs some performance cost (unless inlined, of course), and we accept that tradeoff for greater readability.


Solution

  • Yes, it can. This is typically worthwhile when the code is fairly big, the inlined function does not contain hot loops, the inlined variants share a significant part of their code, and they are called roughly equally often. Otherwise, there will be a lot of instruction cache misses, significantly slowing down instruction decoding and thus the execution of the inlined code. Bigger code also takes more space in the cache, causing more data misses due to possible cache thrashing. Sharing common parts reduces the footprint of the code in the caches, so there are potentially fewer misses. In really pathological cases, the code can be so big that its pages cannot all be loaded at once by the operating system, causing pages to be reloaded from the storage device on memory-bound systems (e.g. embedded devices). Cold code, rarely called or called after a context switch, should generally be small so it runs fast. That said, it is quite rare for code to grow a lot, except in a few cases: template instantiation and semi-inlined recursive functions.

    Template instantiation can quickly generate a lot of code, and it is not rare for a significant part of it to be shared between multiple instances. Sharing code is not free: non-inlined leaf functions reduce code size at the expense of a function call, possibly in a relatively hot loop, and far jumps tend to cause cache misses; their targets are not always easily predictable (unpredictable jumps are much more expensive).

    Recursive functions can be partially inlined (a few levels deep) to mitigate the overhead of function calls and to benefit from optimization steps like constant propagation. If the code gets too big, the rarely executed branches can be shared and stored far away so the hot code better fits in the instruction cache (which is critical for hot recursive functions).

    Mainstream compilers can perform this kind of optimization. For example, -ftree-tail-merge in GCC:

    Look for identical code sequences. When found, replace one with a jump to the other. This optimization is known as tail merging or cross jumping. This flag is enabled by default at -O2 and higher. The compilation time in this pass can be limited using max-tail-merge-comparisons parameter and max-tail-merge-iterations parameter.

    There is also the -fipa-icf flag in GCC for example:

    Perform Identical Code Folding for functions and read-only variables. The optimization reduces code size and may disturb unwind stacks by replacing a function by equivalent one with a different name. The optimization works more effectively with link-time optimization enabled.

    Note that you do not need to split the code into separate functions for the compiler to share the identical parts, although doing so can help it (typically by not inlining the leaf calls, which Clang does easily, for example).


    Related post: Do C compilers de-duplicate (merge) code?