I have the following constexpr
Fibonacci function:
```cpp
constexpr auto fibo(unsigned int number) -> unsigned long long {
    if (number < 2) return number;
    return fibo(number - 1) + fibo(number - 2);
}
```
If I benchmark it with the code below, I get the following output:
```
benchmark name                            samples      iterations    est run time
                                          mean         low mean      high mean
                                          std dev      low std dev   high std dev
-------------------------------------------------------------------------------
Benchmark fibo(46)                            100               1        2.66 m
                                        1.60356 s       1.60231 s     1.60477 s
                                       6.30399 ms      5.51902 ms    7.32077 ms
```
When I change the `number` parameter to `unsigned int const& number`, the speed improves significantly:
```
benchmark name                            samples      iterations    est run time
                                          mean         low mean      high mean
                                          std dev      low std dev   high std dev
-------------------------------------------------------------------------------
Benchmark fibo(46)                            100               1     17.7672 ms
                                       177.013 us      176.272 us       177.9 us
                                       4.13731 us      3.56318 us    4.80496 us
```
Why is there such a giant speed increase when passing the parameter by reference?
Benchmarking code:
```cpp
#include "Fibonacci.hpp"
#include "catch2/catch_message.hpp"
#include <catch2/benchmark/catch_benchmark.hpp>
#include <catch2/catch_test_macros.hpp>
#include <catch2/generators/catch_generators.hpp>
#include <catch2/generators/catch_generators_all.hpp>
#include <catch2/generators/catch_generators_range.hpp>
#include <sstream>
#include <utility>

TEST_CASE("Fibonacci calculation", "[Fibonacci Benchmark Suite]") {
    auto entry = GENERATE(table<unsigned long long, unsigned long long>({
        {0, 0},
        {1, 1},
        {2, 1},
        {3, 2},
        {4, 3},
        {5, 5},
        {46, 1836311903} // Runs long!
    }));

    auto input = std::get<0>(entry);
    auto expected = std::get<1>(entry);

    std::ostringstream oss;
    oss << "Benchmark fibo(" << input << ")";
    auto benchmarkName = oss.str();

    BENCHMARK(std::move(benchmarkName)) {
        return fibo(input);
    };

    CAPTURE(input, expected);
    REQUIRE(fibo(input) == expected);
}
```
(Note: this code also seems kind of wonky, since compiling it with GCC 14.2.1 sometimes ate up all of my RAM before the process was killed by the Linux kernel. I've added the flags `-fconstexpr-ops-limit=100000000000000` and `-fconstexpr-cache-depth=50`, but they did not seem to help.)
What an innocent-looking, short, but fantastically awkward test case for compilers! I have never seen the halting problem directly affect compilation before, but GCC, in trying to honour the recursive specification of this code, suffers an exponential increase in compile time and memory usage with larger values of a fixed `N` (a constant known at compile time) when computing `fibo(N)`.
The safe limits on `N` where GCC can fully resolve `fibo(N)` to a compile-time constant are `N <= 35` for call by value and `N <= 28` for call by reference. I have no idea why the upper limits differ, just that they do. The C code is deceptively similar, but the generated assembler is radically different!
I also tried it on MSVC C/C++ 17 and Intel ICX 2024.1, but only got quite normal behaviour and fairly similar performance from both. Intel ICX 2025 appears to cache values it has previously computed, if and only if you get the go-faster compiler options just right. I'm unable to reproduce that transient observation at present; `-O3 -Wall` isn't enough.
I took a look at this code in detail because the recursive routine doesn't actually alter the value of `number`, so I wasn't expecting to see much difference between call by value and call by reference.
Using GCC for small `N <= 28` (a constant known at compile time) that is true: it compiles to a load of the magic constant `fibo(N)`, with exactly the same generated code for both, and a very long compile time (which grows exponentially with `N`).
I created a canonical, simplest-possible toy version on [Godbolt](https://godbolt.org/z/h6G1z7WPh). Please be kind to the site and don't feed in big `N`! Godbolt can just about survive `N = 36` on a good day if the wind is blowing in the right direction; expect the process to be killed if you go higher. You can try the various combos by commenting lines in or out: `N = 28, 29` (call by reference) and `N = 35, 36` (call by value) are the most interesting fixed-value cases, along with `N` unknown at compile time.
The huge difference in speed has almost nothing to do with call by value vs call by reference and everything to do with extremely sophisticated code generation when the call-by-reference path is taken. It puts pretty much everything into registers and takes much bigger steps, decrementing `N` by 5 at each recursion level and then terminating in a lookup table once `N` is small.
The first case, call by value, is nothing out of the ordinary, except that for small known values `N <= 35` it compiles (very slowly) to a load of the constant value `fibo(N)`. The general optimisation consists of unrolling the recursion one level and writing its own code out twice (approx. 300 bytes).
The code generation in the second case is extraordinary, at least on GCC `-O3` (I will try harder to provoke the same behaviour on Intel and MSVC). It uses tricks that suggest to me that it has been very carefully optimised to perform incredibly well on this particular type of benchmark.
There are three scenarios I was able to obtain, but I will restrict this discussion to the two main ones. For `N <= 28`, value known at compile time, it takes ages to compile to a load of the constant value `fibo(N)`. Larger constant values will kill the compiler (or cause the OS to do so) due to exponential memory usage, or hit timeouts (on Godbolt). To hide the value from the compiler I put it in a string, which is good enough to make it generate generic code for any `N` (approx. 160 bytes).
I have examined the assembler code generated by the GCC compiler for the two sample routines and then reconstructed functionally equivalent C code below to help explain more easily how the call by reference code obtains its apparently magical speed.
It has used a tail-recursion trick to minimise the recursion depth: it takes bites of 5 off the value of `N` for each recursive call and then has a lookup table for the smallest values of `N`! It is a quite remarkable code transformation away from the original source code.
Here is the code. `fibo_cbv` is the relatively normal inlined code generation so typical of modern compilers, but it makes heavy use of the stack. `fibo_cbr` is the unusual code generated by GCC for this particular benchmark function; the latter is really rather cunning! My C code isn't exactly equivalent to the compiled assembler, as I needed to tweak it a bit to get it working correctly.
BTW Recursion is a terrible way to compute these values!
```cpp
#include <cmath>
#include <string>
#include <stdio.h>

// N       : 1 2 3 4 5 6 7  8  9  10 11
// fibo(N) : 1 1 2 3 5 8 13 21 34 55 89
constexpr auto fibo(unsigned int number) -> unsigned long long {
//constexpr auto fibo(unsigned int const& number) -> unsigned long long {
    if (number < 2) return number;
    return fibo(number - 1) + fibo(number - 2);
}

constexpr auto fibo_cbv(unsigned int number) -> unsigned long long {
    unsigned long long result = 0;
    if (number < 2) return number;
    if (--number < 2) return number; else result = fibo_cbv(number);
    if (--number < 2) return result + number;
    return result + fibo_cbv(number);
}

constexpr auto fibo_cbr(unsigned int const& number) -> unsigned long long {
    switch (number) {
    case 1: return 1;
    case 2: return 1;
    case 3: return 2;
    case 4: return 3;
    case 5: return 5;
    case 6: return 8;
    case 7: return 13;
    default:
        // return fibo_cbr(number - 1) + fibo_cbr(number - 2); // works much slower
        unsigned long long a = fibo_cbr(number - 6);
        unsigned long long b = fibo_cbr(number - 5);
        return 5 * a + 8 * b;
    }
}

int main()
{
    std::string input{ "28" };
    unsigned int n;
    for (n = 1; n < 36; n++)
    {
        printf("\nfibo_ref(%u) = %llu\n", n, fibo(n));
        printf("fibo_cbv(%u) = %llu\n", n, fibo_cbv(n));
        printf("fibo_cbr(%u) = %llu\n", n, fibo_cbr(n));
    }
    // return fibo(36); // fixed value to test
    n = std::stol(input);
    return fibo(n);
}
```
I also noticed whilst experimenting that if I read in a variable `N`, the call-by-reference code from GCC became even more complicated, using a powers-of-2 strategy instead to take chunks of 8 or 4 off the initial value of `N`. I'm not sure which aspect of the test code took it down that path. I can't reproduce it right now, but I am sure I didn't hallucinate that version of the generated code.
Worth noting also that `N = 93` for `fibo(93)` is as high as you can go without overflowing `uint64_t` (`N = 92` for signed `int64_t`).
I haven't been able to get any other C/C++ compiler I tried, apart from GCC, to do the extensive compile-time computation needed to reduce `fibo(N)` to a static constant the slow recursive way.
The short answer to why call by reference is faster: GCC is using a radically different algorithm that looks nothing like the original C source but is functionally equivalent. It has nothing at all to do with the call-by-reference vs call-by-value mechanism.