c++profilingc++20execution-time

How to measure the execution time of inlined functions in C++?


I have a bunch of functions that complete in a few nanoseconds and I want to measure their execution time to assess their performance.

The basic idea is very simple, take a timestamp, execute the same blocks of code in a loop many times, take another timestamp, calculate the difference between the timestamps and divide the value by the number of iterations.

It is all very simple, the problem is when I put the benchmarking code in the main function, the compiler recognizes that I do the same calculations and do nothing with the result, so it makes the loop no-operation. So I get very small numbers.

I have managed to make the code execute by faking a usage, it works in Visual Studio 2022 but I have failed to make it work in Code::Blocks (MSYS2 g++ compiler).

This method is difficult to reuse, since whenever I want to benchmark another function I have to copy-paste the same lines of code. So I made it a function, and this increases reusability but blows the numbers way up. A lot of my functions are inlined, and using the function to benchmark other functions the inline optimization is disregarded.

Minimal reproducible example:

#include <array>
#include <chrono>
#include <cmath>
#include <functional>
#include <iomanip>
#include <iostream>
#include <utility>
#include <vector>

using std::array;
using std::chrono::steady_clock;
using std::chrono::duration;
using std::cout;
using std::vector;
using std::function;
double d = 0.0;
float r = 0.0;

constexpr double DU = 1.0 / (uint64_t(1) << 52);
constexpr float FU = 1.0 / (1 << 23);
constexpr array<double, 10> LOG2_POLY9 = {
    -3.27179702e00,
    7.19195108e00,
    -7.34289702e00,
    5.01474324e00,
    -1.64079955e00,
    -3.25941179e-01,
    5.83100708e-01,
    -2.58134034e-01,
    5.44419681e-02,
    -4.66794032e-03,
};

template <std::size_t N, std::size_t... I>
inline double apply_poly_impl(double m1, const std::array<double, N>& data, std::index_sequence<I...>) {
    double s = 0;
    double m = 1;
    ((s += std::get<I>(data) * m, m *= m1), ...);
    return s;
}

template <std::size_t N>
inline double apply_poly(double m1, const std::array<double, N>& data) {
    return apply_poly_impl(m1, data, std::make_index_sequence<N>{});
}

inline float fast_log2_p9(float f) {
    uint32_t bits = std::bit_cast<uint32_t>(f);
    int e = ((bits >> 23) & 0xff) - 127;
    double m1 = 1 + (bits & 0x7fffff) * FU;
    double s = apply_poly(m1, LOG2_POLY9);
    return e + s;
}

template <typename T>
double timeit(const function<T(T)>& func, const vector<T>& values, int runs = 1048576){
    auto start = steady_clock::now();
    size_t len = values.size();
    for (int64_t i = 0; i < runs; i++) {
        func(values[i % len]);
    }
    auto end = steady_clock::now();
    duration<double, std::nano> time = end - start;
    return time.count() / runs;
}

int main()
{
    double r4096 = 1.0 / 4096;
    double n;
    vector<double> random_doubles(256);
    vector<float> random_floats(256);
    for (int j = 0; j < 256; j++) {
        n = rand() % 4096 + (rand() % 4096) * r4096;
        random_doubles[j] = n;
        random_doubles[j] = float(n);
    }
    cout << std::setprecision(16);
    auto start = steady_clock::now();
    for (int64_t i = 0; i < 1048576; i++) {
        r = fast_log2_p9(random_floats[i % 256]);
    }
    auto end = steady_clock::now();
    duration<double, std::nano> time = end - start;
    r = fast_log2_p9(0.7413864135742188f);
    cout << "fast_log_p9: " << time.count() / 1048576 << " nanoseconds\n";
    cout << r << '\n';
    cout << timeit<float>(fast_log2_p9, random_floats);
}

Compiled in Visual Studio 2022 with the following flags:

/permissive- /ifcOutput "x64\Release\" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Ob1 /sdl /Fd"x64\Release\vc143.pdb" /Zc:inline /fp:fast /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /std:c17 /Gd /Oi /MD /std:c++20 /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Ot /Fp"x64\Release\exponentiation.pch" /diagnostics:column 

Output:

PS C:\Users\Xeni> C:\Users\Xeni\source\repos\exponentiation\x64\Release\exponentiation.exe
fast_log_p9: 0.9078025817871094 nanoseconds
-0.4317024052143097
15.96231460571289

Compiled with Code::Blocks, commands:

g++.exe -Wall -fexceptions -fomit-frame-pointer -fexpensive-optimizations -flto -O3 -m64 --std=c++20 -march=native -fext-numeric-literals  -c D:\MyScript\CodeBlocks\testapp\main.cpp -o obj\Release\main.o
g++.exe  -o bin\Release\testapp.exe obj\Release\main.o  -O3 -flto -s -static-libstdc++ -static-libgcc -static -m64  

Output:

PS C:\Users\Xeni> D:\MyScript\CodeBlocks\testapp\bin\Release\testapp.exe
fast_log_p9: 9.5367431640625e-05 nanoseconds
-0.4317024052143097
7.897567749023438

So how can I measure the execution time of a code-block with all optimizations on with a method that is easily reusable?


For whatever it is worth I have benchmarked std::log2 provided by Visual Studio 2022 and mine is clearly faster and as precise with /fp:fast flag:

    start = steady_clock::now();
    for (int64_t i = 0; i < 1048576; i++) {
        r = std::log2f(random_floats[i % 256]);
    }
    end = steady_clock::now();
    time = end - start;
    r = std::log2f(0.7413864135742188f);
    cout << "std::log2f: " << time.count() / 1048576 << " nanoseconds\n";
    cout << r << '\n';
fast_log2_p9: 0.9081840515136719 nanoseconds
-0.4317024052143097
std::log2f: 54.67109680175781 nanoseconds
-0.4317024052143097

Additionally, I use Visual Studio 2022 on Windows 10 x64, it should be extremely obvious, and quick-bench.com doesn't use MSVC compiler. My functions use inline, I have just seen that with inline enabled my function clocked 41 nanoseconds, but after I removed the three inline keywords the function clocked 29 nanoseconds. There would be a very high error because I have only run it twice.

But clearly the benchmark disregards inline because disabling the inline should make the code run much slower, it doesn't.

And benchmark::DoNotOptimize(r); is used, while I don't know its exact purpose, clearly the optimization isn't enabled, and my question clearly stated that I want to measure the execution time of the function WITH ALL OPTIMIZATIONS ENABLED, because of course I would use the code with all optimizations.

And while it is true CPU's don't execute C++ code and C++ functions aren't translated into a continuous sequence of instructions, clearly I am stating that I want to measure the time that a function would typically take to complete, an average across many runs that I should expect.


Solution

  • Performance measurement of C++ code can be very hard.

    The chief problem is compilers. CPU's do not execute your C++ code. They execute CPU instructions. Your C++ code is translated to CPU instructions. Modern compilers, as a fundamental part of their design, do not translate every fragment of C++ code to a unique contiguous sequence of CPU instructions.

    Thus, your question "how much time does this function take" already starts with the assumption that the function corresponds to a sequence of CPU instructions, with a start and an end, and no other instructions in between.

    To make things worse, even if the compiler would output such a sequence, a modern CPU doesn't care that much. Out-of-Order Execution has been a standard CPU feature for decades by now.

    And what exactly is the end of an instruction? The time when the value is written back to memory? With modern CPU's, that might not even happen. The CPU core has a queue of outstanding writes, and a later write can make an earlier write moot.

    Your basic idea is to insert C++ benchmarking code. This really, really interferes with your program. You are no longer measuring the original functions. You're measuring new functions. Note that the timing code is functionally unrelated. Your out-of-order CPU can detect that instructions for steady_clock::now(); are unrelated to fast_log2_p9, so it can just decide to start with fast_log2_p9 even though the C++ code says it shouldn't.

    Because of this, the fundamental idea of this question is just flawed. If you care about nanoseconds, you need to understand pipelined, out-of-order CPU's, the effects of cache and memory, and other hardware realities.