I was trying to measure the time for a snippet of code and noticed that the timings were about 50 ns faster when I ran the program from inside my editor, QtCreator, compared to when I ran it from a bash shell in a gnome-terminal. I'm running Ubuntu 20.04.
A small program to reproduce my problem:
#include <stdio.h>
#include <time.h>

// Read the monotonic clock.
struct timespec now() {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return now;
}

// Difference between two timestamps, in nanoseconds.
long interval_ns(struct timespec tick, struct timespec tock) {
    return (tock.tv_sec - tick.tv_sec) * 1000000000L
         + (tock.tv_nsec - tick.tv_nsec);
}

int main() {
    // sleep(1);
    for (size_t i = 0; i < 10; i++) {
        struct timespec tick = now();
        struct timespec tock = now();
        long elapsed = interval_ns(tick, tock);
        printf("It took %ld ns\n", elapsed);  // %ld: elapsed is a signed long
    }
    return 0;
}
Output when running from within QtCreator:
It took 84 ns
It took 20 ns
It took 20 ns
It took 21 ns
It took 21 ns
It took 21 ns
It took 22 ns
It took 21 ns
It took 20 ns
It took 21 ns
And when running from my shell inside the terminal:
$ ./foo
It took 407 ns
It took 136 ns
It took 74 ns
It took 73 ns
It took 77 ns
It took 79 ns
It took 74 ns
It took 81 ns
It took 74 ns
It took 78 ns
I have tried launching the program with env -i to rule out environment differences. I have verified that the same binary is called in all cases, and that the nice value is 0 for the program in all cases.
Why does starting the program from my shell make a difference? Any suggestions on what to try?
If I add a sleep(1) call at the beginning of main, both the QtCreator and gnome-terminal/bash invocations report the longer execution times.
If I instead add a system("ps -H") call at the beginning of main, without the previously mentioned sleep(1), both invocations report the short execution times (~20 ns).
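For completeness, a sketch of the variant used for those two experiments; note that sleep() is declared in <unistd.h> and system() in <stdlib.h>, which the listing above does not include:

#include <stdio.h>
#include <stdlib.h>   // system()
#include <time.h>
#include <unistd.h>   // sleep()

struct timespec now(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t;
}

long interval_ns(struct timespec tick, struct timespec tock) {
    return (tock.tv_sec - tick.tv_sec) * 1000000000L
         + (tock.tv_nsec - tick.tv_nsec);
}

int main(void) {
    // Uncomment exactly one of these to reproduce the experiments above:
    // sleep(1);          // both invocations then report the longer times
    // system("ps -H");   // both invocations then report ~20 ns
    for (size_t i = 0; i < 10; i++) {
        struct timespec tick = now();
        struct timespec tock = now();
        printf("It took %ld ns\n", interval_ns(tick, tock));
    }
    return 0;
}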
Just add more iterations to give the CPU time to ramp up to its max clock speed. Your "slow" times are with the CPU at a low-power idle clock speed.
QtCreator apparently uses enough CPU time to make this happen before your program runs, or else you're compiling + running and the compilation process serves as a warm-up (vs. bash's fork/execve being lighter weight).
See "Idiomatic way of performance evaluation?" for more about doing warm-up runs when benchmarking, and also "Why does this delay-loop start to run faster after several iterations with no sleep?"
On my i7-6700K (Skylake) running Linux, increasing the loop iteration count to 1000 is sufficient to get the final iterations running at full clock speed, even though the first couple of iterations are still spent handling page faults and warming up the iTLB, the uop cache, the data caches, and so on.
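The only change needed is the loop bound; a sketch (1000 is what sufficed on my machine, not a universal constant):

    for (size_t i = 0; i < 1000; i++) {
        struct timespec tick = now();
        struct timespec tock = now();
        printf("It took %ld ns\n", interval_ns(tick, tock));
    }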
$ ./a.out
It took 244 ns
It took 150 ns
It took 73 ns
It took 76 ns
It took 75 ns
It took 71 ns
It took 72 ns
It took 72 ns
It took 69 ns
It took 75 ns
...
It took 74 ns
It took 68 ns
It took 69 ns
It took 72 ns
It took 72 ns # 382 "slow" iterations in this test run (copy/paste into wc to check)
It took 15 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 16 ns
It took 16 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 14 ns
It took 16 ns
...
On my system, energy_performance_preference is set to balance_performance, so the hardware P-state governor isn't as aggressive as with performance. Use grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference to check, and use sudo to change it:
sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;do echo balance_performance > "$i";done'
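After changing it, the grep above should show balance_performance for every policy; the output looks something like this (the number of policy directories depends on your machine):

/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference:balance_performance
/sys/devices/system/cpu/cpufreq/policy1/energy_performance_preference:balance_performance
...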
Even running it under perf stat ./a.out is enough to ramp up to max clock speed very quickly, though; it really doesn't take much. But bash's command parsing after you press return is very cheap, with not much CPU work done before it calls execve and reaches main in your new process.
The printf with line-buffered output is what takes most of the CPU time in your program, BTW. That's why it takes so few iterations to ramp up to speed. E.g. if you run perf stat --all-user -r10 ./a.out, you'll see the user-space core clock cycles per second are only about 0.4 GHz, with the rest of the time spent in the kernel in write system calls.
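One way to take those write calls out of the timed region (a sketch, not part of the original program) is to record the intervals into an array and print them only after the loop. Note this also removes most of the CPU work that was warming up the clock speed, so expect the ramp-up to take more iterations:

#include <stdio.h>
#include <time.h>

#define ITERS 1000

static struct timespec now(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t;
}

static long interval_ns(struct timespec tick, struct timespec tock) {
    return (tock.tv_sec - tick.tv_sec) * 1000000000L
         + (tock.tv_nsec - tick.tv_nsec);
}

int main(void) {
    static long results[ITERS];      // static: keep the big array off the stack
    for (size_t i = 0; i < ITERS; i++) {
        struct timespec tick = now();
        struct timespec tock = now();
        results[i] = interval_ns(tick, tock);
    }
    // All the printf/write work now happens after the measurements:
    for (size_t i = 0; i < ITERS; i++)
        printf("It took %ld ns\n", results[i]);
    return 0;
}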