Tags: c, linux, x86, microbenchmark, cpu-cycles

clock_gettime takes longer to execute when the program is run from a terminal


I was trying to measure the time for a snippet of code and noticed that the timings were about 50 ns faster when I ran the program from inside my editor, QtCreator, than when I ran it from a bash shell started in a gnome-terminal. I'm using Ubuntu 20.04 as my OS.

A small program to reproduce my problem:

#include <stdio.h>
#include <time.h>

// Read the monotonic clock.
struct timespec now() {
  struct timespec now;
  clock_gettime(CLOCK_MONOTONIC, &now);
  return now;
}

// Difference between two timestamps, in nanoseconds.
long interval_ns(struct timespec tick, struct timespec tock) {
  return (tock.tv_sec - tick.tv_sec) * 1000000000L
      + (tock.tv_nsec - tick.tv_nsec);
}

int main() {
    // sleep(1);
    for (size_t i = 0; i < 10; i++) {
        struct timespec tick = now();
        struct timespec tock = now();
        long elapsed = interval_ns(tick, tock);
        printf("It took %ld ns\n", elapsed);
    }
    return 0;
}

Output when running from within QtCreator:

It took 84 ns
It took 20 ns
It took 20 ns
It took 21 ns
It took 21 ns
It took 21 ns
It took 22 ns
It took 21 ns
It took 20 ns
It took 21 ns

And when running from my shell inside the terminal:

$ ./foo 
It took 407 ns
It took 136 ns
It took 74 ns
It took 73 ns
It took 77 ns
It took 79 ns
It took 74 ns
It took 81 ns
It took 74 ns
It took 78 ns

Things I've tried which didn't make a difference

I have verified that the same binary is called in all cases, and that the nice value is 0 for the program in all cases.

The Question

Why does starting the program from my shell make a difference? Any suggestions on what to try?

Solution

  • Just add more iterations to give the CPU time to ramp up to max clock speed. Your "slow" times are with the CPU at its low-power idle clock speed.

    QtCreator apparently uses enough CPU time to make this happen before your program runs, or else you're compiling + running and the compilation process serves as a warm-up. (vs. bash's fork/execve being lighter weight.)

    See "Idiomatic way of performance evaluation?" for more about doing warm-up runs when benchmarking, and also "Why does this delay-loop start to run faster after several iterations with no sleep?"

    On my i7-6700k (Skylake) running Linux, increasing the loop iteration count to 1000 is sufficient to get the final iterations running at full clock speed, even after the first couple iterations handling page faults, warming up the iTLB, uop cache, data caches, and so on.
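
    Concretely, that's just the loop from the question with the bound raised, no other changes:

    for (size_t i = 0; i < 1000; i++) {   // was 10
        struct timespec tick = now();
        struct timespec tock = now();
        long elapsed = interval_ns(tick, tock);
        printf("It took %ld ns\n", elapsed);
    }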

    $ ./a.out      
    It took 244 ns
    It took 150 ns
    It took 73 ns
    It took 76 ns
    It took 75 ns
    It took 71 ns
    It took 72 ns
    It took 72 ns
    It took 69 ns
    It took 75 ns
    ...
    It took 74 ns
    It took 68 ns
    It took 69 ns
    It took 72 ns
    It took 72 ns        # 382 "slow" iterations in this test run (copy/paste into wc to check)
    It took 15 ns
    It took 15 ns
    It took 15 ns
    It took 15 ns
    It took 16 ns
    It took 16 ns
    It took 15 ns
    It took 15 ns
    It took 15 ns
    It took 15 ns
    It took 14 ns
    It took 16 ns
    ...
    

    On my system, energy_performance_preference is set to balance_performance, so the hardware P-state governor isn't as aggressive as with performance. Use grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference to check it; changing it requires sudo:

    sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;do echo balance_performance > "$i";done'
    

    Even running it under perf stat ./a.out is enough to ramp up to max clock speed very quickly, though; it really doesn't take much. But bash's command parsing after you press return is very cheap: not much CPU work gets done before it calls execve and reaches main in your new process.

    The printf with line-buffered output is what takes most of the CPU time in your program, BTW. That's why it takes so few iterations to ramp up to speed. For example, if you run perf stat --all-user -r10 ./a.out, you'll see user-space core clock cycles per second average only about 0.4 GHz, with the rest of the time spent in the kernel in write system calls.
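
    As an aside (a suggestion beyond what the answer covers): a common way to keep those write system calls out of the measured region is to record the samples into an array and print them after the loop. A minimal sketch of the question's program restructured that way; note the caveat that removing printf from the loop also removes the CPU work that made the ramp-up quick, so expect even more slow iterations before the clock speed ramps up:

    #include <stdio.h>
    #include <time.h>

    // Same timing helpers as the program in the question.
    struct timespec now() {
      struct timespec t;
      clock_gettime(CLOCK_MONOTONIC, &t);
      return t;
    }

    long interval_ns(struct timespec tick, struct timespec tock) {
      return (tock.tv_sec - tick.tv_sec) * 1000000000L
          + (tock.tv_nsec - tick.tv_nsec);
    }

    int main() {
        enum { N = 1000 };
        static long elapsed[N];
        // Measure first: no stdio (and thus no write syscalls) in this loop.
        for (size_t i = 0; i < N; i++) {
            struct timespec tick = now();
            struct timespec tock = now();
            elapsed[i] = interval_ns(tick, tock);
        }
        // Print afterwards, so the kernel time spent in write() doesn't
        // overlap the measured region.
        for (size_t i = 0; i < N; i++)
            printf("It took %ld ns\n", elapsed[i]);
        return 0;
    }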