Tags: c, linux, x86, microbenchmark, cpu-cycles

clock_gettime takes longer to execute when the program is run from a terminal


I was trying to measure the time for a snippet of code and noticed that the timings were about 50 ns faster when I ran the program from inside my editor, QtCreator, than when I ran it from a bash shell started in a gnome-terminal. I'm using Ubuntu 20.04 as my OS.

A small program to reproduce my problem:

#include <stdio.h>
#include <time.h>

// Read the monotonic clock.
struct timespec now() {
  struct timespec now;
  clock_gettime(CLOCK_MONOTONIC, &now);
  return now;
}

// Difference between two timestamps, in nanoseconds.
long interval_ns(struct timespec tick, struct timespec tock) {
  return (tock.tv_sec - tick.tv_sec) * 1000000000L
      + (tock.tv_nsec - tick.tv_nsec);
}

int main() {
    // sleep(1);
    for (size_t i = 0; i < 10; i++) {
        struct timespec tick = now();
        struct timespec tock = now();
        long elapsed = interval_ns(tick, tock);
        printf("It took %ld ns\n", elapsed);
    }
    return 0;
}

Output when running from within QtCreator:

It took 84 ns
It took 20 ns
It took 20 ns
It took 21 ns
It took 21 ns
It took 21 ns
It took 22 ns
It took 21 ns
It took 20 ns
It took 21 ns

And when running from my shell inside the terminal:

$ ./foo 
It took 407 ns
It took 136 ns
It took 74 ns
It took 73 ns
It took 77 ns
It took 79 ns
It took 74 ns
It took 81 ns
It took 74 ns
It took 78 ns

Things I've tried which didn't make a difference

I have verified that the same binary is called in all cases, and that the nice value is 0 for the program in all cases.

The Question

Why does starting the program from my shell make a difference? Any suggestions on what to try?

Solution

  • Just add more iterations to give the CPU time to ramp up to max clock speed. Your "slow" times are with the CPU at its low-power idle clock speed.

    QtCreator apparently uses enough CPU time to make this happen before your program runs, or else you're compiling + running and the compilation process serves as a warm-up. (vs. bash's fork/execve being lighter weight.)

    See "Idiomatic way of performance evaluation?" for more about doing warm-up runs when benchmarking, and also "Why does this delay-loop start to run faster after several iterations with no sleep?"

    On my i7-6700k (Skylake) running Linux, increasing the loop iteration count to 1000 is sufficient to get the final iterations running at full clock speed, even after the first couple iterations handling page faults, warming up the iTLB, uop cache, data caches, and so on.
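
    Concretely, that's just the loop from the question with the bound raised, no other changes:

    for (size_t i = 0; i < 1000; i++) {   // was 10
        struct timespec tick = now();
        struct timespec tock = now();
        long elapsed = interval_ns(tick, tock);
        printf("It took %ld ns\n", elapsed);
    }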

    $ ./a.out      
    It took 244 ns
    It took 150 ns
    It took 73 ns
    It took 76 ns
    It took 75 ns
    It took 71 ns
    It took 72 ns
    It took 72 ns
    It took 69 ns
    It took 75 ns
    ...
    It took 74 ns
    It took 68 ns
    It took 69 ns
    It took 72 ns
    It took 72 ns        # 382 "slow" iterations in this test run (copy/paste into wc to check)
    It took 15 ns
    It took 15 ns
    It took 15 ns
    It took 15 ns
    It took 16 ns
    It took 16 ns
    It took 15 ns
    It took 15 ns
    It took 15 ns
    It took 15 ns
    It took 14 ns
    It took 16 ns
    ...
    

    On my system, energy_performance_preference is set to balance_performance, so the hardware P-state governor isn't as aggressive as with performance. Use grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference to check it; changing it requires sudo:

    sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;do echo balance_performance > "$i";done'
    

    Even running it under perf stat ./a.out is enough to ramp up to max clock speed very quickly, though; it really doesn't take much. But bash's command parsing after you press return is very cheap: not much CPU work gets done before it calls execve and reaches main in your new process.

    The printf with line-buffered output is what takes most of the CPU time in your program, BTW. That's why it takes so few iterations to ramp up to speed. For example, if you run perf stat --all-user -r10 ./a.out, you'll see user-space core clock cycles per second average only about 0.4 GHz, with the rest of the time spent in the kernel in write system calls.
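
    As an aside (a suggestion beyond what the answer covers): a common way to keep those write system calls out of the measured region is to record the samples into an array and print them after the loop. A minimal sketch of the question's program restructured that way; note the caveat that removing printf from the loop also removes the CPU work that made the ramp-up quick, so expect even more slow iterations before the clock speed ramps up:

    #include <stdio.h>
    #include <time.h>

    // Same timing helpers as the program in the question.
    struct timespec now() {
      struct timespec t;
      clock_gettime(CLOCK_MONOTONIC, &t);
      return t;
    }

    long interval_ns(struct timespec tick, struct timespec tock) {
      return (tock.tv_sec - tick.tv_sec) * 1000000000L
          + (tock.tv_nsec - tick.tv_nsec);
    }

    int main() {
        enum { N = 1000 };
        static long elapsed[N];
        // Measure first: no stdio (and thus no write syscalls) in this loop.
        for (size_t i = 0; i < N; i++) {
            struct timespec tick = now();
            struct timespec tock = now();
            elapsed[i] = interval_ns(tick, tock);
        }
        // Print afterwards, so the kernel time spent in write() doesn't
        // overlap the measured region.
        for (size_t i = 0; i < N; i++)
            printf("It took %ld ns\n", elapsed[i]);
        return 0;
    }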