c linux linux-kernel clock vdso

How does clock_gettime actually work in the kernel?


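#include <stdint.h>
#include <time.h>

#define NSEC 1000000000LL  /* nanoseconds per second */

/* Time elapsed between two back-to-back clock_gettime() calls, in ns. */
uint64_t getTimeLatencyNs(void) {
    struct timespec ts1;
    struct timespec ts2;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts2);
    return (uint64_t)((ts2.tv_sec - ts1.tv_sec) * NSEC + ts2.tv_nsec - ts1.tv_nsec);
}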

The code above and existing questions show that we can call clock_gettime() every few nanoseconds and get a different value each time.

The usual explanation is: due to the vDSO, there's no syscall involved and that's why it's quick. To my understanding, the vDSO only eliminates the (significant) overhead stemming from the syscall. But there's more happening in the background.
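To get a feel for how much of the speed comes from skipping the syscall, one rough experiment is to read the same clock twice: once through the normal libc call and once forced through syscall(2). This is only a sketch, assuming a 64-bit Linux system; the helper names (ts_to_ns, now_ns_libc, now_ns_syscall, avg_cost_ns) are made up for illustration, and whether the libc path is really served by the vDSO depends on the kernel version and the active clocksource.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for syscall() */
#endif
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Convert a timespec to nanoseconds. */
static int64_t ts_to_ns(const struct timespec *ts) {
    return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/* Normal libc call; on many Linux systems this goes through the vDSO. */
static int64_t now_ns_libc(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    assert(rc == 0);
    return ts_to_ns(&ts);
}

/* Same clock, but forced through a real system call. */
static int64_t now_ns_syscall(void) {
    struct timespec ts;
    long rc = syscall(SYS_clock_gettime, CLOCK_MONOTONIC_RAW, &ts);
    assert(rc == 0);
    return ts_to_ns(&ts);
}

/* Average cost in ns of one call to fn, measured over iters calls. */
static double avg_cost_ns(int64_t (*fn)(void), int iters) {
    assert(iters > 0);
    int64_t start = now_ns_libc();
    for (int i = 0; i < iters; i++)
        (void)fn();
    return (double)(now_ns_libc() - start) / iters;
}
```

Comparing avg_cost_ns(now_ns_libc, 1000000) with avg_cost_ns(now_ns_syscall, 1000000) on the same machine gives a rough estimate of what the vDSO path saves. It says nothing yet about what happens inside that path.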

I want to understand what exactly happens upon calling clock_gettime(), to be able to reason about the temporal behavior of the function call. Assume I already understand (or am able to google) adjacent concepts like time synchronization (PTP, NTP), RTC or different clocks (e.g. CLOCK_MONOTONIC).

Jiffies (how it always worked)

I found a description of the Jiffies timing mechanism by Ingo Molnar¹ from 2006. In summary, there's a value stored in memory, which is incremented every 1/CONFIG_HZ seconds. Calls to gettimeofday()/clock_gettime() simply read that memory address. On typical systems, the update happens every 1 or 4 milliseconds (i.e. the clock resolution was a few milliseconds), see man 7 time. But I can observe nanosecond resolution in the example above. So Jiffies is not how it works on my system.
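The tick-based resolution described here can be compared against what the kernel reports for a given clock. The following is a minimal sketch, assuming Linux with glibc; clock_resolution_ns and tick_period_ns are hypothetical helper names. clock_getres() is the standard POSIX call. On Linux, the tick-driven CLOCK_MONOTONIC_COARSE clock is expected to report roughly one tick period, while CLOCK_MONOTONIC typically reports 1 ns when high-resolution timers are enabled.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_getres() */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Reported resolution of a clock in ns, or -1 if the clock is unsupported. */
static int64_t clock_resolution_ns(clockid_t id) {
    struct timespec res;
    if (clock_getres(id, &res) != 0)
        return -1;
    return (int64_t)res.tv_sec * 1000000000LL + res.tv_nsec;
}

/* Tick period in ns for a given CONFIG_HZ value (e.g. 1000 or 250). */
static int64_t tick_period_ns(int hz) {
    assert(hz > 0);
    return 1000000000LL / hz;
}
```

For example, tick_period_ns(1000) is 1000000 (1 ms) and tick_period_ns(250) is 4000000 (4 ms), which matches the "1 or 4 milliseconds" above. Comparing clock_resolution_ns(CLOCK_MONOTONIC_COARSE) with clock_resolution_ns(CLOCK_MONOTONIC) shows which regime each clock is in.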

hrtimer (feature since Kernel 2.x)

Since Linux 2.6.21, CONFIG_HIGH_RES_TIMERS enables hrtimers to achieve a higher resolution, according to man 7 time. On my system the config option is set to yes. So I read up on hrtimer.

Thomas Gleixner is often credited for the high-resolution kernel timer subsystem. But the hrtimer docs say hrtimers are not used for clocks, although they mention plans to implement that. So far, skimming the clock_gettime() kernel source, I have not found any use of hrtimer.

While this subsystem does not offer high-resolution clock sources just yet, the hrtimer subsystem can be easily extended with high-resolution clock capabilities,

Baeldung (explanation from 2024)

Searching further, the Baeldung article "Understanding Timekeeping and Clocks in Linux" from 2024 explains that the vDSO allows calling clock_gettime() without the syscall overhead, by memory-mapping the function from kernel space into user space. It also has an interesting quote:

the vDSO can introduce latency spikes when the kernel updates shared memory areas with clock counters. This situation is more likely to occur with clocks accelerated by the vDSO.

However, their talk of latency spikes caused by updates to shared memory sounds to me like they are describing Jiffies. Similarly, this answer claims that clock_gettime() outliers are caused by updates to shared memory, hitting a specific do ... while(unlikely...) case. I'm uncertain about that explanation, because the average call in that question takes a few nanoseconds, while the outliers lie in the microsecond range. Another run of that loop should only take nanoseconds as well.

TSC register (why hardware is precise)

The question How is the microsecond time of linux gettimeofday() obtained and what is its accuracy? got an answer which says that on recent hardware clock_gettime() reads from the constantly increasing TSC register, and also explains why it is constantly increasing. I assume that on my machine this is what is used in the end to achieve the high resolution, because it's a relatively recent x86 machine. There are similar registers on other platforms, e.g. ARM's CCNT, but shared code is probably wrapped around the register access anyway.

There's a question from 2011 that mentions "hpet", but Wikipedia says it has some problems and is not used for clocks anymore, since TSC is running at constant speed in today's processors.

Blog (detailed explanation from 2013)

The most helpful to me is this explanation of the clock_gettime kernel sources from 2013.

From what I gather, there's still a shared memory location used when hrtimers are enabled. Depending on the kernel config, the shared memory is updated regularly either via Jiffies (CONFIG_HZ) or via a timer interrupt (CONFIG_NO_HZ). Then vgetsns() is called to increase the granularity, which I assume means it reads the aforementioned TSC register. On my system NO_HZ is set, but so is HZ=1000, so I don't fully understand this yet.

$ cat /boot/config | grep _HZ
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
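The "periodically updated base value plus a finer-grained counter delta" idea from the paragraph above can be sketched in a few lines. This is not the kernel's code: struct clock_snapshot, its field names and snapshot_to_ns are hypothetical, and the multiplier/shift pair is simply the usual fixed-point way of converting counter cycles to nanoseconds without a division.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot published at each periodic update.
 * Field names are illustrative, not the kernel's. */
struct clock_snapshot {
    uint64_t base_ns;     /* time in ns at the last update */
    uint64_t cycle_last;  /* hardware counter value at the last update */
    uint32_t mult;        /* fixed-point cycles-to-ns multiplier */
    uint32_t shift;       /* fixed-point shift belonging to mult */
};

/* ns = base_ns + ((cycles_now - cycle_last) * mult) >> shift */
static uint64_t snapshot_to_ns(const struct clock_snapshot *s,
                               uint64_t cycles_now) {
    assert(cycles_now >= s->cycle_last);
    uint64_t delta = cycles_now - s->cycle_last;
    return s->base_ns + ((delta * s->mult) >> s->shift);
}
```

As a worked example, a counter running at 2 GHz advances 0.5 ns per cycle, which mult = 512 and shift = 10 express exactly (512 / 1024). With base_ns = 1000 and cycle_last = 100, a counter reading of 1100 gives 1000 + ((1000 * 512) >> 10) = 1500 ns. The periodic update only has to refresh the snapshot; the nanosecond granularity comes from the counter delta.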

Summary

As far as I understand, a call to clock_gettime() normally takes only a few nanoseconds because of the vDSO mechanism. Still, there are occasional outliers, which prompted a few questions on the behavior of clock_gettime() and its timing. The answers to those questions explain certain aspects of it and sometimes contradict each other. Hardware-wise, the high resolution is achieved by making the TSC register reliably count clock cycles. I couldn't find a recent explanation of how clock_gettime() works conceptually. Before digging into the kernel source and/or ftrace, I thought it was a fair time to ask the community.

So how does clock_gettime() actually work, especially regarding the timing and happenings "behind" the vDSO "curtain"?


Correcting any wrong assumptions of mine is highly welcome. If needed, I can draw a picture of my understanding.

¹ In the best manner of kernel mailing lists, it starts with the words "[Previous mail is] Completely wrong!"


Solution

  • Your code calls clock_gettime():

    The data obtained via __arch_get_vdso_data() is where the vDSO really comes into play: __vdso_clock_gettime64 does not itself "use" a time source. It only reads preset values stored in vdso_data. Those values are set in update_vsyscall() https://elixir.bootlin.com/linux/v6.14.7/source/kernel/time/vsyscall.c#L78 , which is called from the timekeeping code https://elixir.bootlin.com/linux/v6.14.7/source/kernel/time/timekeeping.c#L687 .
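The split between a kernel-side writer (update_vsyscall()) and a user-space reader of shared data is commonly handled with a sequence counter, which would also account for the do ... while retry loop mentioned in the question. Below is a minimal sketch of that general pattern, not the kernel's implementation: struct shared_time, publish_time and read_time are hypothetical names, a single writer is assumed, and plain sequentially consistent C11 atomics stand in for the cheaper barriers real code would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared data page. */
struct shared_time {
    atomic_uint seq;          /* odd while an update is in progress */
    _Atomic uint64_t sec;
    _Atomic uint64_t nsec;
};

/* Writer side (role of the periodic update). Single writer assumed. */
static void publish_time(struct shared_time *st, uint64_t sec, uint64_t nsec) {
    assert(nsec < 1000000000ULL);
    unsigned s = atomic_load(&st->seq);
    atomic_store(&st->seq, s + 1);   /* odd: update in progress */
    atomic_store(&st->sec, sec);
    atomic_store(&st->nsec, nsec);
    atomic_store(&st->seq, s + 2);   /* even: update complete */
}

/* Reader side: retry if an update was in progress or happened meanwhile. */
static void read_time(struct shared_time *st, uint64_t *sec, uint64_t *nsec) {
    unsigned before, after;
    do {
        before = atomic_load(&st->seq);
        *sec = atomic_load(&st->sec);
        *nsec = atomic_load(&st->nsec);
        after = atomic_load(&st->seq);
    } while ((before & 1u) || before != after);
}
```

The reader never blocks the writer and never takes a lock; it only repeats its reads in the rare case where it overlaps with an update, so the common path costs a handful of memory loads.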