Tags: c++, multithreading, cpu-architecture, atomic, microbenchmark

Delay in atomic variable update reflection across threads


I am interested in exploring the minimum time in which a write to a variable can be observed across threads. For this I am using a global atomic variable and updating it periodically from one thread. Meanwhile, another thread spins and checks for the updated value. Both threads are pinned to separate isolated cores (OS: Ubuntu).

#include <atomic>
#include <chrono>
#include <iostream>
using namespace std;
using namespace std::chrono;

// globals
constexpr int total = 100;
atomic<int64_t> var;

void reader()
{
    int count = 0;
    int64_t tps[total];   // observed latencies, in nanoseconds

    int64_t last = 0;
    while (count < total)
    {
        int64_t send_tp = var.load(std::memory_order_seq_cst);
        auto tp = high_resolution_clock::now();
        int64_t curr = duration_cast<nanoseconds>(tp.time_since_epoch()).count();

        if (send_tp != last)   // a new timestamp has arrived
        {
            last = send_tp;
            tps[count] = curr - send_tp;
            count++;
        }
    }

    for (auto i = 0; i < total; i++)
        cout << tps[i] << endl;
}

void writer()
{
    for (int i = 0; i < total; i++)
    {
        auto tp = high_resolution_clock::now();
        int64_t curr = duration_cast<nanoseconds>(tp.time_since_epoch()).count();
        var.store(curr, std::memory_order_seq_cst);

        // delay ~100 ms between writes, so that none are missed by the reader
        while (duration_cast<nanoseconds>(high_resolution_clock::now() - tp).count() < 100000000)
            ;
    }
}
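
The core-pinning code is not shown above; a minimal sketch of a main() that launches the two threads and pins them with pthread_setaffinity_np (the pin_to_core helper and the core numbers 2 and 3 are illustrative, not from the original program):

#include <pthread.h>
#include <sched.h>
#include <thread>

// Hypothetical helper: pin a std::thread to a given core (Linux-specific).
void pin_to_core(std::thread& t, int core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
}

int main()
{
    std::thread r(reader);
    std::thread w(writer);
    pin_to_core(r, 2);   // assumed isolated core
    pin_to_core(w, 3);   // assumed isolated core
    r.join();
    w.join();
}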

Using this program, I'm getting a median time of around 70 nanoseconds.

I also tried to measure the overhead of the measurement itself:

// Here the store and the load happen on the same core, so no cache-line
// transfer is involved: this measures only clock and atomic-access overhead.
void overhead()
{
    int count = 0;
    int64_t tps[total];

    int64_t last = 0;
    while (count < total)
    {
        auto tp1 = high_resolution_clock::now();
        int64_t to_send = duration_cast<nanoseconds>(tp1.time_since_epoch()).count();
        var.store(to_send, std::memory_order_seq_cst);

        int64_t send_tp = var.load(std::memory_order_seq_cst);
        auto tp = high_resolution_clock::now();
        int64_t curr = duration_cast<nanoseconds>(tp.time_since_epoch()).count();

        if (send_tp != last)
        {
            last = send_tp;
            tps[count] = curr - send_tp;
            count++;
        }
    }

    for (auto i = 0; i < total; i++)
        cout << tps[i] << endl;
}

I know atomics do not have much overhead for single-threaded access, and this loop turned out to have a median of 30 nanoseconds (which I suspect is mostly the cost of the chrono::high_resolution_clock::now() calls).
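
To sanity-check how much of that is the clock itself, one could time back-to-back now() calls in isolation; a standalone sketch, not part of the original program (the loop count of 100 is arbitrary):

#include <chrono>
#include <iostream>
using namespace std::chrono;

int main()
{
    // Cost of one clock read: the gap between two consecutive now() calls.
    for (int i = 0; i < 100; i++)
    {
        auto t1 = high_resolution_clock::now();
        auto t2 = high_resolution_clock::now();
        std::cout << duration_cast<nanoseconds>(t2 - t1).count() << '\n';
    }
}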

So this suggests that the actual inter-thread delay is around 40 nanoseconds (median). I tried different memory orderings, such as memory_order_relaxed and release/acquire, but the results were pretty similar.
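
For reference, the release/acquire variant only changes the ordering arguments on the store and load; everything else stays the same:

// writer side
var.store(curr, std::memory_order_release);

// reader side
int64_t send_tp = var.load(std::memory_order_acquire);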

From my understanding, the synchronization needed is just fetching the cache line from the adjacent core's L1, so why is it taking around 40 nanoseconds? Am I missing something, or are there any suggestions on how the setup can be improved?

Hardware details -

Intel(R) Core(TM) i9-9900K CPU (hyperthreading disabled)

Compiled with: g++ file.cpp -lpthread -O3


Solution

  • 40ns inter-thread latency (including measurement overhead) sounds about right for modern x86 CPUs.

    And yes, storing a timestamp in the writer and checking it against a fresh time measurement in the reader is a reasonable methodology.

    Cache-coherency messages between cores have to go over the ring bus to the L3 slice. When the load request (which missed in L2) reaches the right L3 slice, it will detect from the inclusive L3 tags that another core owns the line in MESI Exclusive or Modified state, and generate a message to that core. That core will then do a write-back (and perhaps send the data directly to the core that requested it?).

    And that's on a desktop CPU where we know there are no other sockets to snoop for coherency: Intel server CPUs have significantly higher memory latency and inter-core latency.
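
    If you want a second data point without timestamps, a classic way to measure the raw core-to-core round trip is an atomic ping-pong between two pinned threads. A minimal sketch, not from the question's program (thread pinning omitted; the iteration count is arbitrary):

    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>

    std::atomic<int> flag{0};
    constexpr int iters = 100000;

    void ping()
    {
        for (int i = 0; i < iters; i++)
        {
            while (flag.load(std::memory_order_acquire) != 0) {}  // wait for pong
            flag.store(1, std::memory_order_release);
        }
    }

    void pong()
    {
        for (int i = 0; i < iters; i++)
        {
            while (flag.load(std::memory_order_acquire) != 1) {}  // wait for ping
            flag.store(0, std::memory_order_release);
        }
    }

    int main()
    {
        auto t0 = std::chrono::high_resolution_clock::now();
        std::thread a(ping), b(pong);
        a.join();
        b.join();
        auto t1 = std::chrono::high_resolution_clock::now();
        // Each iteration is one full round trip, i.e. two cache-line handoffs.
        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count() / iters
                  << " ns per round trip\n";
    }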