caching, process, operating-system, cpu-architecture, micro-architecture

Memory loads experience different latency on the same core


I am trying to implement a cache-based covert channel in C, but I noticed something weird. The sender and the receiver share a physical address by mmap()ing the same file with the MAP_SHARED option. Below is the code for the sender process, which flushes an address from the cache to transmit a 1 and loads an address into the cache to transmit a 0. It also measures the latency of a load in both cases:

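// (CYCLES, rdtscp(), clflush(), init_address(), and the globals used
// below are defined elsewhere in the program.)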
// computes latency of a load operation
static inline CYCLES load_latency(volatile void* p) {
    CYCLES t1 = rdtscp();
    load = *((int *)p);
    CYCLES t2 = rdtscp();
    return (t2-t1);
}
void send_bit(int one, void *addr) {

    if(one) {
        clflush((void *)addr);
        load__latency = load_latency((void *)addr);
        printf("load latency = %d.\n", load__latency);
        clflush((void *)addr);
    }
    else {
        x = *((int *)addr);
        load__latency = load_latency((void *)addr);
        printf("load latency = %d.\n", load__latency);
    }
}   
int main(int argc, char **argv) {
    if(argc == 2)
    {
        bit = atoi(argv[1]);
    }
    // transmit bit
    init_address(DEFAULT_FILE_NAME);    
    send_bit(bit, address);
    return 0;
}

The load takes around 0-1000 cycles (from a cache hit at the low end to a cache miss at the high end) when issued by the same process.

The receiver program loads the same shared physical address and measures the latency of a cache hit or a cache miss; its code is shown below:

int main(int argc, char **argv) {

    init_address(DEFAULT_FILE_NAME);
    rdtscp();
    load__latency = load_latency((void *)address);
    printf("load latency = %d\n", load__latency);

    return 0;
}

(I ran the receiver manually after the sender process terminated.)

However, the latency observed in this scenario is very different from the first case: the load takes around 5000-10000 cycles.

Both processes have been pinned to the same core with the taskset command. So, if I'm not wrong, both processes should see L1-cache latency on a cache hit and DRAM latency on a cache miss. Yet the two processes observe very different latencies. What could be the reason for this, and how can I make both processes experience the same latency?


Solution

  • The initial access to an mmapped region will page-fault (lazy mapping/allocation by the kernel), unless you use mmap(MAP_POPULATE), or mlock, or touch some other cache line of the page first.

    You're probably timing a page fault if you only do one timed measurement per mmap, or per run of a whole program.
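
    For instance, a pre-faulting init_address() could look something like the sketch below. (This is only a sketch: the one-page 4 KiB mapping, the missing error handling, and the exact touch offset are assumptions; MAP_POPULATE and the extra touch are the point.)

        // Hypothetical pre-faulting init_address(); assumes the shared
        // file is at least one 4 KiB page. MAP_POPULATE is Linux-specific.
        #include <fcntl.h>
        #include <sys/mman.h>

        static void *address;

        void init_address(const char *fname) {
            int fd = open(fname, O_RDWR);
            // MAP_POPULATE asks the kernel to fault the pages in up front.
            address = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_POPULATE, fd, 0);
            // Alternatively (or additionally): touch a *different* cache
            // line of the same page, so the TLB and page tables are warm
            // but the line you're about to time isn't pulled into cache.
            *(volatile char *)((char *)address + 512);
        }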

    (Also, you don't seem to be doing anything to warm up the CPU frequency, so one core cycle could be many reference cycles. Some of the time for an L3 miss is fixed in terms of memory clock cycles, but another part of it scales with the core/uncore clock.)
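
    One way to rule out the frequency ramp is to busy-spin before the first measurement. warm_up_cpu() below is a hypothetical helper, not something from your code:

        // Spin long enough that the core has (very likely) ramped up to
        // full clock speed, so core cycles and TSC reference cycles are
        // comparable by the time anything is measured.
        static void warm_up_cpu(void) {
            volatile unsigned long sink = 0;
            for (unsigned long i = 0; i < 200000000UL; i++)
                sink += i;   // volatile sink keeps the loop from being optimized away
        }

    Calling it at the top of main(), before any timed load, should make the first measurement look much more like later ones.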


    Also note that unless you run the 2nd process immediately (e.g. from the same shell command, something like taskset -c 0 ./sender 1 && taskset -c 0 ./receiver), the OS will get a chance to put that core into a deep sleep. On Intel CPUs at least, that empties L1d and L2 so it can power them down in the deeper C-states. Probably also the TLBs.


    It's also strange that you cast away volatile in load = *((int *)p);
    Assigning the load result to a global(?) variable inside the timed region is also pretty questionable; that store could soft page fault, too. If so, RDTSCP will have to wait for it, because the store can't retire.

    (But on a TLB hit for the store, rdtscp doesn't have to wait for the store to commit to cache, since there's no mfence before it. A store can retire while its data is still in the store buffer. In fact it must retire before the store-buffer entry is known to be non-speculative, and only then can it commit.)
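
    Putting those last two points together, a version of the timing helper that keeps the pointer volatile and moves the store to a global out of the timed region could look like this. (CYCLES and rdtscp() are your existing helpers; sink is a stand-in for your global.)

        static int sink;    // stand-in for the question's global variable

        // Time only the load: keep the pointer volatile and keep the
        // store to the global outside the rdtscp() window.
        static inline CYCLES load_latency(volatile int *p) {
            CYCLES t1 = rdtscp();
            int v = *p;       // volatile load; no cast that discards volatile
            CYCLES t2 = rdtscp();
            sink = v;         // publish the loaded value after timing
            return t2 - t1;
        }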