x86-64intelcpu-architecturepersistent-memory

What is the latency of `clwb` and `ntstore` on Intel's Optane Persistent Memory?


In this paper, it is written that the 8 bytes sequential write of clwb and ntstore of optane PM have 90ns and 62ns latency, respectively, and sequential reading is 169ns.

But in my test with Intel 5218R CPU, clwb is about 700ns and ntstore is about 1200ns. Of course, there is a difference between my test method and the paper, but the result is too bad, which is unreasonable. And my test is closer to actual usage.

During the test, did the Write Pending Queue of CPU's iMC or the WC buffer in the optane PM become the bottleneck, causing blockage, and the measured latency has been inaccurate? If this is the case, is there a tool to detect it?

#include "libpmem.h"
#include "stdio.h"
#include "x86intrin.h"

//gcc aep_test.c -o aep_test -O3 -mclwb -lpmem

int main()
{
    size_t mapped_len;
    char str[32];
    int is_pmem;
    sprintf(str, "/mnt/pmem/pmmap_file_1");
    int64_t *p = pmem_map_file(str, 4096 * 1024 * 128, PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
    if (p == NULL)
    {
        printf("map file fail!");
        exit(1);
    }
    if (!is_pmem)
    {
        printf("map file fail!");
        exit(1);
    }

    struct timeval start;
    struct timeval end;
    unsigned long diff;
    int loop_num = 10000;

    _mm_mfence();
    gettimeofday(&start, NULL);

    for (int i = 0; i < loop_num; i++)
    {
        p[i] = 0x2222;
        _mm_clwb(p + i);
        // _mm_stream_si64(p + i, 0x2222);
        _mm_sfence();
    }

    gettimeofday(&end, NULL);

    diff = 1000000 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec;

    printf("Total time is %ld us\n", diff);
    printf("Latency is %ld ns\n", diff * 1000 / loop_num);

    return 0;
}

Any help or correction is much appreciated!


Solution

    1. The main reason is repeating flush to the same cacheline is delayed dramatically[1].
    2. You are testing the avg latency instead of best-case latency like the FAST20 papaer.
    3. ntstore are more expensive than clwb, so it's latency is higher. I guess it's a typo in your first paragraph.

    appended on 4.14

    Q: Tools to detect possible bottleneck on WPQ of buffers?
    A: You can get a baseline when PM is idle, and use this baseline to indicate the possible bottleneck.
    Tools:

    1. Intel Memory Bandwidth Monitoring
    2. Reads Two hardware counters from performance monitoring unit (PMU) in the processor: 1) UNC_M_PMM_WPQ_OCCUPANCY.ALL, which counts the accumulated number of WPQ entries at each cycle and 2) UNC_M_PMM_WPQ_INSERTS, which counts how many entries have been inserted into WPQ. And the calculate the queueing delay of WPQ: UNC_M_PMM_WPQ_OCCUPANCY.ALL / UNC_M_PMM_WPQ_INSERTS. [2]

    [1] Chen, Youmin, et al. "Flatstore: An efficient log-structured key-value storage engine for persistent memory." Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020.
    [2] Imamura, Satoshi, and Eiji Yoshida. “The analysis of inter-process interference on a hybrid memory system.” Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops. 2020.