c++ linux mmap

Using mmap together with pwrite


Assume a C/C++ Linux application that has a log file of a fixed size and two threads that operate on this log file: Producer and Consumer. The Producer thread produces large portions of data that must be persisted in the log file as a contiguous block. The Consumer thread reads the data randomly from that log file (until the data from the log file is moved to some long-term storage).

I want the Producer thread to write data to the log with pwrite and fsync because, according to numerous sources, this is quite a common practice for keeping a log. At the same time, I want the Consumer thread to read the log file through mmap, because the data is read randomly and I want to speed up reads through madvise.

The question is: does the operating system (Linux, OS X or any other POSIX-compliant OS) guarantee that if a page of the mmap'ed file is currently loaded in memory, it will be automatically updated, or at least invalidated, when the Producer thread updates the corresponding block in the log file through pwrite? If not, what should I do to let the Consumer thread see the updated data immediately after the update?


Solution

  • A similar question was recently raised in the context of https://github.com/etcd-io/bbolt/pull/989#discussion_r2155315446. The man pages don't answer it directly, but, as was already noted in the comments here and as common sense suggests, pwrite() should be synchronized with the page cache. The most interesting question in this context is how exactly that happens and what can be expected regarding synchronization between the reader and writer threads. As this simple (and somewhat crude) test suggests, not a lot can be expected:

    #include <fcntl.h>
    #include <pthread.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>
    
    size_t const pageSize = 4096;
    char *mm;
    
    void *reader(void *arg)
    {
            for (;;) {
                    int i;
                    char first, next;
    
                    for (i = 0; i < pageSize; i++) {
                            if (i == 0) {
                                    first = mm[i];
                            } else {
                                    next = mm[i];
                                    if (next != first) {
                                            printf("reader mismatch, first %hhX, read %hhX at %d\n", first, next, i);
                                            exit(11);
                                    }
                            }
                    }
            }
    }
    
    int main()
    {
            char b[pageSize];
            int fd, r, i;
            ssize_t written;
            pthread_t pth;
    
            /* O_CREAT requires an explicit mode argument. */
            fd = open("data", O_RDWR | O_CREAT, 0644);
            if (fd < 0) {
                    printf("bad fd: %d\n", fd);
                    exit(10);
            }
            r = ftruncate(fd, pageSize);
            if (r < 0) {
                    printf("bad ftruncate: %d\n", r);
                    exit(10);
            }
            mm = mmap(NULL, pageSize, PROT_READ, MAP_SHARED, fd, 0);
            /* mmap() reports failure with MAP_FAILED, not NULL. */
            if (mm == MAP_FAILED) {
                    printf("mmap failed\n");
                    exit(10);
            }
            r = pthread_create(&pth, NULL, reader, NULL);
            /* pthread_create() returns a positive error number on failure. */
            if (r != 0) {
                    printf("bad pthread_create: %d\n", r);
                    exit(10);
            }
            for (i = 0; i < 1000; i++) {
                    memset(b, i%256, pageSize);
                    written = pwrite(fd, b, pageSize, 0);
                    if (written != pageSize) {
                            printf("can't write properly\n");
                            exit(10);
                    }
            }
            printf("Done\n");
            return 0;
    }
    

    It quickly fails with output like reader mismatch, first 4, read 5 at 1478, so there is no atomic page replacement or any other magic: the reader can observe partially written data at any time. This makes the scheme not much different from writing to the mmapped region directly.

    The only thing that changes with O_DIRECT (the test is easy to adapt to it with the buffer alignment changes from "Write error: Invalid argument, when file is opened with O_DIRECT") is that the same mismatch tends to happen earlier, but it can also lead to cases like reader mismatch, first 0, read 3 at 2485.
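
Since the kernel does not synchronize the reader with the writer, the application has to do it itself. One possible approach for an append-style log is sketched below: the producer publishes a "committed" offset only after pwrite() has returned, and the consumer never reads the mapping beyond that offset. This is a minimal sketch, not a definitive implementation. It assumes Linux, where pwrite() and a MAP_SHARED mapping share the same page cache, and it assumes that published blocks are never overwritten. The names run_demo, committed, RECORD_SIZE and RECORD_COUNT are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sizes, chosen only for this sketch. */
enum { RECORD_SIZE = 4096, RECORD_COUNT = 64, LOG_SIZE = RECORD_SIZE * RECORD_COUNT };

static const char *map;           /* read-only MAP_SHARED view of the log */
static _Atomic size_t committed;  /* bytes the producer has published so far */
static atomic_int bad;            /* set if the consumer sees a torn or stale record */
static atomic_int stop;           /* set if the producer gives up early */

/* Consumer: touches only bytes below the published offset. */
static void *consumer(void *arg)
{
        const size_t total = LOG_SIZE;
        size_t seen = 0;

        (void)arg;
        while (seen < total && !atomic_load(&stop)) {
                size_t limit = atomic_load_explicit(&committed, memory_order_acquire);

                assert(limit % RECORD_SIZE == 0);
                if (limit == seen) {
                        sched_yield();  /* nothing new has been published yet */
                        continue;
                }
                for (; seen < limit; seen += RECORD_SIZE) {
                        /* Record k is expected to hold RECORD_SIZE bytes of value k + 1. */
                        char want = (char)(seen / RECORD_SIZE + 1);

                        for (size_t i = 0; i < RECORD_SIZE; i++)
                                if (map[seen + i] != want)
                                        atomic_store(&bad, 1);
                }
        }
        return NULL;
}

/* Returns 0 if every published record was read intact through the mapping. */
int run_demo(const char *path)
{
        static char buf[RECORD_SIZE];
        const size_t total = LOG_SIZE;
        pthread_t th;
        size_t off = 0;
        void *p;
        int fd, rc;

        fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
                return -1;
        if (ftruncate(fd, (off_t)total) != 0) {
                close(fd);
                return -1;
        }
        p = mmap(NULL, total, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                close(fd);
                return -1;
        }
        map = p;
        atomic_store(&committed, 0);
        atomic_store(&bad, 0);
        atomic_store(&stop, 0);
        if (pthread_create(&th, NULL, consumer, NULL) != 0) {
                munmap(p, total);
                close(fd);
                return -1;
        }
        for (; off < total; off += RECORD_SIZE) {
                memset(buf, (int)(off / RECORD_SIZE) + 1, RECORD_SIZE);
                if (pwrite(fd, buf, RECORD_SIZE, (off_t)off) != RECORD_SIZE)
                        break;
                /* Publish only after pwrite() has returned. */
                atomic_store_explicit(&committed, off + RECORD_SIZE, memory_order_release);
        }
        if (off < total)
                atomic_store(&stop, 1);
        pthread_join(th, NULL);
        rc = (off == total && !atomic_load(&bad)) ? 0 : -1;
        munmap(p, total);
        close(fd);
        unlink(path);
        return rc;
}
```

Calling run_demo("/tmp/pwrite_mmap_demo.log") from main is enough to try it. Note that fsync is deliberately left out: it concerns durability on disk, not visibility through the mapping, so a real log would still call it wherever persistence is required.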