c linux system-calls mmap fcntl

Why can't I create read-only, shared mappings after setting F_SEAL_WRITE?


After calling fcntl(memfd, F_ADD_SEALS, F_SEAL_WRITE);, calls such as mmap(NULL, 4096, PROT_READ, MAP_SHARED, memfd, 0); fail with EPERM. Based on man 2 fcntl, my understanding is that F_SEAL_WRITE only prevents writable shared mappings. Similarly, if I call fcntl while such a read-only mapping exists, it fails with EBUSY, which I'd only expect if the mapping were writable. Why is this happening?

MCVE:

#include <unistd.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/mman.h>

int main(void) {
    void *buf;
    int memfd = syscall(SYS_memfd_create, "foo", 2 /* MFD_ALLOW_SEALING */);
    ftruncate(memfd, 4096);
    buf = mmap(NULL, 4096, PROT_READ, MAP_SHARED, memfd, 0);
    fcntl(memfd, 1033 /* F_ADD_SEALS */, 8 /* F_SEAL_WRITE */); // fails with EBUSY while the read-only shared mapping exists
    munmap(buf, 4096);
    fcntl(memfd, 1033 /* F_ADD_SEALS */, 8 /* F_SEAL_WRITE */);
    buf = mmap(NULL, 4096, PROT_READ, MAP_SHARED, memfd, 0); // fails with EPERM once the seal is set
    return 0;
}

When run under strace (on Linux 4.4.0-135-generic from Ubuntu 16.04), it produces this:

memfd_create("foo", MFD_ALLOW_SEALING)  = 3
ftruncate(3, 4096)                      = 0
mmap(NULL, 4096, PROT_READ, MAP_SHARED, 3, 0) = 0x7fd9a9865000
fcntl(3, F_ADD_SEALS, F_SEAL_WRITE)     = -1 EBUSY (Device or resource busy)
munmap(0x7fd9a9865000, 4096)            = 0
fcntl(3, F_ADD_SEALS, F_SEAL_WRITE)     = 0
mmap(NULL, 4096, PROT_READ, MAP_SHARED, 3, 0) = -1 EPERM (Operation not permitted)

Solution

  • This was a Linux kernel bug. It is fixed in Linux 6.7 and newer by commits e8e17ee90eaf ("mm: drop the assumption that VM_SHARED always implies writable"), 28464bbb2ddc ("mm: update memfd seal write check to include F_SEAL_WRITE"), and 158978945f31 ("mm: perform the mapping_map_writable() check after call_mmap()").