After calling fcntl(memfd, F_ADD_SEALS, F_SEAL_WRITE), calls like mmap(NULL, 4096, PROT_READ, MAP_SHARED, memfd, 0) fail with EPERM. Based on man 2 fcntl, my understanding of F_SEAL_WRITE is that it only prevents writable shared mappings. Similarly, if I call fcntl while such a read-only mapping exists, it fails with EBUSY, which I'd only expect if the mapping were writable. Why is this happening?
MCVE:
#include <unistd.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/mman.h>

int main(void) {
    void *buf;
    int memfd = syscall(SYS_memfd_create, "foo", 2 /* MFD_ALLOW_SEALING */);
    ftruncate(memfd, 4096);
    buf = mmap(NULL, 4096, PROT_READ, MAP_SHARED, memfd, 0);
    fcntl(memfd, 1033 /* F_ADD_SEALS */, 8 /* F_SEAL_WRITE */); // will fail
    munmap(buf, 4096);
    fcntl(memfd, 1033 /* F_ADD_SEALS */, 8 /* F_SEAL_WRITE */);
    buf = mmap(NULL, 4096, PROT_READ, MAP_SHARED, memfd, 0); // will fail
    return 0;
}
When run under strace (on Linux 4.4.0-135-generic from Ubuntu 16.04), it produces this:
memfd_create("foo", MFD_ALLOW_SEALING) = 3
ftruncate(3, 4096) = 0
mmap(NULL, 4096, PROT_READ, MAP_SHARED, 3, 0) = 0x7fd9a9865000
fcntl(3, F_ADD_SEALS, F_SEAL_WRITE) = -1 EBUSY (Device or resource busy)
munmap(0x7fd9a9865000, 4096) = 0
fcntl(3, F_ADD_SEALS, F_SEAL_WRITE) = 0
mmap(NULL, 4096, PROT_READ, MAP_SHARED, 3, 0) = -1 EPERM (Operation not permitted)
This was a Linux kernel bug, which is now fixed in Linux 6.7 and newer, by commits e8e17ee90eaf ("mm: drop the assumption that VM_SHARED always implies writable"), 28464bbb2ddc ("mm: update memfd seal write check to include F_SEAL_WRITE"), and 158978945f31 ("mm: perform the mapping_map_writable() check after call_mmap()").
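If you need to know at runtime whether the kernel you are on has the fix, one option is to probe for it rather than parse version strings. The following is only a sketch: the helper name ro_shared_map_after_write_seal is made up for illustration, and the fallback #define values are the same constants used in the MCVE above, for headers that do not provide them.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; values match the constants in the MCVE. */
#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING 2U
#endif
#ifndef F_ADD_SEALS
#define F_ADD_SEALS 1033
#endif
#ifndef F_SEAL_WRITE
#define F_SEAL_WRITE 0x0008
#endif

/*
 * Hypothetical helper (not part of any library).
 * Returns  1 if a read-only MAP_SHARED mapping succeeds after F_SEAL_WRITE,
 *          0 if that mmap fails with EPERM (the behavior in the question),
 *         -1 if the probe itself could not be carried out.
 */
int ro_shared_map_after_write_seal(void)
{
    int result = -1;
    int memfd = (int)syscall(SYS_memfd_create, "seal-probe", MFD_ALLOW_SEALING);
    if (memfd < 0)
        return -1;

    if (ftruncate(memfd, 4096) == 0 &&
        fcntl(memfd, F_ADD_SEALS, F_SEAL_WRITE) == 0) {
        void *buf = mmap(NULL, 4096, PROT_READ, MAP_SHARED, memfd, 0);
        if (buf != MAP_FAILED) {
            munmap(buf, 4096);
            result = 1;   /* read-only shared mapping was allowed */
        } else if (errno == EPERM) {
            result = 0;   /* rejected, as in the strace output above */
        }
    }

    close(memfd);
    return result;
}
```

Calling this once at startup (for example from main) and branching on the result lets the rest of the program decide whether it can seal first and map afterwards. Which value it returns depends on the running kernel, so treat the result as an observation about the current system, not a guarantee.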