I am doing a checkpoint and restore using CRIU. After the restore, my application wakes with some threads whose stacks are mmapped to files on disk (CRIU doesn't do this by default; it is a custom optimization). Later on, I want to transparently replace each such mapping with anonymous memory: allocating a new region, copying the contents over, and finally calling mremap to move it to the original address.
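For concreteness, the replacement step could look roughly like the sketch below. It assumes Linux with glibc, a page-aligned region that is readable and writable, and that nothing writes to the region during the call (which is exactly the open problem). The names replace_with_anonymous and demo_replace are made up for illustration, and the demo uses an ordinary file mapping as a stand-in for a thread stack; a real stack may also need flags such as MAP_STACK or MAP_GROWSDOWN, which are not handled here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: replace the mapping at [addr, addr + len) with an
 * anonymous copy of its contents. Returns 0 on success, -1 on failure.
 * Assumes the region is readable/writable and quiescent during the call. */
int replace_with_anonymous(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert((uintptr_t)addr % page == 0 && len % page == 0);

    /* 1. Allocate fresh anonymous memory wherever the kernel likes. */
    void *tmp = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tmp == MAP_FAILED)
        return -1;

    /* 2. Copy the current contents. Any concurrent write is lost here. */
    memcpy(tmp, addr, len);

    /* 3. Move the copy over the original address. MREMAP_FIXED unmaps
     *    whatever was mapped there as part of the same call. */
    if (mremap(tmp, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, addr) == MAP_FAILED) {
        munmap(tmp, len);
        return -1;
    }
    return 0;
}

/* Demo on a one-page file mapping (a stand-in for a thread stack).
 * Returns 0 if the contents survive and later writes no longer reach
 * the file. Assumes /tmp is writable. */
int demo_replace(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/stack-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memset(p, 'A', len);

    int rc = replace_with_anonymous(p, len);
    if (rc == 0) {
        char c = 0;
        p[0] = 'B'; /* lands in anonymous memory, not in the file */
        if (p[1] != 'A' || pread(fd, &c, 1, 0) != 1 || c != 'A')
            rc = -1;
    }
    munmap(p, len);
    close(fd);
    return rc;
}
```

Calling demo_replace() should return 0 on a typical Linux system; the race between steps 2 and 3 is not addressed by this sketch.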
However, there's a glitch in this approach: if the threads start mutating the stack while I copy it over, I could break the application. Ideally, I would trap such writes using userfaultfd, but it's not possible to register a file-mapped memory region with it. Even if I introduced some mutex into those threads, there's no way to tell that a thread is really parked and won't mutate its stack until I wake it up.
I am thinking of using mprotect to make the region read-only and handling SIGSEGV. Or is there a better approach? Have the process ptrace itself?
The only alternative I have come up with that I would trust is for the main thread to use ptrace to force the others to stop, and then to resume them when that is safe. You seem to already be aware of this option, so I will not go into details, except to note that on Linux a thread cannot ptrace-attach to other threads in its own thread group, so in practice the tracer would have to be a separate process, such as a forked helper. The main objective here is to preemptively suspend the activity of the affected threads while their stacks are being copied, which seems far less risky than approaches that let them keep running.
The alternative presented in the question is to use mprotect to trap the threads' attempts to modify data on their stacks while the copy is being made. I guess the idea is to have a lighter touch, allowing threads to proceed as long as they can do so without modifying their stacks, but I don't think that's plausible or viable. Among other things:
it seems unlikely in general that any thread will be able to do much meaningful work without modifying its stack, so it seems doubtful that there is much gain available in practice.
as I observed in comments, both C and POSIX specify that a program has undefined behavior if a signal handler for SIGSEGV returns normally. Usually, program termination is the only viable alternative, but a sufficiently prepared program might in some cases longjmp() or siglongjmp() out of the handler instead. That could give you a vector for recovery, but only to whatever extent you are prepared to mediate it with special tooling, and only to the extent supported by such tooling.
It is not safe to assume that the trap handler installed by the kernel will have the effect of retrying the failed instruction of your userspace program in the event that a handler for a segfault returns normally. That ranks very high among the implications of the userspace behavior being undefined. If you did observe that effect with a particular combination of hardware and software then that would be no basis for relying on the same thing for different combinations.
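To illustrate the siglongjmp() escape route mentioned above, here is a minimal sketch, assuming Linux and a write that happens at a point the program has prepared for in advance. The names on_segv, probe_write, and demo_trap are hypothetical. Note what it does not do: the faulting write is abandoned rather than retried, so this is no way to transparently pause a thread in the middle of arbitrary code. The protected page here is also not the stack itself; if it were, the handler would additionally have to run on an alternate stack (sigaltstack() with SA_ONSTACK), because the kernel could not write the signal frame to a read-only stack.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf recover_point;
static volatile sig_atomic_t armed;

/* Never returns normally: either jumps back to the prepared recovery
 * point or terminates the process. */
static void on_segv(int sig)
{
    (void)sig;
    if (armed)
        siglongjmp(recover_point, 1);
    _exit(128 + SIGSEGV); /* fault outside a prepared region */
}

/* Attempt a one-byte write to *p. Returns 0 if the write succeeded,
 * 1 if it faulted and control was recovered, -1 on setup failure. */
int probe_write(volatile char *p)
{
    struct sigaction sa, old;
    int trapped;

    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    /* Second argument 1: save the signal mask so that siglongjmp()
     * unblocks SIGSEGV again after leaving the handler. */
    if (sigsetjmp(recover_point, 1) == 0) {
        armed = 1;
        *p = 1;      /* may fault */
        trapped = 0;
    } else {
        trapped = 1; /* arrived here via siglongjmp() */
    }
    armed = 0;
    sigaction(SIGSEGV, &old, NULL);
    return trapped;
}

/* Demo: the same write succeeds on a writable page and is trapped
 * once the page has been made read-only. Returns 0 if both hold. */
int demo_trap(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int before = probe_write(p);
    int prot_rc = mprotect(p, len, PROT_READ);
    int after = (prot_rc == 0) ? probe_write(p) : -1;

    munmap(p, len);
    return (before == 0 && after == 1) ? 0 : -1;
}
```

The recovery point has to be set up by the code that performs the write, which is the kind of special tooling referred to above; a thread interrupted at an arbitrary push or call instruction has no such point to return to.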