The current POSIX-wide implementation of system() in glibc, for the parent process:
1.a sets the process-wide signal handlers for SIGINT and SIGQUIT to ignore
1.b and blocks SIGCHLD.
The current Linux-specific implementation of posix_spawn() in glibc blocks all signals in the parent process.
What are the reasons for these signal handling manipulations?
- The current POSIX-wide implementation of system() in glibc, for the parent process:
1.a sets the process-wide signal handlers for SIGINT and SIGQUIT to ignore
1.b and blocks SIGCHLD.
system() has a rather specific use case: think of :read !<cmd> in vim, or shell <cmd> in gdb.
From POSIX 2018:
Ignoring SIGINT and SIGQUIT in the parent process prevents coordination problems (two processes reading from the same terminal, for example) when the executed command ignores or catches one of the signals. It is also usually the correct action when the user has given a command to the application to be executed synchronously (as in the '!' command in many interactive applications). In either case, the signal should be delivered only to the child process, not to the application itself. There is one situation where ignoring the signals might have less than the desired effect. This is when the application uses system() to perform some task invisible to the user. If the user typed the interrupt character ("^C", for example) while system() is being used in this way, one would expect the application to be killed, but only the executed command is killed. Applications that use system() in this way should carefully check the return status from system() to see if the executed command was successful, and should take appropriate action when the command fails.
Blocking SIGCHLD while waiting for the child to terminate prevents the application from catching the signal and obtaining status from system()'s child process before system() can get the status itself.
The last paragraph refers to the two mechanisms for learning that a child has terminated: a SIGCHLD handler, and waitpid() or similar.
It is possible to use both or neither, but the child process will (unless configured otherwise, for example by ignoring SIGCHLD or setting SA_NOCLDWAIT) remain a zombie process until the parent calls waitpid(). I would therefore consider SIGCHLD an extra option, useful for example in single-threaded processes which don't want to block while waiting for child processes (consider shell background tasks).
Since the parent might have a signal handler installed for SIGCHLD, and that signal handler might call waitpid() or similar before system() internally executes waitpid(), the system() implementation might lose the information about the child exit. Only the first waitpid() receives that information.
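The "only the first waitpid() receives that information" point can be seen in a small experiment. This is a minimal sketch, not anything taken from glibc; the helper name reap_twice() and its parameters are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`, then try to
   reap it twice. Stores the return value of each waitpid() call and the
   errno of the second one. Returns the exit code seen by the first
   waitpid(), or -1 on failure. */
int reap_twice(int code, pid_t *first, pid_t *second, int *second_errno)
{
    assert(code >= 0 && code <= 255);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                      /* child: terminate immediately */

    int status = 0;
    *first = waitpid(pid, &status, 0);    /* reaps the child */

    errno = 0;
    *second = waitpid(pid, NULL, 0);      /* nothing left to reap */
    *second_errno = errno;

    if (*first == pid && WIFEXITED(status))
        return WEXITSTATUS(status);
    return -1;
}
```

With the default SIGCHLD disposition, the second call should fail with ECHILD. That is the situation system() would be in if a SIGCHLD handler of the application had already reaped its child.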
Note that we assume the parent process is single-threaded, so blocking SIGCHLD on the parent thread is sufficient.
You can also set options on any signal handler you install for SIGCHLD that influence this mess, which is something I have not looked into.
- The current Linux-specific implementation of posix_spawn() in glibc blocks all signals on the parent process.
This is because posix_spawn() uses something like vfork(). From glibc's spawni.c:
/* The Linux implementation of posix_spawn{p} uses the clone syscall directly
with CLONE_VM and CLONE_VFORK flags and an allocated stack. The new stack
and start function solves most the vfork limitation (possible parent
clobber due stack spilling). The remaining issue are:
1. That no signal handlers must run in child context, to avoid corrupting
parent's state.
2. The parent must ensure child's stack freeing.
3. Child must synchronize with parent to enforce 2. and to possible
return execv issues.
The first issue is solved by blocking all signals in child, even
the NPTL-internal ones (SIGCANCEL and SIGSETXID). The second and
third issue is done by a stack allocation in parent, and by using a
field in struct spawn_args where the child can write an error
code. CLONE_VFORK ensures that the parent does not run until the
child has either exec'ed successfully or exited. */
CLONE_VFORK is meant to provide one aspect of the venerable vfork() function.
vfork() creates a child process, but the parent and the child process share the same memory.
Spawning a new process on Unix systems was originally based on the model of fork() + exec():
First, the parent process calls fork(),
which literally creates a "fork" in the execution
by spawning off a child process which runs on a copy of the memory of the parent process
and starts execution by returning from fork().
Both threads, the spawner in the parent and the new child thread, return from fork()
but operate on different copies of the same memory contents.
The child process would then do some minor things like setting up the stdin/stdout/stderr file descriptors
and finally call exec().
That exec() syscall replaces the child's memory with an image of the executable to be executed.
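The fork() + exec() model described above might look like this in code. It is a minimal sketch; fork_then_exec() and the counter variable are hypothetical names, and the shell command "exit 5" is just a stand-in for a real program.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 1;   /* lives in memory that fork() duplicates */

/* Hypothetical demo: the child changes `counter`, redirects stdin and then
   exec()s a shell. Returns the parent's view of `counter` once the child
   has finished; stores the child's exit code in *child_code (-1 on error). */
int fork_then_exec(int *child_code)
{
    assert(child_code != NULL);
    *child_code = -1;

    pid_t pid = fork();
    if (pid < 0)
        return counter;
    if (pid == 0) {
        counter = 99;                     /* modifies the child's copy only */

        /* "Minor things" before exec: point stdin at /dev/null. */
        int fd = open("/dev/null", O_RDONLY);
        if (fd >= 0) {
            dup2(fd, STDIN_FILENO);
            close(fd);
        }
        execl("/bin/sh", "sh", "-c", "exit 5", (char *)NULL);
        _exit(127);                       /* only reached if exec fails */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        *child_code = WEXITSTATUS(status);
    return counter;                       /* still the parent's value */
}
```

The parent keeps seeing its own value of counter, because the child's write went to the child's copy of the memory.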
Since fork() needs to provide a memory copy, it was originally slow and expensive
(there were no copy-on-write hardware features).
Therefore, vfork() was provided: here, the parent and child process operate on the same memory.
That is, modifications to memory done by the child affect the parent and vice versa.
Because it's dangerous for two threads to operate on the same stack,
vfork() blocks the spawner thread (in the parent process) until the child has either called exec() or exit().
The exec() syscall stops memory sharing; in other words,
the child process will have its own memory after exec().
On Linux, fork() and vfork() do not need system calls of their own
(some newer architectures do not provide them at all, and glibc's fork() wrapper is implemented on top of clone()).
Instead, both features are available via flags to the clone() system call: CLONE_VFORK and CLONE_VM.
CLONE_VFORK implements the blocking aspect of vfork(), that is,
it blocks the spawner thread until the child calls exec() or exit().
CLONE_VM controls the sharing of memory:
if set, both processes operate on the same memory (until exec());
if unset, the child will have a (copy-on-write) clone of the memory of the parent.
According to the comment, glibc is mostly concerned about the child clobbering the stack of the parent.
To avoid this, it allocates a new stack for the child and uses clone() with CLONE_VM | CLONE_VFORK.
The clone() syscall provides an additional parameter where you can set the stack of the child.
For vfork() proper, that would be the stack of the parent;
glibc here passes its dedicated allocation.
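A stripped-down illustration of that combination, using the glibc clone() wrapper with a separately allocated stack. This is only a sketch of the mechanism and not glibc's spawni.c: the names child_fn(), clone_vfork_demo() and CHILD_STACK_SIZE are hypothetical, the stack size is arbitrary, and the child exits where a real spawn would exec().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

/* Runs in the child: on the dedicated stack, but in the parent's memory. */
static int child_fn(void *arg)
{
    assert(arg != NULL);
    *(volatile int *)arg = 42;   /* visible to the parent because of CLONE_VM */
    _exit(0);                    /* a real spawn would call exec() here */
}

/* Hypothetical demo: returns the value the child wrote, or -1 on error. */
int clone_vfork_demo(void)
{
    int shared = 0;
    void *stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* The stack grows downwards on common architectures, so pass its top. */
    pid_t pid = clone(child_fn, (char *)stack + CHILD_STACK_SIZE,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, &shared);

    /* Because of CLONE_VFORK we only get here after the child has exited
       (or exec'ed), so the dedicated stack is no longer in use. */
    int result = -1;
    if (pid > 0) {
        int status = 0;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = shared;
    }
    munmap(stack, CHILD_STACK_SIZE);
    return result;
}
```

The parent reads back the child's write without any further synchronization, which shows both the memory sharing (CLONE_VM) and the blocking (CLONE_VFORK). Note that the child here does almost nothing; anything more elaborate runs into the problems with shared state discussed in this answer.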
Any signal handler executed in the child also operates on the memory of the parent, but the state of the child process is a bit strange:
- The threads of the parent process are not part of the same thread group (process) as the child, yet they access the same data and locks.
- The thread-local storage the child uses is the parent's (glibc shares the TLS between parent and child by not setting CLONE_SETTLS), so any thread ID or process ID recorded there describes the parent, not the child.
- The child shares the memory with the parent, but other resources like file descriptors are not shared. File descriptors specifically are only inherited, meaning that closing a file descriptor in the child only affects the child, for example.
If the parent (thread) receives a signal while waiting for the child in clone(CLONE_VFORK), it will only process that signal after the child exit()s or exec()s. (This is documented in the man page for vfork() but not in the one for clone().)
Signals can either be sent to the child directly (for whatever reason), occur within the child (e.g. SIGSEGV or SIGTTIN) or affect an entire process group (e.g. SIGINT). Some of these aren't entirely under the control of the code between clone() and exec() (like SIGINT and SIGTTIN), which explains why the signals should not be handled in the child using the existing signal handlers of the parent.
posix_spawn() therefore restores all signal handlers to their default disposition in the child.
This still doesn't quite explain why signals are blocked in the parent, though.
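I assume glibc wants to avoid a race condition where a signal can arrive in the child in between the clone() and the set-up of the signal handlers in the child.
The child inherits the parent's signal mask, so blocking before clone() means the child starts with every signal blocked from its very first instruction.
glibc does restore the signal mask after resetting the signal handlers in the child (and after performing the posix_spawn() file actions).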
Since Linux 5.5, the kernel has another flag, CLONE_CLEAR_SIGHAND, for clone3(), which at least takes care of resetting the signal dispositions, i.e. uninstalling any custom signal handlers. I wonder if this means you could also get rid of the signal blocking (in the parent).
Since we have allocated a stack for the child, we also need to free it.
And it has been allocated in the memory space of the parent;
the child cannot free it (easily) because it runs on that stack up until exec(),
which might even return on failure.
So the parent needs to wait until the child is done (exec() successful or exit() called)
before releasing the stack allocation.
I'm not sure why waitpid() was not used as the synchronization mechanism;
the comment in glibc merely states that this synchronization is the reason for using CLONE_VFORK.