[SOLVED] Why do glibc's `system()` and `posix_spawn()` deal with signals?

In the parent process, glibc's current generic POSIX implementation of system():

1.a sets the process-wide disposition of SIGINT and SIGQUIT to "ignore"

1.b and blocks SIGCHLD.

system() has a rather specific use-case:

- the parent is a single-threaded process
- the child is an interactive process
- the parent is non-interactive while the child is running

Think of :read !<cmd> in vim, or shell <cmd> in gdb.

From POSIX 2018:

Ignoring SIGINT and SIGQUIT in the parent process prevents coordination problems (two processes reading from the same terminal, for example) when the executed command ignores or catches one of the signals. It is also usually the correct action when the user has given a command to the application to be executed synchronously (as in the '!' command in many interactive applications). In either case, the signal should be delivered only to the child process, not to the application itself. There is one situation where ignoring the signals might have less than the desired effect. This is when the application uses system() to perform some task invisible to the user. If the user typed the interrupt character ("^C", for example) while system() is being used in this way, one would expect the application to be killed, but only the executed command is killed. Applications that use system() in this way should carefully check the return status from system() to see if the executed command was successful, and should take appropriate action when the command fails.

Blocking SIGCHLD while waiting for the child to terminate prevents the application from catching the signal and obtaining status from system()'s child process before system() can get the status itself.

The last paragraph refers to the two mechanisms for learning that a child has terminated:

- you can have the kernel send a SIGCHLD to the parent when the child exits
- the parent can call waitpid() or similar

It is possible to do both or neither, but unless the parent ignores SIGCHLD or sets SA_NOCLDWAIT for it, the child becomes a zombie process until the parent calls waitpid(). I would therefore consider SIGCHLD an extra option, useful for example for single-threaded processes that don't want to block while waiting for child processes (consider shell background tasks).

Since the parent might have a signal handler installed for SIGCHLD, and that signal handler might call waitpid() or similar before system() internally executes waitpid(), the system() implementation might lose the information about the child exit. Only the first waitpid() receives that information.

Note that we assume the parent process is single-threaded, so blocking SIGCHLD in the calling thread is sufficient.

You can also set sigaction() flags (such as SA_NOCLDWAIT and SA_NOCLDSTOP) on any signal handler you install for SIGCHLD that influence this mess, which is something I have not looked into.

The current Linux-specific implementation of posix_spawn() in glibc blocks all signals in the parent before creating the child and restores the previous mask afterwards.

This is because posix_spawn() uses something like vfork(). From glibc's spawni.c:

/* The Linux implementation of posix_spawn{p} uses the clone syscall directly
   with CLONE_VM and CLONE_VFORK flags and an allocated stack.  The new stack
   and start function solves most the vfork limitation (possible parent
   clobber due stack spilling). The remaining issue are:
   
   1. That no signal handlers must run in child context, to avoid corrupting
      parent's state.
   2. The parent must ensure child's stack freeing.
   3. Child must synchronize with parent to enforce 2. and to possible
      return execv issues.
      
   The first issue is solved by blocking all signals in child, even
   the NPTL-internal ones (SIGCANCEL and SIGSETXID).  The second and
   third issue is done by a stack allocation in parent, and by using a
   field in struct spawn_args where the child can write an error
   code. CLONE_VFORK ensures that the parent does not run until the
   child has either exec'ed successfully or exited.  */

CLONE_VFORK is meant to provide one aspect of the venerable vfork() function. vfork() creates a child process that shares its memory with the parent:

Spawning a new process on Unix systems was originally based on the model of fork()+exec(): First, the parent process calls fork(), which literally creates a "fork" in the execution by spawning off a child process that runs on a copy of the parent's memory and starts execution by returning from fork(). Both threads, the spawner in the parent and the new child thread, return from fork() but operate on different copies of the same memory contents. The child process then does some minor things, like setting up the stdin/stdout/stderr file descriptors, and finally calls exec(). The exec() syscall replaces the child's memory with an image of the executable to be run.
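A tiny sketch of the "different copies of the same memory contents" point. The function name fork_copy_demo is hypothetical. The child changes a local variable and reports its own view through its exit status, while the parent's copy stays untouched.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns parent_value * 10 + child_value, or -1 on error. */
int fork_copy_demo(void)
{
    int value = 1;
    int status = 0;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        value = 2;     /* modifies only the child's copy */
        _exit(value);  /* report the child's view via the exit status */
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    assert(WIFEXITED(status));

    /* The parent still sees 1; the child saw 2. */
    return value * 10 + WEXITSTATUS(status);
}
```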

Since fork() needs to provide a memory copy, it was originally slow and expensive (there were no copy-on-write hardware features). Therefore, vfork() was provided: here, the parent and child process operate on the same memory. That is, modifications to memory made by the child affect the parent and vice versa. Because it's dangerous for two threads to operate on the same stack, vfork() blocks the spawner thread (in the parent process) until the child has called either exec() or exit(). The exec() syscall stops the memory sharing; in other words, the child process has its own memory after exec().

On Linux, fork() and vfork() are both special cases of the more general clone() system call (glibc's fork() wrapper calls clone() rather than the legacy fork system call, which still exists on many architectures). The two relevant clone() flags are CLONE_VFORK and CLONE_VM. CLONE_VFORK implements the blocking aspect of vfork(), that is, it blocks the spawner thread until the child calls exec() or exit(). CLONE_VM controls the sharing of memory: if set, both processes operate on the same memory (until exec()); if unset, the child gets a (copy-on-write) clone of the parent's memory.

According to the comment, glibc is mostly concerned about the child clobbering the parent's stack. To avoid this, it allocates a new stack for the child and uses clone() with CLONE_VM | CLONE_VFORK. The clone() syscall takes an additional parameter that sets the child's stack. For vfork() proper, that would be the parent's stack; glibc passes its dedicated allocation instead.
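A minimal sketch of this pattern, assuming Linux with glibc and an architecture whose stack grows downwards. The names child_fn and clone_vfork_demo are hypothetical, and this is not glibc's spawni.c code. The child runs on a separately mapped stack, writes to a global that the parent can read afterwards (CLONE_VM), and the parent only resumes once the child has exited (CLONE_VFORK), at which point it is safe to free the stack.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value;

/* Runs in the child, on the dedicated stack, in the parent's memory. */
static int child_fn(void *arg)
{
    shared_value = *(int *)arg; /* visible to the parent because of CLONE_VM */
    _exit(0);
}

/* Returns the value the child stored, or -1 on error. */
int clone_vfork_demo(int value)
{
    const size_t stack_size = 64 * 1024;
    int status = 0;

    char *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    shared_value = 0;
    /* Pass the top of the mapping: the stack grows downwards here. */
    pid_t pid = clone(child_fn, stack + stack_size,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, &value);
    if (pid < 0) {
        munmap(stack, stack_size);
        return -1;
    }

    /* CLONE_VFORK: we get here only after the child called _exit(),
       so the child no longer uses the stack and it can be released. */
    int seen = shared_value;
    munmap(stack, stack_size);

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    assert(WIFEXITED(status));
    return seen;
}
```

Note that waitpid() is still needed to reap the child; CLONE_VFORK only tells the parent that the shared memory and the stack are no longer in use by the child.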

Any signal handler executed in the child also operates on the memory of the parent, and the state of the child process is a bit strange:

- The threads of the parent process are not part of the same thread group (process), yet they access the same data and locks.
- The thread ID and process ID of the parent do not match the thread-local storage of the child (glibc shares the TLS between parent and child by not setting CLONE_SETTLS).
- The child shares the memory with the parent, but other resources like file descriptors are not shared. File descriptors specifically are only inherited, meaning that closing a file descriptor only affects the child, for example.
- If the parent (thread) receives a signal while waiting for the child in clone(CLONE_VFORK), it will only process that signal after the child exit()s or exec()s. (This is documented in the man page for vfork() but not in the one for clone().)

Signals can be sent to the child directly (for whatever reason), occur within the child (e.g. SIGSEGV or SIGTTIN) or affect an entire process group (e.g. SIGINT). Some of these aren't entirely under the control of the code between clone() and exec() (like SIGINT and SIGTTIN), which explains why signals should not be handled in the child using the parent's existing signal handlers. posix_spawn() therefore resets every signal that has a handler installed to its default disposition in the child (ignored signals stay ignored unless the caller requests otherwise via POSIX_SPAWN_SETSIGDEF).

This still doesn't quite explain why signals are blocked in the parent, though. I assume glibc wants to avoid a race condition where a signal arrives in the child between the clone() and the set-up of the signal dispositions in the child. glibc does restore the signal mask after resetting the signal handlers in the child (and after performing the posix_spawn() file actions). Linux 5.5 added another flag, CLONE_CLEAR_SIGHAND, for clone3(), which at least takes care of resetting the signal dispositions, i.e. uninstalling any custom signal handlers. I wonder if this means you could also get rid of the signal blocking (in the parent).
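From the caller's side, the signal state the child ends up with can be requested through spawn attributes. The following sketch uses a hypothetical helper named spawn_clean. It asks posix_spawn() to give the child default dispositions for SIGINT and SIGQUIT and an empty signal mask, which is roughly the state system() wants for its child. The choice of these two signals and of an empty mask is an assumption made for illustration.

```c
#include <assert.h>
#include <signal.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Spawns path with argv; returns the child's exit code, or -1 on error
   or abnormal termination. */
int spawn_clean(const char *path, char *const argv[])
{
    posix_spawnattr_t attr;
    sigset_t defaults, empty;
    pid_t pid;
    int status = 0;

    assert(path != NULL && argv != NULL);

    if (posix_spawnattr_init(&attr) != 0)
        return -1;

    /* Signals to reset to SIG_DFL in the child. */
    sigemptyset(&defaults);
    sigaddset(&defaults, SIGINT);
    sigaddset(&defaults, SIGQUIT);
    posix_spawnattr_setsigdefault(&attr, &defaults);

    /* Signal mask the child starts with: nothing blocked. */
    sigemptyset(&empty);
    posix_spawnattr_setsigmask(&attr, &empty);

    posix_spawnattr_setflags(&attr,
                             POSIX_SPAWN_SETSIGDEF | POSIX_SPAWN_SETSIGMASK);

    int err = posix_spawn(&pid, path, NULL, &attr, argv, environ);
    posix_spawnattr_destroy(&attr);
    if (err != 0)
        return -1;

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Whatever the parent blocks internally during the call, the mask the new program starts with is the one given via POSIX_SPAWN_SETSIGMASK, or the caller's original mask if that flag is not set.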

Since we have allocated a stack for the child, we also need to free it. It has been allocated in the memory space of the parent, and the child cannot (easily) free it because it runs on that stack right up until exec(), which might even return on failure. So the parent needs to wait until the child is done (exec() successful or exit() called) before releasing the stack allocation. I'm not sure why waitpid() was not used as the synchronization mechanism; the comment in glibc merely states that this synchronization is the reason for using CLONE_VFORK.
