Tags: linux, asynchronous, concurrency, process, waitpid

How can I know if all processes in a process group are collected?


I'm learning about signals and writing a simple shell program. I ran into a problem: the shell needs to wait for a foreground job to end, and a job may consist of multiple processes. It seems that I need to use waitpid to wait for all processes in the job's process group. I first came up with the following code:

pid_t pid;
while ((pid = waitpid(-pgid, NULL, 0)) > 0) {
    // Do some work
}
if (errno == ECHILD) {
    // Now we know all processes in the group have finished
}

But then I started to have doubts: since a process ID can be reused after the process dies, and I think a process group ID can be too, chances are that after the last process in the group is reaped by the while loop and before the next iteration starts, a new process is born whose process group ID is also pgid. In this case, the loop would continue to wait for the new process, although it doesn't actually belong to the original group. In fact, I think waitpid can't solve the problem at all. After the last waitpid, I need to call it again to check whether there are more processes with the same pgid. Between these two calls to waitpid, new processes with the same pgid may be born.

Later, noticing that a child subreaper can be used to deal with children and grandchildren, I came up with the idea of using process A exclusively for the foreground job and process B for all other processes, so that any new processes (belonging to process B) won't interfere with process A's waitpid. I still don't know whether this method is feasible or whether it would cause further problems.

My questions are, 1) is there any simple way to solve the problem? 2) or some complicated method? 3) or, is my concern about the problem unnecessary? 4) or, the problem actually won't happen?


Solution

  • I ran into a problem: the shell needs to wait for a foreground job to end, and a job may consist of multiple processes. It seems that I need to use waitpid to wait for all processes in the job's process group.

    Waiting for every process in the group is not really feasible using POSIX features alone, and POSIX does not specify that a shell should do so.

    Try compiling this program and running it in the foreground via Bash (for example):

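    #include <unistd.h>
    #include <stdlib.h>

    int main(void) {
        pid_t child_pid = fork();

        if (child_pid == 0) {
            // Child: keep running for a while after the parent has exited
            sleep(10);
        }
        // The parent returns immediately; the child returns after its sleep
    }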
    

    Observe that the shell does not wait for the child to finish, but instead returns itself to the foreground and presents a new command prompt almost immediately, while the child (its grandchild) is still running. Some systems, including Linux, do provide features that would allow the shell to wait for all processes in the process group to complete, but that's not what shells actually do.

    Most of the following focuses on this POSIX view of the world.

    I first came up with the following code:

    pid_t pid;
    while ((pid = waitpid(-pgid, NULL, 0)) > 0) {
        // Do some work
    }
    if (errno == ECHILD) {
        // Now we know all processes in the group have finished
    }
    

    Without help, that does not reliably wait for all processes of a process group to terminate, because a process can wait only for its own children, not for its grandchildren or more distant descendants. If a process outlives its parent, it is reparented (ordinarily to pid 1), and the ability and responsibility to collect it pass to its new parent.

    Yet that is typical of what a shell will do. It will collect all of the shell's children that are members of the specified process group. Each child is responsible for collecting its own children, and if any are orphaned then they pass outside the scope of the shell's job control.
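
    As a rough illustration of that pattern, the sketch below puts a job's processes into one process group and then reaps the caller's own children in that group until waitpid() reports ECHILD. The helper names spawn_group() and wait_for_group() are made up for this example, and the children simply exit instead of exec'ing real commands.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork `nprocs` children into one new process group.
 * The first child becomes the group leader, so the pgid equals its pid.
 * Returns the pgid, or -1 if fork() fails. */
pid_t spawn_group(int nprocs)
{
    pid_t pgid = 0;

    for (int i = 0; i < nprocs; i++) {
        pid_t pid = fork();
        if (pid == -1)
            return -1;
        if (pid == 0) {
            setpgid(0, pgid);   /* pgid == 0 means "use my own pid" */
            _exit(0);           /* stand-in for exec'ing a real command */
        }
        if (pgid == 0)
            pgid = pid;
        /* Repeat the call in the parent so the group is set whichever
         * side runs first; an error here is ignored on purpose. */
        (void)setpgid(pid, pgid);
    }
    return pgid;
}

/* Reap the caller's children in process group `pgid` until none remain.
 * Returns the number of children reaped, or -1 on an unexpected error. */
int wait_for_group(pid_t pgid)
{
    assert(pgid > 0);
    int reaped = 0;

    for (;;) {
        if (waitpid(-pgid, NULL, 0) > 0)
            reaped++;               /* one more child collected */
        else if (errno == ECHILD)
            return reaped;          /* no children left in this group */
        else if (errno != EINTR)
            return -1;              /* anything else is a real failure */
    }
}
```

    A shell's launch path might do the equivalent of spawn_group() and then call wait_for_group(pgid) for a foreground job. Processes forked by the job's own processes are not counted, because they are not children of the caller.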

    But then I started suspecting: since a process id can be reused after the process dies, and I think so can a process group id,

    This is true, but moot. Process group ids for new process groups are assigned as the pid of the first process in the group (as far as POSIX is concerned). PGIDs can indeed be reused, because pids can be reused. But neither pid reuse nor pgid reuse is a problem for waitpid(), because the system does not rely on these to determine process parent / child relationships, and waitpid() collects only children of the calling process.

    Moreover, pids are typically assigned in increasing order until they wrap around, so there is ordinarily a considerable delay before reuse.

    chances are that after the last process in the group is reaped by the while loop and before the next iteration starts, a new process is born whose process group ID is also pgid.

    If by "chances are" you mean "there is a chance", then yes. But if you mean "it's likely" then no, not at all. Not even if the pid numbers have wrapped around since the original process group was created. And even if that happens, it's not a problem for your particular code, because the processes in the new process group will not be subject to your waitpid() call, pgid notwithstanding.

    In this case, the loop would continue to wait for the new process, although it actually doesn't belong to the former group.

    No. A process can wait only for its own children. You can restrict the wait to those children that belong to a particular process group, but you cannot wait for processes that are not your children. (Orphans adopted by PID 1 or by a subreaper become children of their new parent, which is why that parent can collect them.)

    In fact, I think the use of waitpid can't solve the problem.

    waitpid() does not have the problem you think it does.

    Later, noticing that a child subreaper is used to deal with children and grandchildren, [...]

    Subreapers are a Linux-specific feature that is not ordinarily leveraged for shell job control (even on Linux). But if you did use it then the most natural mode of doing so would be for your shell to set itself as a subreaper, in which case its orphaned descendants would fall on it to collect. That would allow your original waitpid() to collect all descendants, as you seem to have supposed it would anyway, but would not enable it to collect any processes that were not its descendants, no matter what their pgid.

    My questions are, 1) is there any simple way to solve the problem?

    What you are already doing serves the purpose of collecting all members of the process group of a job launched by the shell, to the extent that shells ordinarily do that.

    or some complicated method?

    On Linux, if you want to collect orphaned descendants as well then you could use prctl() with PR_SET_CHILD_SUBREAPER to make an instance of your shell a subreaper, and then proceed as you already were doing. Not much more complicated, really. But you shouldn't, because that's not how shells ordinarily behave.
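
    A minimal sketch of the subreaper variant, Linux only and shown for illustration rather than as a recommendation. become_subreaper(), spawn_orphaning_job() and reap_group() are hypothetical names; the demo job is a child that forks a grandchild and exits right away, so the grandchild is orphaned for roughly 50 ms.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Linux only: ask that orphaned descendants be reparented to this
 * process instead of pid 1. Returns 0 on success, -1 on failure. */
int become_subreaper(void)
{
    return prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);
}

/* Hypothetical demo job: a child that forks a grandchild and exits at
 * once, leaving the grandchild orphaned for a short while.
 * Returns the job's pgid, or -1 if the first fork() fails. */
pid_t spawn_orphaning_job(void)
{
    pid_t child = fork();
    if (child == -1)
        return -1;
    if (child == 0) {
        setpgid(0, 0);                 /* new group; pgid == own pid */
        if (fork() == 0) {             /* grandchild inherits the pgid */
            struct timespec nap = { 0, 50 * 1000 * 1000 };  /* 50 ms */
            nanosleep(&nap, NULL);
        }
        _exit(0);                      /* child and grandchild end here */
    }
    (void)setpgid(child, child);       /* same call from the parent side */
    return child;
}

/* Reap the caller's children in process group `pgid` until none remain.
 * Returns the number reaped, or -1 on an unexpected error. */
int reap_group(pid_t pgid)
{
    assert(pgid > 0);
    int reaped = 0;

    for (;;) {
        if (waitpid(-pgid, NULL, 0) > 0)
            reaped++;
        else if (errno == ECHILD)
            return reaped;
        else if (errno != EINTR)
            return -1;
    }
}
```

    After become_subreaper() succeeds, reap_group() called with the pgid returned by spawn_orphaning_job() should collect both the child and the orphaned grandchild. Without the prctl() call, the grandchild would be reparented elsewhere and only the child would be collected.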

    3) or, is my concern about the problem unnecessary? 4) or, the problem actually won't happen?

    Your specific concern that your shell might try to wait on processes that it should not is unfounded. pid and / or pgid reuse will not produce such an effect.

    The Glibc manual has a detailed discussion about implementing a job-control shell. You might find it helpful.