c linux process waitpid

waitpid() function returns ERROR (-1), why?


I'm writing a Linux shell-like program in C.

Among other things, I'm implementing two built-in commands: jobs and history. jobs prints the list of commands currently running in the background. history prints the list of all commands run so far, specifying for each one whether it is RUNNING or DONE.

To implement the two, my idea was to keep a list of commands, mapping each command name to its PID. When jobs or history is called, I run through the list, check which commands are running or done, and print accordingly.

I read online that waitpid(pid, &status, WNOHANG) can tell, given a PID, whether a process is still running or done, without stopping the process. It works well, except for this:

While a program is alive, the function reports it as alive. Once a program is done, the first call reports it as done, and from then on every call with the same PID returns -1 (ERROR).

For example, it looks like this (the & marks a background command):

$ sleep 3 &
$ jobs
sleep ALIVE 
$ jobs  (within the 3 seconds)
sleep ALIVE
$ jobs (after 3 seconds)
sleep DONE
$ jobs 
sleep ERROR
$ jobs 
sleep ERROR
....

Also, this is not influenced by other commands I run before or after; the behavior described above seems to be independent of them.

I read online about various reasons why waitpid might return -1, but I wasn't able to identify the reason in my case. I also tried to find out how to tell which kind of waitpid error it is, again without success.

My questions are:

  1. Why do you think this behavior is happening?
  2. Do you have a solution? (Ideally, it would keep returning DONE.)
  3. If you have a better idea of how to implement the jobs/history commands, it is welcome.

One workaround is to mark the command as DONE as soon as I get "DONE", and not call waitpid on it again before printing it. This would solve the issue, but I would remain in the dark as to WHY this is happening.


Solution

  • You should familiarize yourself with how child processes are handled in Unix environments. In particular, read about zombie processes.

    When a process dies, it enters a 'zombie' state, so that its PID is still reserved and uniquely identifies the now-dead process. A successful wait on a zombie process frees up the process descriptor and its PID. Consequently, subsequent calls to wait on the same PID will fail because there is no longer a child process with that PID: waitpid returns -1 and sets errno to ECHILD. (If a new process is later allocated the same PID, waiting on it would be a logic error.)
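
    To see the effect in isolation, here is a minimal sketch in C. The helper name second_wait_errno is made up for illustration; it forks a child that exits immediately, reaps it, and then calls waitpid on the same PID a second time.

    ```c
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Hypothetical helper: fork a child that exits at once, reap it, then try
     * to reap it again. Returns the errno of the second waitpid(), 0 if that
     * call unexpectedly succeeded, or -1 if fork() or the first wait failed. */
    int second_wait_errno(void)
    {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(0);                          /* child: terminate immediately */

        int status;
        if (waitpid(pid, &status, 0) != pid)   /* first wait reaps the zombie */
            return -1;

        /* The zombie is gone, so this PID is no longer a child of ours. */
        if (waitpid(pid, &status, WNOHANG) == -1) {
            int err = errno;                   /* save before calling anything else */
            fprintf(stderr, "second waitpid: %s\n", strerror(err));
            return err;
        }
        return 0;
    }
    ```

    The second call should fail with ECHILD. This is also how you can tell which kind of waitpid error you are seeing: inspect errno (or call perror) right after the -1.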

    You should restructure your program so that if a wait is successful and reports that a process is DONE, you record that information in your own data structure and never call wait on that PID again.
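
    A minimal sketch of that bookkeeping, assuming a hypothetical struct job and helper names of my own (none of them come from your code). Treating a -1 from waitpid as DONE is also an assumption that you may want to refine:

    ```c
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Hypothetical job record; the names are illustrative only. */
    enum job_state { JOB_RUNNING, JOB_DONE };

    struct job {
        pid_t pid;
        const char *name;
        enum job_state state;
    };

    /* Poll a job without blocking. waitpid() is called only while the job is
     * still marked JOB_RUNNING, so an already reaped PID is never waited on. */
    enum job_state job_poll(struct job *j)
    {
        if (j->state == JOB_DONE)
            return JOB_DONE;                   /* cached result, no system call */

        int status;
        pid_t r = waitpid(j->pid, &status, WNOHANG);
        if (r == j->pid && (WIFEXITED(status) || WIFSIGNALED(status)))
            j->state = JOB_DONE;               /* child was reaped just now */
        else if (r == -1)
            j->state = JOB_DONE;               /* assumption: e.g. ECHILD, nothing left to wait for */
        /* r == 0 means the child is still running; state stays JOB_RUNNING */
        return j->state;
    }

    /* Print every job in the style of the question's "history" output. */
    void print_history(struct job *jobs, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            printf("%s %s\n", jobs[i].name,
                   job_poll(&jobs[i]) == JOB_DONE ? "DONE" : "RUNNING");
    }
    ```

    With this, calling print_history repeatedly keeps printing DONE for a finished command, because the recorded state is returned instead of asking the kernel again.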

    For comparison, once a process is done, the Bourne shell reports it one last time and then removes it from the list of jobs:

    $ sleep 10 &
    $ jobs
    [1] + Running                 sleep 10
    $ jobs
    [1] + Running                 sleep 10
    $ jobs
    [1]   Done                    sleep 10
    $ jobs
    $