I'm writing a Linux shell-like program in C.
Among others, I'm implementing two built-in commands: jobs, history.
In jobs, I print the list of commands currently running in the background.
In history, I print the list of all commands run so far, specifying for each command whether it's RUNNING or DONE.
To implement the two, my idea was to keep a list of commands, mapping each command name to its PID. Once the jobs/history command is called, I run through the list, check which commands are running or done, and print accordingly.
I read online that the function waitpid(pid, &status, WNOHANG) can tell from the PID whether a process is still running or done, without blocking the caller until the process finishes.
It works well, except for this:
While a program is alive, the call reports it as alive. Once the program is done, the first call reports it as done, and every later call with the same PID returns -1 (ERROR).
For example, it would look like this: (the & symbolizes background command)
$ sleep 3 &
$ jobs
sleep ALIVE
$ jobs (within the 3 seconds)
sleep ALIVE
$ jobs (after 3 seconds)
sleep DONE
$ jobs
sleep ERROR
$ jobs
sleep ERROR
....
Also, these results are not influenced by other commands I run before or after; the behavior described above seems to be independent of other commands.
I read online about various reasons why waitpid might return -1, but I wasn't able to identify the reason in my case. I also tried to find out how to tell which kind of waitpid error it is, but again without success.
My questions are:
One solution for this problem is that as soon as I get "DONE", I mark the command as DONE and don't call waitpid on it anymore before printing it. This would solve the issue, but I would remain in the dark as to WHY this is happening.
You should familiarize yourself with how child processes are handled in Unix environments. In particular, read about zombie processes.
When a process dies, it enters a 'zombie' state, so that its PID is still reserved and uniquely identifies the now-dead process. A successful wait on a zombie process frees up the process descriptor and its PID. Consequently, subsequent calls to wait on the same PID will fail because there is no longer a process with that PID (unless a new process is allocated the same PID, in which case waiting on it would be a logical error).
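To see which error the second call reports, you can inspect errno right after waitpid returns -1. Here is a minimal sketch of the reap-then-wait-again sequence; the function name second_wait_errno is made up for illustration, and it assumes the process does not ignore SIGCHLD:

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that exits at once, reaps it, then waits on the same PID
 * again. Returns the errno set by that second waitpid() call, or -1 if
 * fork() itself failed. (Hypothetical helper, for illustration only.) */
int second_wait_errno(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);               /* child: terminate and become a zombie */

    int status;

    /* First wait (blocking): reaps the zombie and releases its PID. */
    pid_t first = waitpid(pid, &status, 0);
    assert(first == pid);

    /* Second wait: this process has no child with that PID anymore. */
    errno = 0;
    pid_t second = waitpid(pid, &status, WNOHANG);
    assert(second == -1);

    return errno;               /* ECHILD: no such child process */
}
```

In a shell you would typically compare errno against ECHILD, or print it with perror("waitpid"), at the point where waitpid returns -1.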
You should restructure your program so that if a wait is successful and reports that a process is DONE, you record that information in your own data structure and never call wait on that PID again.
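One way this could look is sketched below. The names struct job, job_launch and job_poll are hypothetical, not taken from your code, and the sketch treats ECHILD as "already reaped somewhere else", which is an assumption you may want to handle differently:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum job_state { JOB_RUNNING, JOB_DONE };

/* One entry in the shell's own job table (hypothetical layout). */
struct job {
    pid_t pid;
    enum job_state state;
};

/* Start argv[0] in the background.
 * Returns 0 on success, -1 if fork() fails. */
int job_launch(struct job *j, char *const argv[])
{
    assert(j != NULL && argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }
    j->pid = pid;
    j->state = JOB_RUNNING;
    return 0;
}

/* Report the job's state. waitpid() is called only while the job is still
 * marked RUNNING; once DONE is recorded, the PID is never waited on again. */
enum job_state job_poll(struct job *j)
{
    assert(j != NULL);

    if (j->state == JOB_DONE)
        return JOB_DONE;        /* cached result, no system call */

    int status;
    pid_t r = waitpid(j->pid, &status, WNOHANG);
    if (r == j->pid)
        j->state = JOB_DONE;    /* child terminated and was reaped just now */
    else if (r == -1 && errno == ECHILD)
        j->state = JOB_DONE;    /* assumption: it was reaped elsewhere */
    /* r == 0 means the child is still running: leave the state unchanged */

    return j->state;
}
```

Your jobs and history built-ins would then loop over the table and call job_poll on each entry, so repeated invocations keep printing DONE instead of hitting the -1 case.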
For comparison, once a process is done, the Bourne shell reports it one last time and then removes it from the list of jobs:
$ sleep 10 &
$ jobs
[1] + Running sleep 10
$ jobs
[1] + Running sleep 10
$ jobs
[1] Done sleep 10
$ jobs
$