Zombies

Fri Feb 16 11:52:35 AEST 1990

I wrote this in earlier days, and it probably has screwed up some fine points,
but I think it will explain what's happening:

________________________________________________________________________

1.  Overview of SIGCLD

The SIGCLD (death-of-child) signal is used to notify a
parent that a child process has died.  That is simple
enough.  What is not apparent from the manual pages is how
the parent's handling of death of child affects what happens
to the child process after it has died, and how, together
with the wait() system call, it can be used to obtain useful
information about that child process.

Under normal circumstances, when a child process dies, it
goes into the zombie state: everything has been cleaned up
except the useful information (such as the exit status) kept
in its process table entry.  That entry remains until the
parent collects it with wait(), or until the parent itself
dies, at which point the child is inherited by init, which
cleans the entry out of the process table.
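
A minimal sketch of the zombie state, assuming a POSIX system; the helper name zombie_exit_code is hypothetical. The child exits at once, and a second later its exit code is still sitting in the process table, where a non-blocking waitpid() can collect it.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start a child that exits at once with `code`,
   wait a second so it has almost certainly died, then collect it.
   With WNOHANG, waitpid() returns 0 instead of blocking if the child
   were still alive, so success shows the zombie was there waiting. */
int zombie_exit_code(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                    /* child: die immediately */

    sleep(1);                           /* child is now a zombie */

    int status;
    if (waitpid(pid, &status, WNOHANG) != pid)
        return -1;                      /* not (yet) a zombie */
    /* Collecting it removes the entry from the process table. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling zombie_exit_code(7) should return 7.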

If the parent sets death of child to be ignored
(signal(SIGCLD, SIG_IGN)), then when the child process dies,
it does NOT enter the zombie state.  Instead, its process
table entry is cleaned up immediately.
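
A sketch of the ignored case, assuming a POSIX.1-2001 system (Linux 2.6 and later behave this way), where ignoring the signal means no zombie is kept. It uses sigaction() to set SIG_IGN and maps SIGCLD onto the POSIX name SIGCHLD where the older name is missing; the helper name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SIGCLD
#define SIGCLD SIGCHLD          /* SIGCLD is the System V name for SIGCHLD */
#endif

/* Hypothetical helper: ignore death of child, start a child that exits
   at once, then try to collect it.  Returns the errno from waitpid(),
   which should be ECHILD because no zombie was ever kept. */
int errno_after_ignored_child(void)
{
    struct sigaction ign, old;
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    ign.sa_flags = 0;
    sigaction(SIGCLD, &ign, &old);

    int result = 0;
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);                       /* child: die immediately */
    if (pid > 0) {
        int status;
        /* Blocks until the child is gone, then fails: nothing to collect. */
        if (waitpid(pid, &status, 0) == -1)
            result = errno;
    }

    sigaction(SIGCLD, &old, NULL);      /* restore the previous disposition */
    return result;
}
```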

2.  Overview of WAIT

The wait() system call gathers information about zombie
processes from the process table.  If the caller has no
children at all, the call returns -1 (with errno set to
ECHILD).  If children exist but none has yet entered the
zombie state, the call blocks.  If a child has entered the
zombie state, two things are returned:

 1.  The process id of the zombie.

 2.  Information about why the process terminated (its exit
     code, or the signal that killed it).

Along with this, the zombie is cleaned out of the process
table.

If more than one zombie exists, then successive calls to
wait() will retrieve the information for those zombies.
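
The sequence above can be sketched as follows, assuming POSIX wait() and the standard status macros WIFEXITED and WEXITSTATUS; collect_children is a hypothetical name. Each child exits with a different code, successive wait() calls collect them one by one, and the final call fails with ECHILD once no children are left.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork n children that exit with codes 1..n and
   reap them with successive wait() calls.  Returns the sum of the exit
   codes, or -1 if the last wait() does not fail with ECHILD. */
int collect_children(int n)
{
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                      /* reap whatever was started */
        if (pid == 0)
            _exit(i + 1);               /* child: exit with a distinct code */
    }

    int sum = 0;
    for (;;) {
        int status;
        pid_t pid = wait(&status);      /* one zombie per call */
        if (pid == -1) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            break;                      /* no children left */
        }
        if (WIFEXITED(status))          /* normal exit: status holds the code */
            sum += WEXITSTATUS(status);
        /* WIFSIGNALED(status) and WTERMSIG(status) would instead report
           a child that was killed by a signal. */
    }
    return errno == ECHILD ? sum : -1;
}
```

With n = 3 the children exit with 1, 2 and 3, so collect_children(3) should return 6.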

3.  SIGCLD and WAIT

In light of the above, consider what happens when death of
child is being ignored and wait() has been called with one
live child process, which subsequently dies and becomes a
zombie.  When the signal arrives during the blocked wait(),
the kernel first cleans up that zombie and then wakes up the
wait() call.  The wait() call now sees that there are no
children to wait for, and returns -1.

In the case of two live children, one of which dies and
enters the zombie state, when the signal arrives for the
blocked wait() call, the kernel cleans up that one zombie and
then wakes up wait(), which sees that there is still a live
child to wait for, so it continues to block.

The result is that if death of child is being ignored, the
wait() system call will block until ALL the child processes
are dead, and will then return -1.  (The exception is the
3B4000 running 5.3.1, where the blocked wait() is released
no matter what.)
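
A sketch of the two-child case, assuming POSIX.1-2001 semantics for an ignored SIGCHLD (Linux 2.6 and later behave this way; the 3B4000 behaviour is not modeled). One child dies immediately and the other about a second later; the hypothetical helper measures how long wait() stays blocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#ifndef SIGCLD
#define SIGCLD SIGCHLD          /* SIGCLD is the System V name for SIGCHLD */
#endif

/* Hypothetical helper: with death of child ignored, start one child that
   dies at once and one that lives about a second, then call wait().
   Stores wait()'s errno in *err and returns the milliseconds spent
   blocked in wait(). */
long ms_blocked_in_wait(int *err)
{
    struct sigaction ign, old;
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    ign.sa_flags = 0;
    sigaction(SIGCLD, &ign, &old);

    if (fork() == 0)
        _exit(0);                       /* short-lived child */
    if (fork() == 0) {
        sleep(1);                       /* long-lived child */
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* The first death does not release wait(); it returns -1 only
       once both children are gone. */
    int r = wait(NULL);
    *err = (r == -1) ? errno : 0;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    sigaction(SIGCLD, &old, NULL);      /* restore the previous disposition */
    return (long)(t1.tv_sec - t0.tv_sec) * 1000L
         + (t1.tv_nsec - t0.tv_nsec) / 1000000L;
}
```

On such a system the returned time should be close to one second and *err should be ECHILD.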

Catching death of child is even more useful:

EX:	signal(SIGCLD, function_ptr);

When a child dies, a zombie appears in the process table, the
kernel sends death of child to the parent, and the signal
handling function is invoked.

The idea is that, in that signal handling function, a call to
wait() can be made to find out which process died, and why.
In this way, child processes may be monitored and restarted
if necessary.  Notice that you don't have to be ready, or
synchronize this in any way: you can take your time in
issuing the wait() call, since the zombie will hang around
until you are ready to process it.
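
A sketch of a catching handler, assuming a POSIX system. It uses sigaction() rather than the signal() call shown above, because signal()'s semantics for SIGCLD differ between System V and BSD-derived systems. Since several deaths can be delivered as one signal, the handler loops on waitpid() with WNOHANG. The names reaped, on_child_death and reap_with_handler are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SIGCLD
#define SIGCLD SIGCHLD          /* SIGCLD is the System V name for SIGCHLD */
#endif

static volatile sig_atomic_t reaped;    /* zombies collected so far */

/* Death-of-child handler.  Several children may die before it runs, so
   it keeps calling waitpid() until no zombie is left. */
static void on_child_death(int sig)
{
    int saved_errno = errno;            /* don't clobber the main code's errno */
    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped++;
    errno = saved_errno;
}

/* Hypothetical helper: start n children, then sleep until the handler
   has collected all of them.  Returns the number of children reaped. */
int reap_with_handler(int n)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    int started = 0;

    sa.sa_handler = on_child_death;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGCLD, &sa, &old_sa);

    /* Block the signal while forking, so a death can't slip in between
       the check of `reaped` and the call to sigsuspend(). */
    sigemptyset(&block);
    sigaddset(&block, SIGCLD);
    sigprocmask(SIG_BLOCK, &block, &old_mask);
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGCLD);

    reaped = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);                   /* child: die at once */
        if (pid > 0)
            started++;
    }

    while (reaped < started)
        sigsuspend(&wait_mask);         /* sleep with the signal unblocked */

    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGCLD, &old_sa, NULL);   /* restore the previous handler */
    return reaped;
}
```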

*******

I have a utility function that uses this to provide a sort of
private 'inittab'.  It will fork/exec a process (or fork only,
to run a plain function) and monitor these children,
respawning them if they die.  (This also makes the children
effectively immune to kill -9: the kernel, not the dying child,
notifies the parent, so even a child killed by an uncatchable
signal gets respawned.)
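
This is not the utility described above, only a hypothetical minimal sketch of the respawn idea under a POSIX system. For brevity it waits with a blocking waitpid() instead of a SIGCLD handler, and it treats a normal exit as "finished" and a death by signal as a reason to respawn. The names supervise, killed_task and clean_task are illustrative.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical supervisor: run task() in a forked child and respawn it
   each time it is killed by a signal, up to max_restarts times.  A normal
   exit is treated as "finished".  Returns the number of respawns made,
   or -1 on a fork() or waitpid() failure. */
int supervise(void (*task)(void), int max_restarts)
{
    int restarts = 0;
    for (;;) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            task();
            _exit(0);                   /* task returned: normal exit */
        }

        int status;
        pid_t r;
        do
            r = waitpid(pid, &status, 0);
        while (r == -1 && errno == EINTR);
        if (r == -1)
            return -1;

        /* The kernel reports the death even for SIGKILL, which the child
           cannot catch, so a kill -9 just leads to a respawn. */
        if (!WIFSIGNALED(status) || restarts == max_restarts)
            return restarts;
        restarts++;
    }
}

/* Demo tasks for trying the supervisor out (also hypothetical). */
void killed_task(void) { kill(getpid(), SIGKILL); pause(); }
void clean_task(void)  { }
```

Here supervise(killed_task, 3) runs the task four times and returns 3, while supervise(clean_task, 3) returns 0 after a single run.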

If you're interested, I'll send a copy.

GB