Checkpoint/Restart (was "no subject - file transmission")

Mike Muuss mike at BRL.MIL
Tue Aug 21 14:25:26 AEST 1990


>> And I remember people bragging about how cheap and small Unix
>> processes were. How things have changed.

UNIX processes still are pretty cheap, compared to more "traditional"
operating systems (like OS/360).  The real source of difficulty
in checkpoint/restart comes from interfaces to "stateful" resources,
like:

*)  Tape drives.  Need to get the right reel back, in the right position.
And hope that no other application or user has modified the tape
in the interval between checkpoint and restart.

*)  Terminals.  All the terminal modes should be saved and restored.
And on restart, what should happen if other processes have come along
in the meantime and started using the terminal?

*)  Network connections.  The system can't keep the connection open
while it's down.  In general, it is not possible for the operating system
to know how to restore the state of a network connection.  Even saving
the entire output stream and re-sending is not likely to have the
right result.

*)  Temporary files.  If the process depends on files in /tmp (which
may or may not be open at the instant that the checkpoint is taken),
and the system has a policy of clearing /tmp on reboot, then trouble
will result.
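
To make the terminal item concrete, here is a minimal sketch (in Python,
using the standard termios module, which wraps the POSIX tcgetattr and
tcsetattr calls) of saving and restoring terminal modes.  The helper
names are hypothetical, and this is not a claim about how any real
checkpoint facility does it.  Note that it only covers the easy half:
it does nothing about some other process having taken over the terminal
in the meantime.

```python
import termios

def save_tty_modes(fd):
    """Return the current terminal attributes for the tty open on fd.

    The result is the list termios.tcgetattr() produces:
    [iflag, oflag, cflag, lflag, ispeed, ospeed, cc].
    """
    return termios.tcgetattr(fd)

def restore_tty_modes(fd, saved):
    """Put back attributes previously returned by save_tty_modes().

    TCSANOW applies the change immediately, without waiting for
    pending output to drain.
    """
    termios.tcsetattr(fd, termios.TCSANOW, saved)
```

At checkpoint time one would call save_tty_modes() on each terminal
descriptor and store the result with the rest of the process image; at
restart, restore_tty_modes() is called after the terminal is reopened.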

Therefore, I assert that it is the state of the I/O system, not the state
of the UNIX processes, that is hard to checkpoint.  Indeed, it is trivial
to checkpoint file pointers, PIDs, and other aspects of the *process*
state.  It isn't too hard to make sure that files have not changed
between checkpoint and restart times.
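
As a rough illustration of the file-pointer point, the sketch below (in
Python, with hypothetical helper names) records the offset of an open
regular file plus enough stat information to notice changes, and on
restart refuses to continue if the file looks different.  It assumes
the path is known to the caller, and it treats device/inode, size, and
modification time as a good-enough change test, which is a heuristic
rather than a guarantee; device numbers in particular need not survive
a reboot on every system.

```python
import os

def checkpoint_file(fd, path):
    """Record what is needed to reopen a regular file at the same offset."""
    st = os.fstat(fd)
    return {
        "path": path,
        "offset": os.lseek(fd, 0, os.SEEK_CUR),  # current file pointer
        "identity": (st.st_dev, st.st_ino),
        "size": st.st_size,
        "mtime_ns": st.st_mtime_ns,
    }

def restore_file(record, flags=os.O_RDONLY):
    """Reopen the file and seek back; raise if it changed since checkpoint."""
    fd = os.open(record["path"], flags)
    st = os.fstat(fd)
    unchanged = (
        (st.st_dev, st.st_ino) == record["identity"]
        and st.st_size == record["size"]
        and st.st_mtime_ns == record["mtime_ns"]
    )
    if not unchanged:
        os.close(fd)
        raise RuntimeError("file changed since checkpoint: " + record["path"])
    os.lseek(fd, record["offset"], os.SEEK_SET)
    return fd
```

The descriptor returned by restore_file() will generally have a
different number than the original, so a real restart would also have
to put it back in the right slot (for example with dup2).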

So, please don't bash the UNIX Process concept.  Checkpoint/restart
in any non-trivial I/O environment is *hard*.

Cray Research has been rather successful in implementing checkpoint/
restart in their UNICOS version of UNIX.  I believe that they have
reported on this work, but offhand I don't have any references.

	Best,
	 -Mike Muuss


