Checkpoint/Restart

Sun Aug 19 04:56:33 AEST 1990

In article <17543 at ucsd.Edu> gkn at ucsd.Edu (Gerard K. Newman) writes:

   I think it's a bit unfair for every user of a system to have to
   invent a way to do this specific to their particular application.
   In many cases it may not be possible (the above "canned software"
   problem being an example).

I would agree with the above statements if

	a) the effort of creating a programmer/user-transparent
	general-purpose solution was not much more difficult than writing a
	programmer/user-visible application-specific solution,

	b) it was impossible (nearly so) to create application-specific
	solutions to the problem, or

	c) most applications actually needed it.

However, as has been discussed in this and other newsgroups off-and-on over
the past couple of years

	a) it is very hard to solve the general-purpose problem, systems
	like CRAY's checkpoint/restart facility and the University of
	Wisconsin's RU/Condor systems notwithstanding,

	b) for most applications that need such facilities, they aren't
	terribly difficult to write,

	c) very few applications actually need such facilities.

Given the difficulty of adding a general solution to (various flavors of)
Unix, it is probably wiser to do it on a case-by-case basis. It is unlikely
that most of the relatively few applications that need checkpoint/restart
capabilities will need the full range of capabilities that a general
solution would have to account for.

As a common case, consider many scientific applications. They typically read
in a large data set, munch on it in an iterative manner for a long period of
time, then write out another large data set. Checkpointing an application of
this sort is pretty trivial. Just write out the intermediate state of the
computation "every so often". If it must be restarted, it can be directed to
read the checkpointed data, restarting the computation from that point.
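
To make the "every so often" idea concrete, here is a minimal sketch in
Python. Everything in it is an assumption made up for illustration: the
names run and CHECKPOINT_EVERY, the use of pickle for the state, and the
one-line update standing in for the real computation. It is not taken from
any particular package.

```python
import os
import pickle

CHECKPOINT_EVERY = 100  # iterations between checkpoints (illustrative value)


def run(data, total_iters, ckpt_path):
    """Iterate over the data, saving the state every CHECKPOINT_EVERY steps."""
    # If a checkpoint exists, resume from it; otherwise start from the input.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            start, state = pickle.load(f)
    else:
        start, state = 0, list(data)

    for i in range(start, total_iters):
        # Stand-in for one step of the real (expensive) computation.
        state = [0.5 * (x + 1.0) for x in state]

        # Write out the iteration count and intermediate state every so often.
        if (i + 1) % CHECKPOINT_EVERY == 0:
            with open(ckpt_path, "wb") as f:
                pickle.dump((i + 1, state), f)

    return state
```

If the job dies partway through, rerunning it with the same checkpoint path
picks up from the last saved iteration instead of from the beginning.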

If the application crashes during the initial input phase, no expensive
computation has been lost. There's a checkpoint facility in place during the
iterative solve phase. During the final output phase, if an error occurs
(such as a full disk, head crash, or system failure), you fall back to the
last checkpoint during the compute phase (if you can recover it from the
disk).
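
Whether you can recover the checkpoint from the disk depends partly on how
it was written. One common approach, sketched below in Python under my own
assumptions (the function names and the ".tmp" suffix are hypothetical), is
to write the new checkpoint to a temporary file and rename it into place
only once it is complete, so that a crash in mid-write leaves the previous
checkpoint untouched.

```python
import os
import pickle


def save_checkpoint(state, path):
    """Write state to path without ever leaving a half-written checkpoint there."""
    tmp = path + ".tmp"  # hypothetical temporary name next to the real file
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk
    # Rename over the old checkpoint; on POSIX file systems this is atomic.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if there is no checkpoint yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)
```

This does nothing for a head crash, of course; for that you would need the
checkpoint on a different disk.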

Another example is text editors. Most editors I've used over the past
several years (Emacs of several flavors, vi, EDT) have provided some sort
of checkpoint or playback facility. (EDT's playback was fun to watch.)

As to the second point (canned software packages), checkpoint/restart
capabilities should be treated as a competitive advantage of one package
over another. If your vendor(s) don't provide such facilities, and you need
them, lean on the vendor(s). If there's a vendor that does, factor that into
your evaluation. Vendors won't provide it until they realize you need it.
The best way to get them to realize it is with your pocketbook.

--
Skip (montanaro at crdgw1.ge.com)