Checkpoint clarification

David Gundlach david at rolf.stat.uga.edu
Wed Aug 22 12:04:03 AEST 1990


Hello all

I am a co-worker of the guy that got you started on all of this, 
and I would like to clarify (maybe) some stuff.

We are the System Support for the Statistics department at the 
University of Georgia.  We have many professors who know some 
Fortran77 and wish to do number crunching.  They have worked with 
computers for quite a while, but not from a programmer's angle-- 
they simply write the program that will crunch their arrays (or 
whatever).

I am an undergraduate CS major, and Mark has just entered grad 
school here (he got his CS BS).  I don't know much at all about 
the kernel and job control (barely enough to be dangerous), and 
Mark can only put in a few hours a week.  Thus, neither of us can 
take an application and port it to C and write in the checkpointing 
functions.  Unfortunately, not every 'developer of long-running 
applications' (I think Dr. Hutcheson would like that title :-) can 
really be computer literate.

This is why we need something already written.

We got quite a few answers, and Mark's thank you is below.  Our one 
comfort is that these applications, by their very nature, exist almost 
entirely in RAM throughout their execution.  With that, we may be 
able to proceed.


David Gundlach				david at rolf.stat.uga.edu
UGA Statistics				gundlach at csun2.cs.uga.edu
University of Georgia			404/542-3289 or 404/542-5232

"I'm a reasonably good speller, but a lousy typist."	- me

begin included thank you
------------------------

I got a dizzying array of responses to my question that can basically 
be summarized as:
		1) it can't be done
		2) write your own for each job 
		3) use Condor

Of course everyone is right... amazing how slippery is the TRUTH.

Below please find MY (in other words what I think they said)
synthesis of the e-mail I got:

There are a number of SPECIAL problems like network connections, 
unofficial serial devices, etc. that SHOULD never be handled 
by the computer (hand-waving works with people, not computers).

By and large, however, the run of the mill number cruncher with 
input file(s) and output file(s) can be put in suspended  
animation and later awakened without incurring the Wrath of
Khan--er, Kernel. 
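
For the "write your own for each job" route, here is a minimal sketch 
of what hand-rolled checkpointing might look like for a simple number 
cruncher.  It is written in Python purely for illustration (not 
Fortran77, and not anything Condor does internally).  The function 
names, the JSON state file and the sum-of-squares workload are all 
made up for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename replaces the old file in one step, so a crash leaves
    # either the previous checkpoint or the new one, never half of each.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "total": 0.0}


def crunch(values, path, every=1000, stop_after=None):
    """Sum the squares of values, saving progress every `every` items.

    stop_after is a hypothetical knob that simulates the job being
    killed after that many items in this run; it returns None then.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    for i in range(state["next_index"], len(values)):
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated kill: work since last save is lost
        state["total"] += values[i] ** 2
        state["next_index"] = i + 1
        done_this_run += 1
        if state["next_index"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Running crunch() again with the same checkpoint path picks up from the 
last saved index rather than from the start.  The catch, as noted 
above, is that someone has to write this by hand for every job.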

Condor is the slick way of doing this:

		1) it's been written 
		2) it was written well
		3) it works somewhere already (shorty.wisc.cs.edu)

Incidentally Condor is capable of spreading system load around and 
other P9-ishness.

So there you have it.  Thank you for taking some time out of your
busy schedule to reply to me.

			mark

			mth at rolf.stat.uga.edu


